Questions for the seminar paper "Physical principles in video diffusion models".
-----------------------------------------------------------------------------------------------
Please send your answers to: sahayr@cs.uni-freiburg.de

1) The paper notes that multiframe models generally perform better than their image-to-video (i2v) variants. Beyond the obvious advantage of temporal information, what deeper reasons related to the models' internal representations or learning mechanisms might explain this consistent performance gap in predicting future physical events? (2-4 sentences)
2) What is the paper's main conclusion about the relationship between "visual realism" and "physical understanding" in current generative video models? (1-2 sentences)
3) According to the paper, what are the significant limitations of current models regarding physical understanding, and what do these limitations imply for future research in generative AI? (2-3 sentences)