A Benchmark for Modeling Violation-of-Expectation in Physical Reasoning Across Event Categories