# Physics Question Scene Graph：文本到视频生成物理合理性细粒度评估方法

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-24 08:00
- AIHOT 分数：48
- AIHOT 链接：https://aihot.virxact.com/items/cmqwqyy1k0074slikv18kfu9a
- 原文链接：https://arxiv.org/abs/2606.25306

## AI 摘要

论文提出 Physics Question Scene Graph (PQSG)，一种层级问题图评估方法，利用 VLM 生成带逻辑依赖的问题图，从对象、动作和物理定律三个维度细粒度检查生成视频。为验证方法，构建了 FinePhyEval 数据集，包含来自 Sora 2、Veo 3 和 Wan 2.1 的生成视频及人工标注。PQSG 的细粒度评分与人类判断相关性优于以往方法，且闭源模型物理真实性排名高于 Wan 2.1。此外，FinePhyEval 标注可用于子任务评估：两个强 VLM 能生成类人问题，但回答准确率仍不及人类。

## 正文

Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this is a lack of reliable granular evaluation methods for localizing and specifying physical law violations in videos. We address this by introducing Physics Question Scene Graph (PQSG), a hierarchical question-based evaluation pipeline. PQSG evaluates generated videos by checking their faithfulness to a prompt across objects, actions, and adherence to physical laws using a graph-based hierarchy of questions generated by a vision-language model (VLM), guided by high-quality in-context examples. By representing questions as a graph, PQSG introduces logical dependencies within questions, ensuring that each query is contextually valid. Moreover, PQSG provides granular assessments of which qualities of the video violate physical plausibility constraints. We validate PQSG by creating FinePhyEval, a dataset with physics-based prompts and corresponding generated videos from diverse state-of-the-art video generation models (Sora 2, Veo 3, and Wan 2.1), with each video annotated across multiple categories by humans. Using FinePhyEval, we measure the correlation between PQSG's fine-grained scores and human judgments, showing higher overall correlations than prior work. We also find that PQSG ranks closed-source models higher than Wan 2.1 on physical realism. Lastly, we show that the annotations we provide in FinePhyEval can also be used for subtask evaluation: we benchmark two strong VLMs on generating and answering questions, finding that while models can create human-like questions, they still fall short of human performance in answering them.