# Physically Consistent Video Generation via Agentic Planning

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-18 08:00
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmpc80hvh00aisl6k0lne3ylu
- Original link: https://arxiv.org/abs/2605.18396

## AI Summary

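This work addresses the frequent violation of physical commonsense by video generation models. The analysis finds that text prompts, being a lossy compression of the physical world, are the fundamental bottleneck behind physically inconsistent outputs. In response, we propose NEWTON, whose core idea is to demote video generation from a standalone system output to one action in an agent's toolbox. A learned planner coordinates physics-aware tools such as keyframe generation and scientific computation to build rich conditioning, and a verifier enables closed-loop iterative refinement. Without modifying the underlying generators, the system raises joint accuracy on the VideoPhy-2 benchmark by 8.3 percentage points for LTX-Video and 6.7 percentage points for Veo-3.1, a marked improvement in physical consistency.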

## Main Text

Video generation models produce visually compelling results but systematically violate physical commonsense: on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck: text prompts are a lossy compression of the physical world, omitting the parameters that fully determine the dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy (sufficiency, dynamism, and verifiability) and show that no existing approach satisfies all three. We present NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop. On VideoPhy-2, NEWTON improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying either generator. Project page: https://Newton026.github.io/newton
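
The planner, tool, and verifier loop described above can be illustrated with a toy sketch. This is not NEWTON's implementation: every name here (`EpisodeState`, `stub_planner`, `stub_verifier`, `run_episode`, the conditioning keys) is hypothetical, the tools only write placeholder strings, and a hand-written rule stands in for the learned planner that the abstract says is trained with Flow-GRPO. The sketch only shows the control flow, in which video generation is one tool call among several and a rejected video triggers re-planning.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class EpisodeState:
    """Everything the (hypothetical) planner can see during one episode."""
    prompt: str
    conditioning: Dict[str, str] = field(default_factory=dict)
    video: Optional[str] = None
    feedback: List[str] = field(default_factory=list)  # verifier output
    trace: List[str] = field(default_factory=list)     # tools called so far


# Placeholder tools: each one only records that it ran.
def prompt_refinement(state: EpisodeState) -> None:
    state.conditioning["refined_prompt"] = state.prompt + " (parameters made explicit)"


def scientific_computation(state: EpisodeState) -> None:
    state.conditioning["physics"] = "placeholder simulated trajectory"


def keyframe_generation(state: EpisodeState) -> None:
    state.conditioning["keyframes"] = "placeholder keyframes"


def video_generation(state: EpisodeState) -> None:
    used = ", ".join(sorted(state.conditioning))
    state.video = f"video conditioned on [{used}]"


TOOLS: Dict[str, Callable[[EpisodeState], None]] = {
    "prompt_refinement": prompt_refinement,
    "scientific_computation": scientific_computation,
    "keyframe_generation": keyframe_generation,
    "video_generation": video_generation,
}

# Assumption: the verifier complains about missing conditioning keys.
REQUIRED = ("refined_prompt", "physics", "keyframes")
KEY_TO_TOOL = {
    "refined_prompt": "prompt_refinement",
    "physics": "scientific_computation",
    "keyframes": "keyframe_generation",
}


def stub_planner(state: EpisodeState) -> str:
    """Rule-based stand-in for a learned planner: fix the first open complaint."""
    for key in state.feedback:
        if key not in state.conditioning:
            return KEY_TO_TOOL[key]
    return "video_generation"


def stub_verifier(state: EpisodeState) -> List[str]:
    """Return the conditioning keys still missing; an empty list means accept."""
    return [key for key in REQUIRED if key not in state.conditioning]


def run_episode(prompt: str, max_turns: int = 8) -> EpisodeState:
    state = EpisodeState(prompt=prompt)
    for _ in range(max_turns):
        tool_name = stub_planner(state)
        TOOLS[tool_name](state)
        state.trace.append(tool_name)
        if tool_name == "video_generation":
            state.feedback = stub_verifier(state)
            if not state.feedback:
                return state       # verifier accepted the video
            state.video = None     # rejected: discard and re-plan
    return state


if __name__ == "__main__":
    final = run_episode("a ball rolls down a ramp")
    print(final.trace)
    print(final.video)
```

In this toy run the first, unconditioned video is rejected, the planner then calls the three conditioning tools in response to the verifier's feedback, and the second generation is accepted. In a real system the verifier would presumably inspect the generated video itself rather than the conditioning dictionary.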
