# PhysisForcing：面向机器人操作的物理增强世界模拟器

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-26 08:00
- AIHOT 分数：47
- AIHOT 链接：https://aihot.virxact.com/items/cmqylfgci02xyslivutm9id0e
- 原文链接：https://arxiv.org/abs/2606.28128

## AI 摘要

视频生成模型常生成物理不合理的操作。PhysisForcing 通过联合优化像素级和语义级特征，重点监督物理信息区域来强化物理一致性，包括像素级轨迹对齐损失和语义级关系对齐损失。在 R-Bench、PAI-Bench 和 EZS-Bench 上，PhysisForcing 一致提升基线模型：Wan2.2-I2V-A14B 和 Cosmos3-Nano 在 R-Bench 分别提升 22.3% 和 9.2%（优于普通微调的 7.1% 和 3.7%），Cosmos3-Nano 变体取得最佳总分。作为 WorldArena 世界模型，闭环成功率从 16.0% 提升至 24.0%，并改善下游策略。

## 正文

Video generation models have emerged as a promising paradigm for embodied world simulation. However, both general-domain video generators and robot-specific data fine-tuned models can still produce physically implausible manipulations, including discontinuous motion trajectories and inconsistent robot-object interactions, which limits their reliability as world simulators. Through extensive experiments, we find that such physical instability mainly arises from two factors: deformation of moving objects and implausible spatio-temporal correlations among interacting entities, particularly during contact. Building on this observation, we propose PhysisForcing, a scalable training framework that strengthens physical consistency by focusing supervision on physics-informative regions through joint optimization of pixel-level and semantic-level features. The framework consists of a pixel-level trajectory alignment loss, which supervises DiT features using reference point trajectories, and a semantic-level relational alignment loss, which aligns DiT features with inter-region relations extracted from a frozen video understanding encoder. Extensive experiments on R-Bench, PAI-Bench, and EZS-Bench show that PhysisForcing consistently improves embodied video generation over strong baselines, improving the Wan2.2-I2V-A14B and Cosmos3-Nano base models on R-Bench by 22.3\% and 9.2\% (7.1\% and 3.7\% over vanilla finetuning), with the Cosmos3-Nano variant attaining the best overall score. Beyond generation, as a world model under the WorldArena action-planner protocol it raises the closed-loop success rate from 16.0\% to 24.0\% and further improves downstream policy success, indicating that physically aligned video models yield stronger representations for robotic manipulation.