# SCOPE: Simulating Cross-Game Operations in Playable Environments for FPS World Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-22 08:00
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmpkmzjm808f2sl017ypwb417
- Original link: https://arxiv.org/abs/2605.23345

## AI Summary

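To handle the high-frequency, overlapping control signals in FPS games, SCOPE inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content, which separates in-scope effects from out-of-scope generation without segmentation labels. The accompanying CrossFPS dataset is the first multi-game FPS dataset with frame-aligned action telemetry; it comprises 69K clips from 7 titles and provides 10-DoF controller signals. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.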

## Main Text

Interactive world models for first-person shooter (FPS) games must resolve high-frequency overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, failing under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels. We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals, curated to remove gameplay bias. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.
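The per-pixel temporal reshaping described above can be illustrated with a small NumPy sketch. This is not the paper's implementation. It assumes a single-head cross-attention from visual features to per-frame action vectors, a causal mask, and a residual connection, none of which the abstract confirms. The function name `per_pixel_action_conditioning` and the weights `w_q`, `w_k`, `w_v` are hypothetical and exist only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def per_pixel_action_conditioning(feats, actions, w_q, w_k, w_v):
    """Hypothetical sketch of per-pixel action conditioning.

    feats:   (T, H, W, C) video features for T frames
    actions: (T, A) one action vector per frame (e.g. A = 10 for 10-DoF)
    w_q: (C, D), w_k: (A, D), w_v: (A, C) illustrative projection weights
    Returns an array with the same shape as feats.
    """
    T, H, W, C = feats.shape
    # (T, H, W, C) -> (H*W, T, C): one temporal sequence per spatial position.
    seq = feats.transpose(1, 2, 0, 3).reshape(H * W, T, C)

    q = seq @ w_q      # (H*W, T, D): queries come from local visual content
    k = actions @ w_k  # (T, D): keys come from the action signal
    v = actions @ w_v  # (T, C): values come from the action signal

    scores = q @ k.T / np.sqrt(k.shape[-1])  # (H*W, T, T)
    # Assumed causal mask: frame t only sees actions at frames <= t.
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)

    response = softmax(scores, axis=-1) @ v  # (H*W, T, C)
    out = seq + response                     # assumed residual connection
    # Undo the reshape: (H*W, T, C) -> (T, H, W, C).
    return out.reshape(H, W, T, C).transpose(2, 0, 1, 3)


# Toy usage with random tensors.
rng = np.random.default_rng(0)
T, H, W, C, A, D = 4, 3, 5, 8, 10, 6
feats = rng.normal(size=(T, H, W, C))
actions = rng.normal(size=(T, A))
w_q = rng.normal(size=(C, D))
w_k = rng.normal(size=(A, D))
w_v = rng.normal(size=(A, C))
out = per_pixel_action_conditioning(feats, actions, w_q, w_k, w_v)
print(out.shape)
```

Because the queries are computed independently at each spatial position, perturbing the features of one pixel changes the action response only at that pixel. That locality is what would let a trained module respond differently inside and outside the weapon region without segmentation labels. In this sketch the weights are random, so no such separation is actually learned.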
