Future-L1: Interleaved Latent Visual Reasoning for Video Event Prediction

2026-06-04 08:00 · 29 days ago

AI Summary

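Future-L1 is an interleaved latent visual reasoning framework that lets an MLLM alternate between generating language tokens and continuous latent visual spans during autoregressive decoding. To train it, the authors build the Future-L1-50K dataset and optimize sampled trajectories with LA-DAPO, a latent-aware RL objective. On FutureBench, Future-L1 raises the Qwen3-VL-8B score from 61.0 to 85.4, exceeding the previous best, Video-CoE, by 10.4 points; on TwiFF-Bench, the average score rises from 2.44 to 3.04. The results suggest that keeping intermediate visual semantics in latent space, rather than converting them to text, benefits future-oriented video reasoning.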

Original text · untranslated

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
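The reported numbers can be cross-checked with a little arithmetic. The sketch below uses only the figures quoted in the abstract; the variable names are illustrative, and the Video-CoE score it derives is implied by the stated 10.4-point margin rather than reported directly, assuming both systems are scored on the same FutureBench metric.

```python
# Figures quoted in the abstract; variable names are illustrative only.
futurebench_base = 61.0        # Qwen3-VL-8B before Future-L1
futurebench_future_l1 = 85.4   # Qwen3-VL-8B with Future-L1
margin_over_video_coe = 10.4   # stated lead over the previous best (Video-CoE)

twiff_base = 2.44              # TwiFF-Bench average before
twiff_future_l1 = 3.04         # TwiFF-Bench average after

# Absolute gain on FutureBench, in points.
futurebench_gain = round(futurebench_future_l1 - futurebench_base, 1)

# Video-CoE score implied by the margin (derived, not stated in the abstract).
implied_video_coe = round(futurebench_future_l1 - margin_over_video_coe, 1)

# Absolute and relative gain on the TwiFF-Bench average score.
twiff_gain = round(twiff_future_l1 - twiff_base, 2)
twiff_relative_gain_pct = round(100 * twiff_gain / twiff_base, 1)

print(f"FutureBench gain: {futurebench_gain} points")
print(f"Implied Video-CoE score: {implied_video_coe}")
print(f"TwiFF-Bench gain: {twiff_gain} ({twiff_relative_gain_pct}% relative)")
```

Under these assumptions the FutureBench improvement over the base model is 24.4 points, the implied Video-CoE score is 75.0, and the TwiFF-Bench average rises by 0.6, roughly 24.6% in relative terms.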

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
