# LatentOmni: Rethinking Omnimodal Understanding via Unified Audio-Visual Latent Reasoning

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-21 08:00
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmpggte670fbcsljwa8vg7q7a
- Original link: https://arxiv.org/abs/2605.22012

## AI Summary

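Current multimodal large language models are limited in joint audio-visual reasoning because they compress continuous signals into discrete text, which weakens temporal grounding. To address this, the paper proposes LatentOmni, a framework that builds a unified latent space to preserve dense sensory information and interleaves textual reasoning with audio-visual latent-state updates. The method introduces feature-level supervision to align reasoning states with sensory features and uses Omni-Sync Position Embedding to keep the audio and visual latent states temporally consistent. The authors also construct LatentOmni-Instruct-35K, a dataset of 35K trajectories. Experiments show that LatentOmni achieves the best performance among the evaluated open-source models on multiple benchmarks and outperforms an explicit text chain-of-thought baseline.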

## Full Text

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
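
The abstract says feature-level supervision aligns latent reasoning states with task-relevant sensory features, but it does not state the loss. As a rough illustration only, the sketch below assumes a cosine-distance objective between paired vectors. The function names, the plain-list vector representation, and the choice of cosine distance are all assumptions made for this example, not details from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_alignment_loss(latent_states, target_features):
    """Hypothetical alignment loss: mean (1 - cosine similarity) over pairs.

    latent_states:   list of latent reasoning-state vectors (assumed shape).
    target_features: list of sensory feature vectors paired one-to-one with them.
    The pairing and the cosine form are illustrative assumptions.
    """
    if len(latent_states) != len(target_features):
        raise ValueError("latent states and target features must be paired")
    if not latent_states:
        raise ValueError("at least one pair is required")
    total = sum(
        1.0 - cosine_similarity(h, f)
        for h, f in zip(latent_states, target_features)
    )
    return total / len(latent_states)


if __name__ == "__main__":
    latents = [[1.0, 0.0], [0.0, 2.0]]
    targets = [[2.0, 0.0], [3.0, 0.0]]
    # First pair is parallel (distance 0), second is orthogonal (distance 1).
    print(feature_alignment_loss(latents, targets))
```

Under this assumed form, the loss is 0 when every latent state points in the same direction as its paired feature and grows as they diverge. In training it would be added to the usual next-token objective, though how the paper weights or combines the terms is not given in the abstract.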
