Multimodal Continuous Reasoning: Asymmetric Mutual Variational Learning

2026-07-01 08:00

AI Summary

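Multimodal large language models (MLLMs) are constrained by a language-space bottleneck. Continuous latent reasoning can avoid the perceptual loss of discrete tokens, but it suffers from a train-inference mismatch: during training the posterior exploits answer-dependent shortcuts, which forces the inference-time prior to mimic a posterior containing information that is unavailable at test time, and performance degrades. The paper proposes Asymmetric Mutual Variational Learning (AMVL), a framework that addresses this through bidirectional KL calibration: a forward KL trains the prior to match the posterior, while a reverse KL regularizes the posterior so that it does not collapse into inference-incompatible regions, mitigating "answer leakage". The theoretical analysis formalizes this leakage as prior contamination and proves that the dual-KL objective reduces it. On a latent-integrated MLLM, AMVL improves the average score on the complex BLINK benchmark by +10.83, with gains of up to +32.00 on individual reasoning tasks, and the analyses confirm improved latent-space stability.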

Original abstract

Multimodal Large Language Models (MLLMs) are often constrained by a language-space bottleneck, forcing complex visual reasoning into discrete tokens, which can lose perceptual nuance. A promising alternative is continuous latent reasoning, where the goal is to discover implicit reasoning pathways that bridge the multimodal query and the final answer. However, this introduces a severe train-inference mismatch: a training-time posterior, conditioned on the ground-truth answer, can exploit answer-dependent shortcuts. Standard variational training then forces the inference-time prior to mimic a posterior that has access to information unavailable at test time, leading to poor performance. To address this, we propose Asymmetric Mutual Variational Learning (AMVL), a framework that resolves this mismatch via a bidirectional calibration objective. A forward KL divergence trains the target-agnostic prior to match the posterior, while a novel reverse KL divergence simultaneously regularizes the posterior, preventing it from collapsing into inference-incompatible regions and mitigating this "answer leakage". We provide theoretical analysis formalizing this leakage as prior contamination and prove that our dual-KL objective reduces it. We instantiate AMVL in a latent-integrated MLLM and show that it consistently outperforms strong discrete and latent-reasoning baselines, improving the average score on the complex BLINK benchmark by +10.83 and achieving gains of up to +32.00 on individual reasoning tasks, with analyses confirming improved latent-space stability.
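
The bidirectional calibration idea described in the abstract can be illustrated with a small numeric sketch. The abstract does not give the exact form of the objective, so everything here is an assumption made for illustration: both distributions are modeled as diagonal Gaussians, "forward KL" is taken to mean KL(posterior || prior), "reverse KL" is taken to mean KL(prior || posterior), and the weight `beta` and the function names are hypothetical rather than taken from the paper.

```python
import math


def kl_diag_gaussian(mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form KL(N(mu_a, diag(sigma_a^2)) || N(mu_b, diag(sigma_b^2))) in nats."""
    total = 0.0
    for ma, sa, mb, sb in zip(mu_a, sigma_a, mu_b, sigma_b):
        total += math.log(sb / sa) + (sa ** 2 + (ma - mb) ** 2) / (2 * sb ** 2) - 0.5
    return total


def dual_kl_loss(posterior, prior, beta=0.5):
    """Illustrative dual-KL objective (assumed form, not the paper's exact loss).

    posterior: (mu, sigma) of the answer-conditioned latent distribution q.
    prior:     (mu, sigma) of the answer-agnostic latent distribution p.
    beta:      hypothetical weight on the reverse term.
    """
    q_mu, q_sigma = posterior
    p_mu, p_sigma = prior
    # Assumed "forward" term: pulls the prior toward the posterior.
    forward = kl_diag_gaussian(q_mu, q_sigma, p_mu, p_sigma)
    # Assumed "reverse" term: penalizes a posterior that leaves little mass
    # where the prior lives, i.e. one that drifts into regions the prior
    # cannot reach at inference time.
    reverse = kl_diag_gaussian(p_mu, p_sigma, q_mu, q_sigma)
    return forward + beta * reverse, forward, reverse


if __name__ == "__main__":
    # A sharp posterior far from a broad prior: the reverse term dominates.
    posterior = ([1.0, -0.5], [0.3, 0.3])
    prior = ([0.0, 0.0], [1.0, 1.0])
    total, forward, reverse = dual_kl_loss(posterior, prior, beta=0.5)
    print(f"forward={forward:.4f} reverse={reverse:.4f} total={total:.4f}")
```

In an actual training setup the two terms would typically be applied with different gradient paths (for example, a stop-gradient on the posterior in the forward term), but the abstract does not specify this, so the sketch only computes the values. The point it makes is that the two directions are not interchangeable: a narrow posterior that has drifted away from the prior is penalized much more heavily by the reverse term than by the forward one.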

HuggingFace Daily Papers (community trending papers)

Read the original paper · arxiv.org
