# DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-06 08:00
- AIHOT score: 54
- AIHOT link: https://aihot.virxact.com/items/cmqaa49oc0jk0slldyxcabxm5
- Original paper: https://arxiv.org/abs/2606.08035

## AI Summary

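Reinforcement Learning with Verifiable Rewards (RLVR) is the mainstream paradigm for strengthening visual reasoning in multimodal large language models, but existing methods optimize only the outcome and overlook fine-grained cross-modal coordination during generation. Token-level analysis shows that, during chain-of-thought reasoning, models fail to alternate dynamically between extracting visual evidence and synthesizing textual context, which leads to reasoning failures. DyCo-RL addresses this by building dynamic cross-modal coordination into RLVR optimization: it uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigns each token a visual or textual functional role, and reweights advantages according to how well the token's actual attention matches its assigned role. Applied to Qwen2.5-VL-3B/7B, DyCo-RL consistently improves four representative RLVR algorithms and yields gains on seven visual-centric and mathematical reasoning benchmarks.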

## Main Text

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning outcome, fundamentally overlooking the fine-grained cross-modal coordination required during the generation process. Through token-level analyses and controlled interventions, we reveal that during Chain-of-Thought (CoT) reasoning, MLLMs frequently fail to dynamically alternate between extracting visual evidence and synthesizing textual context, a coordination breakdown that is causally linked to reasoning failures. Motivated by these findings, we propose DyCo-RL, which integrates dynamic cross-modal coordination into RLVR optimization. Specifically, DyCo-RL uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigning tokens to either visually-oriented or text-oriented functional roles. It then evaluates the alignment between a token's actual attention allocation and its assigned role, leveraging this score for alignment-guided advantage reweighting during policy optimization. Extensive experiments demonstrate that the algorithm-agnostic DyCo-RL, when applied to Qwen2.5-VL-3B/7B, consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.
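
To make the "Fisher-Rao geodesic distance" between attention distributions concrete, here is a minimal sketch of its standard closed form for discrete distributions, d(p, q) = 2·arccos(Σᵢ √(pᵢ·qᵢ)), which ranges from 0 (identical) to π (disjoint support). The abstract does not say how DyCo-RL slices, normalizes, or thresholds attention, so the helper `within_modality_shifts`, the `num_visual` split point, and the assumption that both attention vectors cover the same fixed set of context positions are hypothetical choices made only for illustration, not the authors' implementation.

```python
import math


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two discrete distributions.

    Inputs are non-negative weights over the same positions; each is
    renormalized to sum to 1 before the distance is computed.
    """
    if len(p) != len(q) or len(p) == 0:
        raise ValueError("p and q must be non-empty and of equal length")
    if any(x < 0 for x in p) or any(x < 0 for x in q):
        raise ValueError("weights must be non-negative")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each input needs positive total mass")
    # Bhattacharyya coefficient of the normalized distributions.
    bc = sum(math.sqrt((a / total_p) * (b / total_q)) for a, b in zip(p, q))
    # Clamp to [0, 1] so rounding error cannot push acos out of its domain.
    bc = min(1.0, max(0.0, bc))
    return 2.0 * math.acos(bc)


def within_modality_shifts(attn_prev, attn_curr, num_visual):
    """Hypothetical helper: attention shift inside each modality.

    Assumes the first `num_visual` positions are visual tokens and the
    rest are text tokens, and that both vectors cover the same positions.
    Each slice is renormalized, so the result reflects how attention
    moved within a modality, not how much mass the modality received.
    """
    visual_shift = fisher_rao_distance(attn_prev[:num_visual],
                                       attn_curr[:num_visual])
    text_shift = fisher_rao_distance(attn_prev[num_visual:],
                                     attn_curr[num_visual:])
    return visual_shift, text_shift


if __name__ == "__main__":
    # Toy attention rows for two consecutive generated tokens:
    # two visual positions followed by two text positions.
    prev_step = [0.4, 0.1, 0.3, 0.2]
    curr_step = [0.1, 0.4, 0.3, 0.2]
    print(within_modality_shifts(prev_step, curr_step, num_visual=2))
```

In this toy input the attention over the two visual positions swaps while the text positions are unchanged, so the visual shift is large and the text shift is close to zero. A role-assignment rule could compare the two shifts, but the abstract does not specify one.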
