# ContextRL: Context-Aware Reinforcement Learning for Agents and Multimodal Large Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-15 08:00
- AIHOT score: 46
- AIHOT link: https://aihot.virxact.com/items/cmql4aq2p011ssllupolxa6gj
- Original link: https://arxiv.org/abs/2606.17053

## AI Summary

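ContextRL is a context-aware reinforcement learning method that improves long-context and fine-grained multimodal understanding by having the model choose, from two similar contexts, the one that supports a query-answer pair. For coding agents, it builds 1k contrastive pairs from trajectories; for multimodal reasoning, it builds 7k pairs from images. It delivers an average gain of +2.2% on 5 long-horizon reasoning benchmarks and an average gain of +1.8% on 12 multimodal visual question answering benchmarks. Baselines that use the same data only as standard examples show almost no improvement, indicating that the gains come from the context-selection objective rather than from the additional data.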

## Main Text

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1k pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7k pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
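
The context-selection objective described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the `Answer: A/B` output format, the binary 0/1 reward, and every name below (`ContrastivePair`, `build_selection_prompt`, `selection_reward`, `group_relative_advantages`) are assumptions made for illustration. The sketch handles text contexts only; an image context would need a multimodal prompt instead. The last function shows the group-relative reward normalization commonly associated with GRPO, which may differ in detail from what the paper uses.

```python
import random
import re
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class ContrastivePair:
    """One hypothetical training item: a query-answer pair and two similar contexts."""
    query: str
    answer: str
    supporting_context: str   # context that supports the query-answer pair
    distractor_context: str   # highly similar context that does not


def build_selection_prompt(pair: ContrastivePair, rng: random.Random) -> tuple[str, str]:
    """Return (prompt, correct_label), shuffling the context order so that
    position alone carries no signal. The prompt wording is an assumption."""
    swap = rng.random() < 0.5
    first, second = (
        (pair.distractor_context, pair.supporting_context)
        if swap
        else (pair.supporting_context, pair.distractor_context)
    )
    correct_label = "B" if swap else "A"
    prompt = (
        f"Query: {pair.query}\n"
        f"Answer: {pair.answer}\n\n"
        f"Context A:\n{first}\n\n"
        f"Context B:\n{second}\n\n"
        "Which context supports the query-answer pair? "
        "Finish with 'Answer: A' or 'Answer: B'."
    )
    return prompt, correct_label


def selection_reward(response: str, correct_label: str) -> float:
    """Assumed binary reward: 1.0 if the last 'Answer: X' names the supporting context."""
    matches = re.findall(r"Answer:\s*([AB])\b", response)
    return 1.0 if matches and matches[-1] == correct_label else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group of sampled responses (GRPO-style).
    If all rewards are equal, there is no learning signal and every advantage is zero."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

In a training loop, each prompt would be sampled several times, each response scored with `selection_reward`, and the resulting group passed to `group_relative_advantages` before the policy update. The data-augmentation baseline in the abstract would instead feed only the supporting context with the query and supervise the answer, with no selection step.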