数据配方显著提升大语言模型长上下文推理能力
阅读原文· arxiv.org该研究提出一种仅需最小化结果导向GRPO设置的数据配方,即可显著提升大语言模型的长上下文推理能力。配方针对检索、多证据合成与推理三类互补任务,构建并筛选8个数据集共约14K样本。在Qwen3-4B、8B及30B-A3B三个模型上,该配方在7项长上下文基准测试中平均分别提升+7.2、+3.2、+6.4分,超越此前强化学习训练集。这些增益可迁移至智能体任务:在已微调的模型上继续训练,使GAIA提升+4.8分、BrowseComp提升+7.0分。数据集将开源。
Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.