ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

2026-06-02 19:21 · 30 days ago

AI Summary

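Large Reasoning Models (LRMs) have advanced under Reinforcement Learning with Verifiable Rewards (RLVR), but the trial and error and redundant exploration in long chains of thought get reinforced along the way, which leads to overthinking. ThoughtFold proposes a fine-grained preference learning framework: an introspective strategy identifies redundant segments within correct trajectories and generates a spectrum of candidate sub-trajectories, and a masked preference optimization objective explicitly penalizes redundant exploration and encourages the model to bridge key reasoning steps directly, thereby folding the reasoning chain. On DeepSeek-R1-Distill-Qwen-7B, it reduces token usage by approximately 56% while maintaining SOTA accuracy.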

Original text · untranslated

Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chains of Thought (CoTs). However, long CoTs naturally contain trial and error, and mainstream RLVR approaches select outcome-correct CoT trajectories for memorization, so the redundant explorations in long CoTs are inevitably reinforced, which results in the overthinking issue of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
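The abstract does not give the form of the masked preference optimization objective. As a rough illustration only, the sketch below assumes a DPO-style pairwise loss in which a 0/1 token mask decides which positions contribute to each trajectory's log-probability. The function names, the mask semantics, and the `beta` parameter are all hypothetical and are not taken from the paper.

```python
import math


def masked_sequence_logprob(token_logprobs, mask):
    """Sum per-token log-probabilities over positions where mask is 1."""
    if len(token_logprobs) != len(mask):
        raise ValueError("token_logprobs and mask must have the same length")
    return sum(lp for lp, m in zip(token_logprobs, mask) if m)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def masked_preference_loss(policy_chosen, policy_rejected,
                           ref_chosen, ref_rejected,
                           chosen_mask, rejected_mask, beta=0.1):
    """DPO-style loss on masked log-probabilities (an assumed form).

    'chosen' stands for a folded (concise) trajectory and 'rejected' for
    the same trajectory with a redundant segment kept. Each argument is a
    list of per-token log-probabilities under the policy or the frozen
    reference model; masked-out tokens (mask 0) are ignored.
    """
    chosen_margin = (masked_sequence_logprob(policy_chosen, chosen_mask)
                     - masked_sequence_logprob(ref_chosen, chosen_mask))
    rejected_margin = (masked_sequence_logprob(policy_rejected, rejected_mask)
                       - masked_sequence_logprob(ref_rejected, rejected_mask))
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits)
    return softplus(-logits)
```

Under this assumed form, the loss falls as the policy raises the probability of the unmasked tokens in the folded trajectory relative to the reference, and tokens with mask 0 (for example, a prefix shared by both trajectories) have no effect on it.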

HuggingFace Daily Papers (trending community papers)

Read the original · arxiv.org
