通过可追踪轨迹控制学习结构化推理
阅读原文· machinelearning.apple.com大语言模型可涌现推理行为,但复杂推理轨迹在无约束采样中稀疏,标准强化学习难以保证多样性。Ctrl-R框架通过可追踪轨迹控制主动引导rollout,激励探索多样推理模式,并利用重要性采样实现无偏on-policy优化,引入重要性采样权重的幂缩放因子以选择性学习分布外轨迹。实验表明,Ctrl-R在语言和视觉-语言模型的数学推理任务上均取得一致改进。
Learning Structured Reasoning via Tractable Trajectory Control
AuthorsPo-Nien Kung†, Zhen Yang, Jeffrey Luo†, Cheng-Fu Yang†, Haikang Deng†, Zi-Yi Dou†**, Yinfei Yang**, Nanyun Peng†, Zhe Gan, Kai-Wei Chang†
Large language models can exhibit emergent reasoning behaviors, often manifested as recurring lexical patterns (e.g., “wait,” indicating verification). However, complex reasoning trajectories remain sparse in unconstrained sampling, and standard RL often fails to guarantee the acquisition of diverse reasoning behaviors. We propose a systematic discovery and reinforcement of diverse reasoning patterns through structured reasoning, a paradigm that requires targeted exploration of specific reasoning patterns during the RL process. To this end, we propose Ctrl-R, a framework for learning structured reasoning via tractable trajectory control that actively guides the rollout process, incentivizing the exploration of diverse reasoning patterns that are critical for complex problem-solving. The resulting behavior policy enables accurate importance-sampling estimation, supporting unbiased on-policy optimization. We further introduce a power-scaling factor on the importance-sampling weights, allowing the policy to selectively learn from exploratory, out-of-distribution trajectories while maintaining stable optimization. Experiments demonstrate that Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns, yielding consistent improvements across language and vision–language models on mathematical reasoning tasks.
- † University of California, Los Angeles
- ** Work done while at Apple
Related readings and updates.
Interleaved Reasoning for Large Language Models via Reinforcement Learning
May 28, 2025research area Knowledge Bases and Search, research area Speech and Natural Language Processing
Long chain-of-thought (CoT) significantly enhances large language models’ (LLM) reasoning capabilities. However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved…
Bootleg: Self-Supervision for Named Entity Disambiguation
June 25, 2021research area Knowledge Bases and Search, research area Speech and Natural Language Processingconference CIDR
A challenge for named entity disambiguation (NED), the task of mapping textual mentions to entities in a knowledge base, is how to disambiguate entities that appear rarely in the training data, termed tail entities. Humans use subtle reasoning patterns based on knowledge of entity facts, relations, and types to disambiguate unfamiliar entities. Inspired by these patterns, we introduce Bootleg, a self-supervised NED system that is explicitly…

Discover opportunities in Machine Learning.
Our research in machine learning breaks new ground every day.