# SWITCH：可切换潜在推理框架

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-11 08:00
- AIHOT 分数：59
- AIHOT 链接：https://aihot.virxact.com/items/cmqac9xuf0k4nslldc7fk5an0
- 原文链接：https://arxiv.org/abs/2606.13106

## AI 摘要

SWITCH利用一对显式边界token（<swi>入口和</swi>出口）将隐藏状态递归块与标准同策略RL（GRPO）兼容。模型通过可见到潜在的课程学习和Switch-GRPO目标训练，在类似规模下一致优于先前隐藏状态递归潜在推理方法。机制分析通过边界token揭示三个发现：入口token是学习到的局部切换策略而非风格化伪影；打开的潜在步骤执行问题特定且因果重要的计算；该计算集中在进入时的单个隐藏状态转换上。表明隐藏状态递归潜在推理既可同策略RL训练也可进行直接机制分析。

## 正文

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits to enter latent mode and to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.