# Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-02 11:42
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmpxn3gho05i9slckkhvfu18s
- Original link: https://arxiv.org/abs/2606.03102

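## AI Summary

This work models adaptive sampling for test-time scaling of large language model reasoning as a Markov decision process and uses reinforcement learning to train a lightweight sampling controller. At each round, the controller decides whether to stop sampling or to acquire more samples. It relies only on statistics of the final answers, jointly trades off answer correctness, latency, and computation cost, and can be trained and deployed on CPU. Experiments are run on Qwen2.5-7B and Llama-3.1-8B; compared with strong baselines such as ASC, the method achieves a better trade-off among correctness, number of sampling rounds, and total number of samples.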

## Full Text

Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP). We train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. Our method is lightweight: it relies only on statistics of the final answers and can be trained and deployed on CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves improved trade-offs among answer correctness, number of sampling rounds, and total number of samples required.
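The stop-or-continue decision described above can be illustrated with a very small tabular controller. The sketch below is not the paper's implementation; everything in it is an assumption made for illustration: the state (round index plus a bucketed vote share of the most common answer), the batch size, the round cap, the penalty weights `LAM_ROUND` and `LAM_SAMPLE`, the `toy_sampler` that stands in for an LLM call, and the use of constant-step Monte Carlo control as the RL algorithm. The reward is written as correctness minus priced rounds and samples, which is one way to read the Lagrangian-relaxation view of budget constraints.

```python
import random
from collections import Counter, defaultdict

STOP, CONTINUE = 0, 1
BATCH, MAX_ROUNDS = 4, 5              # assumed: samples per round, round cap
LAM_ROUND, LAM_SAMPLE = 0.02, 0.005   # assumed prices for latency and compute


def state_of(answers, rounds):
    """Discretize final-answer statistics: (round index, top-answer vote-share bucket)."""
    top_count = Counter(answers).most_common(1)[0][1]
    return rounds, int(10 * top_count / len(answers))


def reward(correct, rounds, samples):
    """Lagrangian-style reward: correctness minus priced rounds and samples."""
    return float(correct) - LAM_ROUND * rounds - LAM_SAMPLE * samples


def toy_sampler(rng, p_correct):
    """Hypothetical stand-in for an LLM call; 'A' is the correct answer."""
    return "A" if rng.random() < p_correct else rng.choice("BCD")


def run_episode(q, rng, epsilon):
    """Answer one question; return the (state, action) trajectory and final reward."""
    p_correct = rng.choice([0.35, 0.6, 0.9])  # hidden per-question difficulty
    answers, trajectory = [], []
    for rounds in range(1, MAX_ROUNDS + 1):
        answers += [toy_sampler(rng, p_correct) for _ in range(BATCH)]
        if rounds == MAX_ROUNDS:
            break  # budget exhausted: forced stop, no decision to record
        s = state_of(answers, rounds)
        if rng.random() < epsilon:
            a = rng.choice([STOP, CONTINUE])          # explore
        else:
            a = max((STOP, CONTINUE), key=lambda act: q[s][act])  # exploit
        trajectory.append((s, a))
        if a == STOP:
            break
    majority = Counter(answers).most_common(1)[0][0]
    return trajectory, reward(majority == "A", rounds, len(answers))


def train(episodes=20000, alpha=0.05, epsilon=0.2, seed=0):
    """Constant-step Monte Carlo control over a tabular Q function (CPU only)."""
    rng = random.Random(seed)
    q = defaultdict(lambda: [0.0, 0.0])  # q[state] = [value of STOP, value of CONTINUE]
    for _ in range(episodes):
        trajectory, g = run_episode(q, rng, epsilon)
        for s, a in trajectory:
            # The only reward arrives at the end, so g is the return of every step.
            q[s][a] += alpha * (g - q[s][a])
    return q


if __name__ == "__main__":
    q = train()
    for s in sorted(q):
        action = "stop" if q[s][STOP] >= q[s][CONTINUE] else "continue"
        print(s, action)
```

Running the script prints the greedy action for each visited state. In this toy setup, raising `LAM_ROUND` or `LAM_SAMPLE` makes further sampling more expensive relative to the correctness reward, which is the trade-off the controller is meant to learn.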
