# TEMPO: Scaling Test-Time Training for Large Reasoning Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-21 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo9h6xfu02h5sls2oq99e6l7
- Original link: https://arxiv.org/abs/2604.19295

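## AI Summary

The TEMPO framework formalizes test-time training through the Expectation-Maximization (EM) algorithm, alternating policy optimization with periodic critic recalibration. This addresses the performance plateaus and diversity collapse that existing methods suffer when their self-generated reward signal drifts. Validated on the OLMO3 and Qwen3 model families, it raises OLMO3-7B accuracy on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8% while maintaining high generation diversity, so that additional test-time compute continues to pay off.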
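The alternation described in the summary can be sketched as a plain control loop. This is only a minimal illustration of the schedule, not the authors' implementation: the function name `run_alternating_ttt`, the `recalibrate_every` parameter, and the two callbacks are hypothetical, and the summary does not say how often recalibration happens or what each update computes.

```python
from typing import Callable, List


def run_alternating_ttt(
    num_steps: int,
    recalibrate_every: int,
    policy_step: Callable[[int], None],
    recalibrate_critic: Callable[[int], None],
) -> List[str]:
    """Interleave policy updates with periodic critic recalibration.

    All names here are hypothetical. Returns the ordered list of phases
    that were executed, so the schedule can be inspected.
    """
    if recalibrate_every <= 0:
        raise ValueError("recalibrate_every must be positive")

    phases: List[str] = []
    for step in range(1, num_steps + 1):
        # Policy refinement on unlabeled test questions (placeholder callback).
        policy_step(step)
        phases.append("policy")

        # Every `recalibrate_every` steps, recalibrate the critic on the
        # labeled dataset (placeholder callback).
        if step % recalibrate_every == 0:
            recalibrate_critic(step)
            phases.append("critic")
    return phases


# Usage with no-op callbacks, just to show the resulting schedule.
phases = run_alternating_ttt(6, 3, lambda s: None, lambda s: None)
print(phases)
# ['policy', 'policy', 'policy', 'critic', 'policy', 'policy', 'policy', 'critic']
```

In this framing, a method that never recalibrates corresponds to never invoking `recalibrate_critic`, which is how the source characterizes prior approaches.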

## Full Text

Test-time training (TTT) adapts model parameters on unlabeled test instances during inference time, which continuously extends capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for LRMs plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
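For readers unfamiliar with the term, the textbook EM decomposition shows what "tightening the ELBO" means. The symbols below ($o$ for observed data, $z$ for a latent variable, $q$ for the variational distribution, $\theta$ for model parameters) are generic and are an assumption for illustration; the abstract does not give the paper's own notation or say which TEMPO component plays which role.

```latex
\log p_\theta(o)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log p_\theta(o, z) - \log q(z)\right]}_{\mathrm{ELBO}(q,\,\theta)}
  + \mathrm{KL}\!\left(q(z)\,\|\,p_\theta(z \mid o)\right)
```

Because the KL term is non-negative, the ELBO never exceeds $\log p_\theta(o)$. A full E-step sets $q(z) = p_\theta(z \mid o)$, which drives the KL term to zero and makes the bound tight; the M-step then maximizes the ELBO over $\theta$. If $\theta$ keeps changing while $q$ is never refreshed, the KL gap is free to grow. One reading of the abstract is that this is the sense in which methods without the recalibration step are incomplete EM variants.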