# Recoverable Program-of-Thought: RePoT, a Checkpoint-Based Repair Method

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmpr0qtg109vvslnof2b9et8o
- Original link: https://arxiv.org/abs/2605.30052

## AI Summary

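RePoT is a deterministic verified-replay method for repairing invalid actions produced during Program-of-Thought reasoning. When the trajectory emitted by the generated Python program hits an invalid state transition, RePoT rolls back to the verified prefix state and resumes reasoning with one additional LLM call. On the PuzzleZoo-775 benchmark, RePoT beats PoT by +3 to +11 percentage points and reaches 96.9% accuracy on gpt-5.4-mini-medium. Experiments on Derail-550, a controlled recovery benchmark, indicate that checkpoint information is the key recovery signal. A preliminary Adaptive RePoT uses a rule-based dispatcher to choose between repair and retry.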

## Main Text

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini -- a capability-scaling pattern we begin to address with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary). We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback -- showing that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
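
The replay-then-repair loop described above can be illustrated with a small sketch. This is not the authors' implementation: the 3x3 grid environment, the names `step`, `verified_replay`, `repair`, and `dispatch`, the `propose_suffix` callback standing in for the single extra LLM call, and the `min_fraction` threshold are all assumptions chosen for illustration. The abstract only says that the dispatcher routes on verified-prefix length, so the specific rule here is a guess at one plausible shape.

```python
from dataclasses import dataclass
from typing import Callable, Optional

State = tuple[int, int]  # (row, col) on a hypothetical 3x3 grid
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state: State, action: str) -> Optional[State]:
    """Toy environment: return the next state, or None if the move is invalid."""
    if action not in MOVES:
        return None
    row = state[0] + MOVES[action][0]
    col = state[1] + MOVES[action][1]
    return (row, col) if 0 <= row < 3 and 0 <= col < 3 else None


@dataclass
class Replay:
    prefix: list[str]         # actions that were verified, in order
    checkpoint: State         # state reached after the verified prefix
    failed_at: Optional[int]  # index of the first invalid action, None if all valid


def verified_replay(start: State, plan: list[str]) -> Replay:
    """Deterministically walk the plan until the first invalid transition."""
    state, prefix = start, []
    for i, action in enumerate(plan):
        nxt = step(state, action)
        if nxt is None:
            return Replay(prefix, state, i)
        state = nxt
        prefix.append(action)
    return Replay(prefix, state, None)


def repair(
    start: State,
    plan: list[str],
    propose_suffix: Callable[[State, list[str], str], list[str]],
) -> list[str]:
    """Keep the verified prefix and ask once for a new suffix from the checkpoint.

    `propose_suffix` is a stand-in for the one extra LLM call (an assumption).
    """
    replay = verified_replay(start, plan)
    if replay.failed_at is None:
        return plan  # nothing to repair, so no extra call is made
    bad_action = plan[replay.failed_at]
    return replay.prefix + propose_suffix(replay.checkpoint, replay.prefix, bad_action)


def dispatch(replay: Replay, plan_len: int, min_fraction: float = 0.5) -> str:
    """Hypothetical rule: repair long verified prefixes, retry short ones."""
    if replay.failed_at is None:
        return "accept"
    return "repair" if len(replay.prefix) >= min_fraction * plan_len else "retry"


# Usage: the third "right" walks off the grid, so replay stops at index 2.
plan = ["right", "right", "right", "down", "down"]
result = verified_replay((0, 0), plan)
fixed = repair((0, 0), plan, lambda checkpoint, prefix, bad: ["down", "down"])
```

In this sketch a repaired plan is not re-verified, and a plan that is fully valid but misses the goal is returned unchanged; a fuller version would likely replay the repaired plan and check a goal predicate as well.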
