# SD-Zero: Turning Binary Rewards into Dense Supervision via Self-Revision

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-14 03:46
- AIHOT link: https://aihot.virxact.com/items/cmo1wbeci00aqslbaje64eppl
- Original link: https://arxiv.org/abs/2604.12002

## AI Summary

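The research team proposes SD-Zero, a training method in which a single model acts as both generator and reviser, turning binary rewards into dense token-level self-supervision. The method needs no external teacher or high-quality demonstrations. On math and code reasoning tasks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, it improves performance by at least 10% over the base models, with substantially better training sample efficiency than reinforcement learning baselines such as GRPO. The algorithm exhibits token-level self-localization and iterative self-evolution: the reviser can identify the key tokens that need correcting, and its revision ability is continually distilled back into the generator.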

## Body

Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser's token distributions conditioned on the generator's response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator's response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization.
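
The distillation step described above can be illustrated with a minimal Python sketch. The abstract does not state the exact loss, so this assumes a per-token forward KL divergence from the reviser's distribution to the generator's, computed at each position of the generator's response. The function names (`log_softmax`, `token_kl`, `self_distillation_losses`) and the toy logits are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """KL(teacher || student) for a single token position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def self_distillation_losses(
    reviser_logits: Sequence[Sequence[float]],
    generator_logits: Sequence[Sequence[float]],
) -> List[float]:
    """Per-token losses along the generator's response.

    Assumption: reviser_logits[t] comes from the same model prompted with the
    generator's response and its binary reward, while generator_logits[t]
    comes from the model prompted with the question alone.
    """
    if len(reviser_logits) != len(generator_logits):
        raise ValueError("both sequences must cover the same token positions")
    return [token_kl(t, s) for t, s in zip(reviser_logits, generator_logits)]


if __name__ == "__main__":
    # Toy vocabulary of 4 tokens, response of 3 tokens (made-up numbers).
    generator = [[4.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]
    # The reviser agrees at positions 0 and 2 but prefers another token at 1.
    reviser = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]
    for position, loss in enumerate(self_distillation_losses(reviser, generator)):
        print(f"position {position}: loss = {loss:.4f}")
```

In this toy setup the loss is zero wherever the reviser and generator agree and positive only at the position the reviser would change. That gives some intuition for how a single binary reward could become a dense, position-specific training signal, though the actual objective and training loop in SD-Zero may differ.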
