Compress-Distill: Reasoning-Trace Compression for Efficient Knowledge Distillation
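Source: arxiv.org. Two teacher models, Qwen3.5-397B-A17B and gpt-oss-120B, each generate about 283k correct traces, which instruction-tuned models compress to 8.6-21.0% of their original character length. Compressed traces cut training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x. However, raw traces retain the highest downstream accuracy at every scale; students trained on compressed traces retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency. At the 0.8B student scale with LoRA, compressed traces narrow the gap to raw traces but do not exceed raw.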
Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x, with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA, compressed traces narrow the raw-vs-compressed gap but do not exceed raw.
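The two headline quantities, character-level compression ratio and per-token efficiency, can be illustrated with a minimal sketch. The abstract does not define per-token efficiency; the code below assumes it is downstream accuracy divided by mean inference output length in tokens. The function names and all numbers are hypothetical and are not taken from the paper's results.

```python
def compression_ratio(raw_trace: str, compressed_trace: str) -> float:
    """Compressed length as a fraction of the raw length, measured in characters."""
    if not raw_trace:
        raise ValueError("raw_trace must be non-empty")
    return len(compressed_trace) / len(raw_trace)


def per_token_efficiency(accuracy: float, mean_output_tokens: float) -> float:
    """ASSUMED definition: accuracy per generated output token."""
    if mean_output_tokens <= 0:
        raise ValueError("mean_output_tokens must be positive")
    return accuracy / mean_output_tokens


def relative_efficiency(acc_compressed: float, tokens_compressed: float,
                        acc_raw: float, tokens_raw: float) -> float:
    """Per-token efficiency of the compressed-trace student relative to the raw-trace student."""
    return (per_token_efficiency(acc_compressed, tokens_compressed)
            / per_token_efficiency(acc_raw, tokens_raw))


# Hypothetical traces: a 1000-character raw trace compressed to 150 characters.
ratio = compression_ratio("x" * 1000, "x" * 150)

# Hypothetical students: the compressed-trace student keeps 96% of the raw
# accuracy (0.48 vs 0.50) while emitting outputs 19x shorter (500 vs 9500 tokens).
gain = relative_efficiency(0.48, 500, 0.50, 9500)

print(f"compression ratio: {ratio:.3f}")
print(f"relative per-token efficiency: {gain:.2f}x")
```

Under this assumed definition, the relative gain is the retained-accuracy fraction multiplied by the output-length reduction factor, which is why a student can lose some accuracy and still be far more efficient per token. The reported maxima (96%, 19x, 18x) need not come from the same run.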