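A new Meta paper finds that post-training quantization shrinks reasoning models and lowers deployment cost, but it also causes models to keep doubting themselves after they have already reached the correct answer, wasting tokens. Quantization introduces noise at uncertain word choices, making the model more likely to reopen its reasoning with words such as "wait," "but," and "alternatively." Across 5 reasoning models (1.5B-32B) on math, coding, and science tasks, aggressive quantization pushes the overthinking failure rate as high as 52%. Applying a small penalty to 50 hesitation words trims 12%-23% of the reasoning length while preserving or even improving accuracy.
A paper from Meta shows that quantized reasoning models often fail because they keep doubting a correct answer instead of finishing.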
Many of them reason well enough, but compression makes them hesitate at the wrong time.
The problem is that post-training quantization, a way to shrink models after training, can make reasoning models cheaper to run but worse at finishing cleanly.
The authors found that aggressive quantization does not simply make models less capable: in many failures the model had already reached the right answer and then second-guessed itself.
Their core idea is that quantization adds noise at uncertain word choices, so the model becomes more likely to pick words like "wait," "but," or "alternatively" that reopen the problem.
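To make the mechanism concrete, here is a minimal toy sketch of how a small perturbation at a near-tie can flip the next word toward "wait," and how a small logit penalty on hesitation words can flip it back. Everything in it is an assumption for illustration: the three-word vocabulary, the logit values, the hand-picked perturbation standing in for quantization noise, the penalty size of 0.5, and the helper names `greedy_pick` and `penalize_hesitation`. It is not the paper's implementation, and the paper's actual list of 50 words is not reproduced here.

```python
# Hypothetical subset of hesitation words (the source names these three
# as examples; the full list of 50 is not given).
HESITATION_WORDS = {"wait", "but", "alternatively"}


def greedy_pick(logits):
    """Return the token with the highest logit (greedy decoding)."""
    return max(logits, key=logits.get)


def penalize_hesitation(logits, penalty=0.5, words=HESITATION_WORDS):
    """Subtract a fixed penalty from the logit of every hesitation word.

    `penalty` is an assumed value chosen for this toy example.
    """
    return {
        tok: (score - penalty if tok.strip().lower() in words else score)
        for tok, score in logits.items()
    }


# Toy next-token logits at an uncertain step: "therefore" barely beats "wait".
full_precision = {"therefore": 2.0, "wait": 1.9, "so": 1.0}

# Assumed small perturbation standing in for quantization noise.
# At a near-tie, a shift of 0.1 is enough to change the winner.
quantized = {"therefore": 1.9, "wait": 2.0, "so": 1.0}

pick_fp = greedy_pick(full_precision)
pick_q = greedy_pick(quantized)
pick_q_penalized = greedy_pick(penalize_hesitation(quantized))

print(pick_fp, pick_q, pick_q_penalized)
```

In this toy setup the full-precision logits pick "therefore," the perturbed logits pick "wait," and the penalized logits pick "therefore" again. Tokens that are far ahead of the hesitation words are unaffected by the penalty, which is one plausible reading of why a small penalty could shorten reasoning without hurting accuracy.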