Rohan Paul @rohanpaul_ai

2026-07-02 04:39

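AI summary

A new Meta paper finds that post-training quantization shrinks reasoning models and lowers deployment cost, but it also makes models repeatedly doubt themselves after they have already reached the correct answer, which wastes tokens. Quantization adds noise at uncertain word choices, so the model becomes more inclined to reopen its reasoning with words such as "wait", "but", and "alternatively". Across 5 reasoning models (1.5B to 32B) on math, coding, and science tasks, aggressive quantization pushed the overthinking failure rate as high as 52%. Applying a small penalty to 50 hesitation words trimmed reasoning length by 12% to 23% while maintaining or even improving accuracy.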

A paper from Meta shows that quantized reasoning models often fail because they keep doubting a correct answer instead of finishing.

Many of them reason well enough, but compression makes them hesitate at the wrong time.

The problem is that post-training quantization, a way to shrink models after training, can make reasoning models cheaper to run but worse at finishing cleanly.
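For readers new to post-training quantization, here is a minimal sketch of one common baseline, symmetric round-to-nearest quantization with a single scale per tensor. This is an illustrative assumption, not one of the specific methods the paper evaluates, and the weight values are hypothetical. It shows where the "noise" comes from: every weight is snapped to a grid, and fewer bits means a coarser grid and a larger rounding error.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest, symmetric, per-tensor quantization: a simple PTQ baseline."""
    qmax = 2 ** (bits - 1) - 1  # largest signed level, e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float step between adjacent levels
    # Map each float weight to its nearest integer level, clamped to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights; the gap to the originals is the rounding error."""
    return [qi * scale for qi in q]


# Hypothetical weights, quantized at 8 and 4 bits.
weights = [0.12, -0.5, 0.33, 0.01]
for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    approx = dequantize(q, scale)
    max_err = max(abs(w - a) for w, a in zip(weights, approx))
    print(bits, q, round(max_err, 4))
```

Each weight's error is bounded by half a grid step (`scale / 2`), so dropping from 8 to 4 bits widens that bound by roughly 18x here (127 levels per side versus 7).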

The authors found that strong quantization does more than make models less capable: in many failures, the model had already reached the right answer but then second-guessed itself.

Their core idea is that quantization adds noise exactly where the next-word choice is already uncertain, so the model becomes more likely to pick words like "wait", "but", or "alternatively" that reopen the problem.
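The intuition can be illustrated with a toy simulation, under a loud assumption: quantization error is modeled here as independent Gaussian noise added to each next-token logit. That is not how the paper measures it. The logits, the `"</think>"` end-of-reasoning token, and the noise level are all hypothetical. When stopping clearly wins, small noise almost never flips greedy decoding to "wait". When stopping and "wait" are nearly tied, the same noise flips it often.

```python
import random


def flip_rate(logits, target, noise_std, trials=2000, seed=0):
    """Fraction of trials where greedy decoding on noise-perturbed logits picks `target`
    even though the clean (unperturbed) argmax is a different token."""
    clean_choice = max(logits, key=logits.get)
    if target == clean_choice:
        return 0.0
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    flips = 0
    for _ in range(trials):
        # Assumption: quantization error modeled as i.i.d. Gaussian noise on each logit.
        noisy = {tok: v + rng.gauss(0.0, noise_std) for tok, v in logits.items()}
        if max(noisy, key=noisy.get) == target:
            flips += 1
    return flips / trials


# Hypothetical next-token logits after a correct answer has been written.
# "</think>" stands in for an end-of-reasoning token; both dicts are illustrative.
confident = {"</think>": 6.0, "wait": 1.0, "but": 0.5}
uncertain = {"</think>": 2.1, "wait": 2.0, "but": 0.5}

print("confident:", flip_rate(confident, "wait", noise_std=0.3))
print("uncertain:", flip_rate(uncertain, "wait", noise_std=0.3))
```

With a 0.1 logit gap the flip rate lands around 40%, while a 5.0 gap gives essentially zero. The noise matters mostly at near-ties, which matches the post's claim that noise hurts "at uncertain word choices".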

They tested this across math, coding, and science tasks using 5 reasoning models, several quantization methods, and model sizes from 1.5B to 32B.

The main result is that aggressive quantization raised overthinking failures to as much as 52%, while a small penalty on 50 hesitation words cut reasoning length by 12% to 23% and often kept or improved accuracy.
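A decoding-time penalty of this kind could look roughly like the sketch below. It assumes the penalty is a fixed amount subtracted from the logits of hesitation tokens before sampling. The post does not specify the exact form, and the word list, `penalty=1.0`, and logit values are hypothetical placeholders, not the paper's 50-word list or settings.

```python
import math

# Hypothetical stand-in for the paper's 50-word hesitation list (not reproduced here).
HESITATION_WORDS = {"wait", "but", "alternatively", "hmm"}


def penalize_hesitation(logits, penalty=1.0, hesitation=HESITATION_WORDS):
    """Subtract a fixed penalty from the logits of hesitation tokens; others are untouched."""
    return {tok: (v - penalty if tok in hesitation else v) for tok, v in logits.items()}


def softmax(logits):
    """Numerically stable softmax over a {token: logit} dict."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Illustrative logits at a point where the model could either stop or reopen the problem.
logits = {"</think>": 2.1, "wait": 2.0, "but": 0.5, "So": 1.0}
before = softmax(logits)
after = softmax(penalize_hesitation(logits, penalty=1.0))
for tok in logits:
    print(f"{tok:>9}: {before[tok]:.3f} -> {after[tok]:.3f}")
```

Because the penalty is small, it mainly tips near-ties like this one toward stopping, and it leaves confident steps largely unchanged. In a real decoder the penalty would apply to token IDs (a word can span several tokens), for example inside a custom Hugging Face `LogitsProcessor`.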

Because compressed models are widely used to save memory and cost, it matters that a very small decoding fix can stop many of them from wasting tokens and abandoning answers they had already reached.

----

Link - arxiv.org/abs/2606.00206

Title: "Quantized Reasoning Models Think They Need to Think Longer, but They Do Not"

Meta · Reasoning · Papers/Research
