Self-Evaluation Is Already There: Eliciting Base LLMs' Latent Judge-Calibration Ability with Very Little Data

2026-06-03 08:00 · 30 days ago

AI Summary

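The study finds that a base large language model, without any targeted training and given only a few-shot prompt, can predict an external judge's multi-attribute quality scores well above chance. The Self-Evaluation Elicitation (SEE) method elicits this ability in two phases: calibration-coupled reinforcement learning first improves the answer and predicts the judge, then masked distillation refines the prediction without changing the answer. Using only 160 examples (roughly 31x fewer than a reinforcement learning baseline), SEE improves held-out calibration on three benchmarks while preserving answer quality. The elicited self-evaluation is concentrated within the model's own token distribution and remains stable across judges it was never trained against, suggesting that it captures a transferable notion of quality rather than a single judge's preference.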
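To make "predicting the judge above chance" concrete, here is a minimal sketch of one way such a comparison could be run. The summary does not name the paper's calibration metric, so mean absolute error and the shuffled-pairing baseline are assumptions chosen for illustration, and the scores are invented.

```python
import random
from statistics import mean


def mean_abs_error(predicted, judged):
    """Average absolute gap between self-predicted and judge-assigned scores."""
    return mean(abs(p - j) for p, j in zip(predicted, judged))


def chance_error(predicted, judged, trials=1000, seed=0):
    """Average error when predictions are randomly re-paired with responses.

    Shuffling breaks the link between a prediction and the response it was
    made for, which gives a simple 'chance' reference point.
    """
    rng = random.Random(seed)
    shuffled = list(predicted)
    errors = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        errors.append(mean_abs_error(shuffled, judged))
    return mean(errors)


# Hypothetical scores for a single quality attribute (illustrative data only).
predicted = [7, 3, 9, 5, 2, 8]  # the model's guess at the judge's score
judged = [8, 3, 9, 4, 1, 7]     # the external judge's actual score

model_err = mean_abs_error(predicted, judged)
baseline_err = chance_error(predicted, judged)
print(f"model error: {model_err:.2f}, shuffled baseline: {baseline_err:.2f}")
```

A model error well below the shuffled baseline means the predictions track the judge on a per-response basis, rather than merely matching the overall distribution of scores.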

Original abstract (untranslated)

Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.
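The abstract does not spell out the masked distillation objective. The sketch below illustrates only the general idea of masking: a loss computed over the score-prediction tokens alone, so the answer tokens contribute no training signal. The plain negative log-likelihood, the function name `masked_nll`, and the toy probabilities are all assumptions standing in for whatever the paper actually uses.

```python
import math


def masked_nll(token_log_probs, is_prediction_token):
    """Average negative log-likelihood over the masked-in tokens only.

    token_log_probs: log-probability the model assigns to each target token.
    is_prediction_token: True for tokens in the score-prediction span,
        False for tokens that belong to the answer.
    """
    kept = [-lp for lp, keep in zip(token_log_probs, is_prediction_token) if keep]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Hypothetical sequence: two answer tokens followed by two prediction tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]

loss = masked_nll(log_probs, mask)

# Changing the answer-token probabilities leaves the loss unchanged,
# because those positions are masked out.
altered = [math.log(0.1), math.log(0.2)] + log_probs[2:]
loss_altered = masked_nll(altered, mask)
print(loss, loss_altered)
```

Because masked-out positions never enter the sum, an update driven by this loss is aimed at the prediction span, which is one simple way to read "leaving the answer untouched."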

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
