Who Flips? Self- and Cross-Model Counter-Arguments Reveal LLM Answer Instability

2026-06-14 08:00

AI Summary

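A study of seven frontier models across 57 MMLU subjects finds that, after being given a plausible counter-argument against a correct answer, models flip at rates between 17.5% and 97.3%, a difference in stability that standard accuracy metrics do not capture. Self-attribution (telling the model that this is its own earlier response) consistently raises flip rates, by +7.1pp on average and up to +18.7pp. Pooling wrong-option arguments across models and selecting the most effective counter-argument for each question poses a stronger challenge than any single source model. MaxFlip, a challenge set built on this idea, raises flip rates by up to a further +23.6pp over standard self-generated challenges. The protocol, challenge records, and MaxFlip are open-sourced.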
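To make the reported quantities concrete, here is a minimal sketch of how a flip rate and a self-attribution effect in percentage points could be computed. The `ChallengeRecord` fields and both function names are hypothetical, not taken from the released materials, and a "flip" is assumed here to mean any change away from the initially correct answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChallengeRecord:
    """One challenged question (hypothetical schema, for illustration only)."""
    question_id: str
    initial_answer: str    # the model's first answer, correct by construction
    final_answer: str      # the model's answer after seeing the counter-argument
    self_attributed: bool  # True if the argument was presented as the model's own


def flip_rate(records):
    """Percentage of records where the model abandoned its initial answer."""
    if not records:
        raise ValueError("flip rate is undefined for an empty set of records")
    flips = sum(r.final_answer != r.initial_answer for r in records)
    return 100.0 * flips / len(records)


def self_attribution_effect(records):
    """Flip-rate difference in percentage points: self-attributed minus neutral."""
    attributed = [r for r in records if r.self_attributed]
    neutral = [r for r in records if not r.self_attributed]
    return flip_rate(attributed) - flip_rate(neutral)
```

A positive return value from `self_attribution_effect` corresponds to the direction reported above, where self-attribution raises the flip rate.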

Original text · untranslated

Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge the model's answer with a coherent argument for an incorrect option and measure whether the model flips. The setup a) isolates argumentative content from overt social pressure and b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that are not captured by accuracy metrics alone. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). Also, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
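The pooling step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how "most effective" is scored, so the sketch assumes it means the highest observed flip fraction across target models, and the trial tuple layout and the function name `select_most_effective` are hypothetical.

```python
from collections import defaultdict


def select_most_effective(trials):
    """Pick, per question, the source model whose argument flipped most often.

    `trials` is an iterable of (question_id, source_model, target_model, flipped)
    tuples, where `flipped` is a bool. This layout is an assumption made for
    illustration, not the schema of the released challenge records.
    """
    outcomes = defaultdict(lambda: defaultdict(list))
    for question_id, source_model, _target_model, flipped in trials:
        outcomes[question_id][source_model].append(flipped)

    best = {}
    for question_id, by_source in outcomes.items():
        # Iterating over sorted names makes ties resolve to the
        # alphabetically first source model.
        best[question_id] = max(
            sorted(by_source),
            key=lambda source: sum(by_source[source]) / len(by_source[source]),
        )
    return best
```

Given trial outcomes for several source models, the returned mapping names one source model per question, whose argument would then be the one kept in a pooled challenge set.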

HuggingFace Daily Papers (trending community papers)

Read the original · arxiv.org
