# Faithfulness Metrics Do Not Measure Faithfulness: A Ground-Truth-Based Meta-Evaluation

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-05-24 08:00
- AIHOT score: 61
- AIHOT link: https://aihot.virxact.com/items/cmpm6q01u0m22sl0122xnqdcy
- Original link: https://arxiv.org/abs/2605.25052

## AI Summary

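To address the difficulty of evaluating whether the chain-of-thought (CoT) reasoning of large language models faithfully reflects their internal computation, the study builds BonaFide, a benchmark of 3,066 labeled CoTs covering 13 tasks and 10 models. The first systematic evaluation of prominent faithfulness metrics finds that most perform near chance, exhibit prediction biases, and degrade on longer CoTs. The best metric reaches an AUROC of only 0.70 at the CoT level, while another reaches 0.59 at the step level; neither transfers across settings, and both are computationally expensive. The study exposes fundamental flaws in current faithfulness evaluation.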

## Main Text

Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they indeed measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain since internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies like plausibility or importance, properties orthogonal to faithfulness that can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases, and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level, while another reaches 0.59 at the step level; neither transfers across settings, and both entail prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.
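
To make the reported AUROC figures concrete, the following is a minimal sketch of how a faithfulness metric could be scored against ground-truth labels. It is not the paper's evaluation code: the function name `auroc`, the label convention (1 = faithful, 0 = unfaithful), and the toy scores are all assumptions made for illustration. AUROC here is computed as the probability that a randomly chosen faithful CoT receives a higher metric score than a randomly chosen unfaithful one, with ties counted as one half, so 0.5 corresponds to chance.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via pairwise comparison (hypothetical helper, not from the paper).

    labels: 1 for a faithful CoT, 0 for an unfaithful one (assumed convention).
    scores: the metric's output, where higher is meant to indicate more faithful.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # faithful example ranked above unfaithful one
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data, invented purely for illustration.
labels = [1, 1, 1, 0, 0, 0]
informative_metric = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]  # mostly ranks faithful CoTs higher
constant_metric = [0.5] * 6                          # carries no information

print(auroc(labels, informative_metric))
print(auroc(labels, constant_metric))
```

A metric that assigns the same score to every CoT lands at exactly 0.5 under this definition, which is the baseline against which the abstract's 0.70 (CoT level) and 0.59 (step level) should be read. The pairwise loop is quadratic in the number of examples; a rank-based implementation would be preferable for larger datasets.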
