虚假的推理:通过零思维链截断揭露大语言模型中的规避型数据污染
阅读原文· arxiv.org大语言模型在多项任务中展现强大推理能力,但数据污染问题,特别是发布者采用改写基准数据等规避策略,严重削弱了其评估的客观性。研究发现,模型生成的推理步骤会主动掩盖其底层的记忆化现象。为此,研究者提出 Zero-CoT Probe 检测方法,通过截断整个 CoT 过程来暴露潜在的捷径映射。该方法将模型在原始基准与同构扰动参考集上的零思维链表现进行对比,并引入“污染置信度”指标。在已知污染模型与专门微调的污染模型上的实验表明,该方法能有效检测直接与规避型数据污染。代码已开源:https://github.com/Yifan-Lan/zero-cot-probe。
Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. This problem is further exacerbated by malicious model publishers who use evasive, or indirect, contamination strategies, such as paraphrasing benchmark data to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to reliably detect such stealthy contamination. In this work, we uncover a critical phenomenon: a model's generated reasoning steps actively mask its underlying memorization. Inspired by this, we propose the Zero-CoT Probe (ZCP), a novel black-box detection method that deliberately truncates the entire Chain-of-Thought (CoT) process to expose latent shortcut mappings. To further isolate memorization from the model's intrinsic problem-solving capabilities, ZCP compares the model's zero-CoT performance on the original benchmark against an isomorphically perturbed reference dataset. Furthermore, we introduce Contamination Confidence, a metric that quantifies both the likelihood and severity of contamination, moving beyond simple binary classifications. Extensive experiments on both previously identified contaminated models and specially fine-tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is accessible at https://github.com/Yifan-Lan/zero-cot-probe.