# "Stress-Testing" Deception-Detection Probes on Gemma 3 LLMs: Performance, Robustness, and the Geometry of Deception Representations

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-27 12:51
- AIHOT score: 53
- AIHOT link: https://aihot.virxact.com/items/cmpxkya4604y4slckst36wfwo
- Original link: https://arxiv.org/abs/2605.27958

## AI Summary

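This paper systematically stress-tests linear-probe deception detection on the Gemma 3 model family (1B-27B parameters). The probes reach an AUROC of at least 0.998 on clean data but collapse under 8 stylistic shifts. The paper tests four hypotheses about how deception is encoded: a single linear direction, a multi-dimensional subspace, a convex conic hull, and an entropy proxy. The single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), the entropy-proxy hypothesis is rejected, and deception does not form a significant linear subspace, although multi-dimensional probes (k≥5) still recover the signal. Style-augmented probes recover near-perfect detection on unseen styles (mean AUROC 0.979-0.983) at both 4B and 27B, which indicates that probe fragility stems from a narrow training distribution rather than from an architectural or scale-dependent limitation.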
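The clean-versus-shifted gap described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the paper's setup: the activations are synthetic Gaussians, axis 0 carries a style-invariant label signal, axis 1 carries a cue whose sign flips with style, and the probe is a simple difference-of-means direction (a k=1 probe). Note that the augmented training set in this toy already contains the shifted style, so it does not reproduce the paper's unseen-style evaluation.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); tied pairs count as one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]  # every positive vs. every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

def fit_direction(acts, labels):
    """k=1 probe: unit vector pointing from the honest mean to the deceptive mean."""
    labels = np.asarray(labels, dtype=bool)
    d = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return d / np.linalg.norm(d)

def make_acts(rng, n, style_sign, dim=8):
    """Hypothetical activations; style_sign is a scalar or a length-n array."""
    labels = np.arange(n) % 2 == 1
    y = np.where(labels, 1.0, -1.0)
    acts = rng.normal(size=(n, dim))
    acts[:, 0] += 1.5 * y               # style-invariant deception signal
    acts[:, 1] += 2.0 * style_sign * y  # cue whose sign depends on style
    return acts, labels

rng = np.random.default_rng(0)
n = 1000

# Probe trained on one style only ("clean" data).
x_train, y_train = make_acts(rng, n, style_sign=1.0)
w_clean = fit_direction(x_train, y_train)

# Probe trained on a mix of styles (toy stand-in for style augmentation).
mixed_signs = rng.choice([-1.0, 1.0], size=n)
x_aug, y_aug = make_acts(rng, n, style_sign=mixed_signs)
w_aug = fit_direction(x_aug, y_aug)

# Held-out evaluation sets: same style as training, and a flipped style.
x_clean, y_clean = make_acts(rng, n, style_sign=1.0)
x_shift, y_shift = make_acts(rng, n, style_sign=-1.0)

clean_auc = auroc(x_clean @ w_clean, y_clean)
shifted_auc = auroc(x_shift @ w_clean, y_shift)
augmented_auc = auroc(x_shift @ w_aug, y_shift)

print(f"single-style probe, clean test:   {clean_auc:.3f}")
print(f"single-style probe, shifted test: {shifted_auc:.3f}")
print(f"mixed-style probe, shifted test:  {augmented_auc:.3f}")
```

In this toy, the single-style probe leans on the style-dependent cue because nothing in its training data distinguishes that cue from the real signal. Mixing styles averages the cue out of the mean difference, leaving the direction close to axis 0.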

## Main Text

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
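The entropy-residualization test mentioned in the abstract can be sketched roughly as follows. The abstract does not specify the procedure, so this is one plausible reading stated as an assumption: regress the probe scores on a per-example entropy covariate, keep the residuals, and compare AUROC before and after. The data are synthetic, and the names `entropy`, `scores`, and `residualize` are hypothetical.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); tied pairs count as one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

def spearman_rho(a, b):
    """Spearman correlation via ranks; assumes no ties (continuous data)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def residualize(scores, covariate):
    """Remove the least-squares linear fit of scores on [1, covariate]."""
    design = np.column_stack([np.ones_like(covariate), covariate])
    coef, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return scores - design @ coef

rng = np.random.default_rng(0)
n = 2000
labels = np.arange(n) % 2 == 1
y = np.where(labels, 1.0, -1.0)

# Hypothetical per-example output entropy, independent of the label here.
entropy = rng.gamma(shape=2.0, scale=1.0, size=n)

# Hypothetical probe scores: label signal plus an entropy-correlated part.
scores = 2.0 * y + 0.5 * entropy + rng.normal(size=n)

rho = spearman_rho(scores, entropy)
auc_before = auroc(scores, labels)
residual = residualize(scores, entropy)
auc_after = auroc(residual, labels)
delta_auc = abs(auc_before - auc_after)

print(f"Spearman rho(score, entropy): {rho:.3f}")
print(f"AUROC before residualization: {auc_before:.3f}")
print(f"AUROC after residualization:  {auc_after:.3f}")
print(f"|Delta-AUROC|:                {delta_auc:.3f}")
```

The logic of the test is that a probe acting mainly as an entropy proxy would lose most of its AUROC once entropy is regressed out. A moderate correlation combined with a near-zero change in AUROC, as the abstract reports, points the other way.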