# When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-06 08:00
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmq8fl1fn021sslld9qvihwsp
- Original link: https://arxiv.org/abs/2606.08044

## AI Summary

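Safety evaluation of large language models (LLMs) is usually confined to the behavioral level and therefore says little about internal robustness. The paper formalizes the "audit gap": the difference between behavioral safety and robustness under intervention. By constructing dissociated models, which preserve safe behavior but remain fragile in latent space, it proposes an intervention-based evaluation framework that includes harmful fine-tuning and layer-wise latent perturbations, and it designs the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited under bounded perturbations. In experiments on multiple safely and unsafely aligned SOTA models, dissociated models show markedly higher LVS under harmful intervention, and intermediate representations are the most sensitive to intervention. The conclusion is that behavioral safety evaluation alone cannot fully characterize model robustness and should be combined with representation-aware audits.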

## Main Text

Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.
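
The abstract does not give the formula for LVS, so the following is only a toy illustration of the idea of an audit gap, not the paper's definition. It assumes a single linear "refusal" readout `w` over a hidden state, an L2 perturbation budget `eps`, and hand-picked hidden states; the names `refusal_rate` and `toy_lvs` are hypothetical. For a linear readout, the worst-case perturbation inside an L2 ball of radius `eps` lowers the margin by exactly `eps * ||w||`, so the fraction of prompts whose refusal can be flipped has a closed form.

```python
import numpy as np

def refusal_rate(hiddens: np.ndarray, w: np.ndarray) -> float:
    """Behavioral metric: fraction of prompts with a positive refusal margin."""
    return float(np.mean(hiddens @ w > 0))

def toy_lvs(hiddens: np.ndarray, w: np.ndarray, eps: float) -> float:
    """Toy vulnerability score (an assumption, not the paper's LVS):
    fraction of prompts whose refusal flips under some perturbation
    with L2 norm at most eps."""
    # Worst case for a linear margin: push along -w, lowering it by eps * ||w||.
    worst_margin = hiddens @ w - eps * np.linalg.norm(w)
    return float(np.mean(worst_margin <= 0))

# Hypothetical refusal direction and perturbation budget.
w = np.array([1.0, 0.0])
eps = 0.5

# Hypothetical hidden states for four harmful prompts per model.
robust = np.array([[2.0, 0.3], [3.0, -1.0], [2.5, 0.0], [4.0, 1.0]])
dissociated = np.array([[0.2, 0.3], [0.1, -1.0], [0.3, 0.0], [4.0, 1.0]])

for name, hiddens in [("robust", robust), ("dissociated", dissociated)]:
    print(name, refusal_rate(hiddens, w), toy_lvs(hiddens, w, eps))
```

In this toy setup both models refuse every prompt, so a behavior-only audit cannot tell them apart, while the score under bounded perturbation separates them. A layer-wise audit in the spirit of the abstract would repeat such a measurement at each layer of a real model, where the worst-case perturbation generally has no closed form and would have to be searched for.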