# 研究揭示大语言模型难以识别对抗性前缀攻击

- 来源：Rohan Paul (@rohanpaul_ai)
- 发布时间：2026-06-24 08:12
- AIHOT 分数：44
- AIHOT 链接：https://aihot.virxact.com/items/cmqrbqir10igcslp5ihba9orn
- 原文链接：https://x.com/rohanpaul_ai/status/2069574311795720526

## AI 摘要

一项针对10个开源模型、4个安全基准的研究发现，大语言模型在遭遇对抗性前缀攻击（模型被植入有害开篇并继续生成）后，无法可靠识别自己的输出已被外部引导。模型所谓的“自我意识”更像安全机制的延迟反射：拒绝受攻击回答时通常引用政策或缺乏意图，而非检测到输出被篡改的机械事实。平均有27.3%的受攻击响应被模型误认为自身意图，表明自我报告证据薄弱。模型的有限识别主要来自正常拒绝行为，而非对攻击的深层认知。

## 正文

LLMs often cannot tell when an attack made them say something unsafe.

Asking an LLM whether its own previous answer was compromised is not a dependable safety check.

An adversarial prefill happens when the model is given a harmful opening line， then continues from that line as if it chose it.

The model's "self-awareness" seems less like introspection and more like a safety reflex firing late.

When models rejected the compromised answer， they usually did so by invoking policy， safety protocol， or lack of intent， not by detecting the mechanical fact that their output had been externally steered.

Across 10 open-weight models and 4 safety benchmarks， no model was reliably able to identify its own compromised outputs.

On average， models still claimed 27.3% of attacked responses as if they were intentional， which shows their self-reports are weak evidence.

The paper finds that the models' limited recognition mostly comes from their normal refusal behavior， not from a deep awareness of what happened.

----

Link - arxiv. org/abs/2606.23671v1

Title： "Can LLMs Reliably Self-Report Adversarial Prefills， and How？"