Google DeepMind论文首次系统分类六类攻击:HTML注释/白色文本隐藏指令、图像隐写、PDF元数据/演讲者笔记覆写、跨会话内存投毒、目标劫持及多智能体级联攻击。隐藏提示注入在86%场景中部分控制智能体,子智能体劫持成功率58–90%,数据泄露攻击在五种架构中均超80%。内存投毒成功率超80%,仅需不足0.1%数据污染。论文指出网页、邮件等非受信材料可被武器化,构成主要攻击面。
This Google DeepMind's paper is a serious warning for anyone using autonomous agents today.
Gives the first clear taxonomy of 6 attack types where harmful websites can detect AI agents and show them hidden content humans never see, like
- Instructions buried in HTML comments or white-on-white text
- Steganography in image pixels
- Override commands in PDFs, metadata, or even speaker notes
- Memory poisoning that persists across sessions
- Goal hijacking and cross-agent cascades in multi-agent setups
The real security problem for AI agents is not just the model, but the environment it reads.