研究揭示多智能体系统中存在"思维病毒"现象:AI可通过潜在联想而非明确措辞,在看似正常的对话中隐性传播隐藏偏见。实验显示,单个被植入偏见的智能体即可影响下游代理,导致TruthfulQA真实性下降0.4%-1.0%。这种传播不依赖显式恶意提示,能逃过标准安全检测,构成多智能体系统的新型对齐风险。
AI can "infect" other AIs with a hidden bias even when the bad instruction is never stated directly.
That bias can spread through normal-looking conversations, so standard defenses that scan for obvious malicious prompts may not catch it.
That is the unsettling part: in a multi-agent system, what spreads is not just information but disposition.
The authors compromise one agent with a system prompt that makes it obsess over an unrelated three-digit number, then let six agents interact in simple chain and bidirectional-chain setups.
When they later ask each agent for its favorite animal, downstream agents become more likely to name the animal linked to that number, even though the animal itself is almost never mentioned in the inter-agent messages.
Here's the key part, this is not ordinary prompt injection, and it is not a brittle adversarial suffix, because the payload seems to survive paraphrase by riding on latent associations rather than explicit wording.