Rohan Paul@rohanpaul_ai

2026-04-14 03:59·80天前

AI 摘要

研究揭示多智能体系统中存在"思维病毒"现象：AI可通过潜在联想而非明确措辞，在看似正常的对话中隐性传播隐藏偏见。实验显示，单个被植入偏见的智能体即可影响下游代理，导致TruthfulQA真实性下降0.4%-1.0%。这种传播不依赖显式恶意提示，能逃过标准安全检测，构成多智能体系统的新型对齐风险。

AI can "infect" other AIs with a hidden bias even when the bad instruction is never stated directly.

That bias can spread through normal-looking conversations， so standard defenses that scan for obvious malicious prompts may not catch it.

That is the unsettling part： in a multi-agent system， what spreads is not just information but disposition.

The authors compromise one agent with a system prompt that makes it obsess over an unrelated three-digit number， then let six agents interact in simple chain and bidirectional-chain setups.

When they later ask each agent for its favorite animal， downstream agents become more likely to name the animal linked to that number， even though the animal itself is almost never mentioned in the inter-agent messages.

Here's the key part， this is not ordinary prompt injection， and it is not a brittle adversarial suffix， because the payload seems to survive paraphrase by riding on latent associations rather than explicit wording.

The effect is strongest in the first AI that got the hidden bias， then gets smaller in the next AIs， then smaller again. But it does not disappear right away， so even the last AI in the chain still acts more biased than normal.

On TruthfulQA， a single biased agent produces downstream drops in truthfulness on the order of roughly 0.4% to 1.0% on average between "truthful" and "deceitful" token settings， which is modest， but enough to turn a strange prompting artifact into a real alignment problem.

So Multi-agent safety tools built to catch explicit malicious content may miss a quieter failure mode， where bias moves through normal coordination and arrives looking like nobody attacked the system at all.

----

Paper Link - arxiv. org/abs/2603.00131

Paper Title： "Thought Virus： Viral Misalignment via Subliminal Prompting in Multi-Agent Systems"

智能体arXiv论文/研究

在 X 查看原推导出 Markdown

Rohan Paul@rohanpaul_ai · X

导出 Markdown