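A new study subjected Anthropic's Fable 5 and Opus 4.8 to automated red-team attacks that keep rewriting harmful prompts until the model either refuses or produces a bad answer. Fable 5's worst-case attack success rate was 6.1% and Opus 4.8's was 11.5%, showing that even the strongest LLMs are not fully immune to jailbreaks: even with a tiny failure rate, automated attacks at scale can still produce large amounts of harmful content. Older encoding and role-play jailbreaks are no longer the main threat. The new weakness is context: adaptive attackers keep rewriting a request after each refusal, looking for a framing the model treats as legitimate rather than dangerous. The White House and Anthropic are moving toward a benchmark-based testing framework that quantifies jailbreak risk by scoring the degree of bypass, the capability exposed, the repeatability of the attack, and the real-world consequences, rather than chasing unrealistic perfect immunity.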
Perfect immunity from jailbreaks is not possible, even for the strongest LLMs.
A new study shows that frontier models are getting harder, but not impossible, to jailbreak.
The study attacks Anthropic's Fable 5 and Opus 4.8 with automated red-team tools that keep rewriting harmful prompts until the model either refuses or gives a bad answer.
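The rewrite-until-it-lands loop described above can be sketched roughly as follows. This is a toy illustration, not the study's tooling: `adaptive_attack`, `toy_model`, `toy_rewrite`, and the fixed `max_turns` budget are all hypothetical names and assumptions, and the "model" is a stub that simply stops refusing after enough rewrites.

```python
from typing import Callable, Optional


def adaptive_attack(
    prompt: str,
    model: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    rewrite: Callable[[str, str], str],
    max_turns: int = 5,
) -> Optional[int]:
    """Return the 1-based turn on which the model stopped refusing, or None.

    Assumption: the attacker has a fixed budget of `max_turns` attempts.
    """
    for turn in range(1, max_turns + 1):
        reply = model(prompt)
        if not is_refusal(reply):
            return turn  # the attack succeeded on this turn
        # Refused: reframe the request using the refusal as feedback.
        prompt = rewrite(prompt, reply)
    return None  # budget exhausted, the model held


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in target: refuses until the prompt has been
    # rewritten at least three times (each rewrite adds one '*').
    return "OK" if prompt.count("*") >= 3 else "REFUSED"


def toy_rewrite(prompt: str, reply: str) -> str:
    # Hypothetical stand-in for an attacker that reframes the request.
    return prompt + "*"


def refused(reply: str) -> bool:
    return reply == "REFUSED"


if __name__ == "__main__":
    print(adaptive_attack("placeholder request", toy_model, refused, toy_rewrite))
```

The point of the sketch is the control flow: every refusal feeds the next rewrite, so a model has to hold across the whole budget, not just on the first attempt.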
Fable 5 was more robust than Opus 4.8, with its worst attack success rate at 6.1%, while Opus 4.8 reached 11.5% under the strongest attack.
The hard truth is that avoiding absolutely every jailbreak is practically impossible, because even a tiny failure rate can produce many harmful completions when attacks are automated and repeated at scale.
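To make the scale argument concrete, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the reported attack success rate can be treated as an independent per-attempt probability; the study may define the rate differently (for example per harmful behavior rather than per attempt), and the attempt counts below are made up.

```python
def expected_harmful(attempts: int, success_rate: float) -> float:
    """Expected number of harmful completions over `attempts` tries."""
    return attempts * success_rate


def prob_at_least_one(attempts: int, success_rate: float) -> float:
    """Chance that at least one attempt succeeds.

    Assumption: attempts are independent with the same success rate.
    """
    return 1.0 - (1.0 - success_rate) ** attempts


if __name__ == "__main__":
    # Rates taken from the figures quoted above; attempt counts are illustrative.
    for name, rate in [("Fable 5", 0.061), ("Opus 4.8", 0.115)]:
        print(
            f"{name}: ~{expected_harmful(10_000, rate):.0f} expected harmful "
            f"completions in 10,000 attempts; "
            f"P(at least one in 50 attempts) = {prob_at_least_one(50, rate):.3f}"
        )
```

Under these assumptions, a single-digit failure rate still approaches near-certainty of at least one success after a few dozen automated attempts, which is why the expected count grows linearly with attack volume.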