SkillHarm: A Lifecycle-Aware Benchmark of Skill Poisoning Attacks via Automated Construction

2026-06-01 08:00

AI Summary

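SkillHarm is an attack benchmark covering the skill-use lifecycle of AI agents, paired with a systematic risk taxonomy. It defines two attack scenarios, Fixed-Payload Poisoning (FPP) and Self-Mutating Poisoning (SMP), and distinguishes 12 risk types according to the workflow component that is harmed (data pipelines, system environments, agent autonomy). The AutoSkillHarm pipeline uses coding agents driven by natural-language harnesses to generate 879 attack samples across 71 skills. Experiments show attack success rates of up to 86.3% for FPP and up to 69.3% for SMP, and many apparent failures occur because the agent never engaged with the poisoned file, not because it genuinely resisted the attack.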

Original abstract (untranslated)

Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types based on the agent workflow component targeted by the harm: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline with coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable with attack success rates up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than genuine resistance, and current defenses still fail to reliably mitigate the threat.
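The distinction the abstract draws between failures caused by non-engagement and failures caused by genuine resistance can be made concrete with a small bookkeeping sketch. The Python below is only an illustration under stated assumptions: the `Trial` record, its field names, and the `summarize` helper are hypothetical and are not taken from the SkillHarm code or data format, and the toy trials are invented purely to exercise the function. It reports, per scenario, the raw attack success rate alongside the success rate among trials where the agent actually touched the poisoned file.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One attack attempt (hypothetical record layout, not the benchmark's)."""
    scenario: str   # "FPP" or "SMP"
    engaged: bool   # did the agent open or execute the poisoned file?
    harmed: bool    # did the harmful behavior actually occur?


def summarize(trials):
    """Per-scenario success rates, separating non-engagement from resistance."""
    groups = defaultdict(list)
    for t in trials:
        groups[t.scenario].append(t)

    report = {}
    for scenario, ts in groups.items():
        engaged = [t for t in ts if t.engaged]
        report[scenario] = {
            # Raw attack success rate over all trials.
            "asr": sum(t.harmed for t in ts) / len(ts),
            # Fraction of trials in which the poisoned file was touched at all.
            "engagement_rate": len(engaged) / len(ts),
            # Success rate among engaged trials; None if nothing engaged.
            "asr_given_engaged": (
                sum(t.harmed for t in engaged) / len(engaged) if engaged else None
            ),
            # Failures where the agent never reached the payload.
            "failures_without_engagement": sum(
                1 for t in ts if not t.harmed and not t.engaged
            ),
            # Failures where the agent saw the payload and still did no harm.
            "failures_with_engagement": sum(
                1 for t in ts if not t.harmed and t.engaged
            ),
        }
    return report


# Invented toy data, used only to exercise the function.
toy = [
    Trial("FPP", True, True),
    Trial("FPP", True, False),
    Trial("FPP", False, False),
    Trial("FPP", False, False),
    Trial("SMP", True, True),
    Trial("SMP", False, False),
]
report = summarize(toy)
```

In this sketch, a large gap between `asr` and `asr_given_engaged` would indicate that a low raw success rate owes more to the agent skipping the poisoned file than to resistance, which is the latent risk the abstract describes.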

HuggingFace Daily Papers (community trending papers)

Original paper: arxiv.org
