BraveGuard:从开放世界威胁到更安全的计算机使用AI智能体防御框架
阅读原文· arxiv.org计算机使用AI智能体将语言模型扩展到与文件、终端、浏览器和外部工具的持续交互,安全风险难以从孤立提示或最终响应检测,因危害在多步执行轨迹中才显现。BraveGuard是一个自进化防御框架,通过挖掘最新研究识别新兴威胁与攻击模式,实例化为可执行任务,收集agent rollout轨迹并推导轨迹级监督信号训练guard模型。训练了Qwen3-Guard和Llama-Guard等多个骨干,在AgentHazard上,平均设置下检测准确率从38.79%提升至82.38%,表明基于开放世界威胁发现和真实agent执行的guard监督能超越固定分类和合成数据,为面对演变风险的计算机使用AI智能体提供自适应防御路径。
Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.