大语言模型智能体的冷启动安全性差距

2026-06-05 08:00·28天前

AI 摘要

工具调用 LLM 智能体在对话开始时安全风险最高，完成若干常规 agentic 任务后安全性显著提升，称为冷启动安全性差距。为系统研究此问题，提出基准 SODA（Safety Over Depth for Agents），可控制在安全威胁前最多 20 个前置任务。在 4 个模型族的 7 个模型上，前置任务从 0 增至 20 时安全提升 9–52%。表征分析显示模型隐藏状态逐渐移向安全对齐区域。常规任务本身是安全提升主因，agent 自身响应影响较小但有助于保持效用。在 AgentHarm、Agent Safety Bench 等安全基准及 BFCL、API-Bank 等效用基准上得到验证。建议部署前让 agent 完成少量常规任务以缓解该差距。

原文 · 未翻译

Are tool-calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a session and become substantially safer after a few regular agentic tasks -- a phenomenon we term the cold-start safety gap. To study this systematically, we introduce Safety Over Depth for Agents (SODA), a benchmark that controls how many regular agentic tasks the agent completes before encountering a safety threat, supporting up to 20 preceding tasks. Evaluating 7 models from 4 families, safety improves by 9--52% as the number of preceding regular agentic tasks increases from zero to twenty. Representation analysis confirms that model hidden states gradually shift toward a safety-aligned region as more preceding tasks are present. By systematically studying which part of the preceding conversation matters most, we find that the regular agentic tasks themselves are the primary driver of safety, while the agent's own prior responses have less effect on safety but are essential for preserving later utility. This conclusion is further supported by evaluation on open-source safety benchmarks (AgentHarm, Agent Safety Bench) and utility benchmarks (BFCL, API-Bank), confirming that warming up the agent with regular agentic tasks before deployment makes it safer and preserves full capability. Based on these findings, we recommend a simple deployment strategy: having the agent complete a few regular agentic tasks before possible exposure to safety-critical requests mitigates the cold-start safety gap. Our code is available at https://github.com/Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap

HuggingFace Daily Papers（社区热门论文）

64导出 Markdown

大语言模型智能体的冷启动安全性差距

2026-06-05 08:00·28天前

阅读原文· arxiv.org

AI 摘要

原文 · 保持原样，未翻译