The Decoder: AI News (RSS)

Google Deepmind publishes AI Control Roadmap: AI agents are treated as insider threats and granted permissions step by step based on behavior

2026-06-19 02:07 · Matthias Bastian

AI Summary

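Google Deepmind has published an "AI Control Roadmap" that treats its internal AI agents as potential insider threats. Borrowing the analogy of a driving instructor with dual controls, it grants permissions step by step based on verified behavior. The framework builds on the MITRE ATT&CK cybersecurity framework: trusted AI supervisor systems monitor the reasoning and actions of active agents, and a prevention system blocks harmful actions before damage occurs. In a test across one million coding tasks, most issues turned out to stem from agents following user instructions too zealously rather than from malicious intent. The monitoring already runs as a live monitor for the Gemini Spark agent and can catch accidental data deletion. Security measures are sorted into four detection levels (D1 through D4) and three response levels (R1 through R3). Deepmind also published a policy paper, "Three Layers of Agent Security," covering individual agents, multi-agent systems, and the broader ecosystem, and warns that the window for global safety standards is closing.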


Google Deepmind treats its own AI agents like rogue employees with office keys

Matthias Bastian

Jun 18, 2026

Nano Banana Pro prompted by THE DECODER

Key Points

Google Deepmind's new "AI Control Roadmap" doesn't blindly trust AI agents. Instead, the framework treats them as potential insider threats and grants permissions step by step based on verified behavior.
An internal analysis of one million coding tasks found that most flagged issues stem from overzealous agents, not malicious intent.
Deepmind warns the window for establishing global safety standards for AI agent systems is closing fast.

Rather than assuming alignment always works, the company built a safety framework that plans for the worst case: AI agents that go off script.

Deepmind's AI Control Roadmap details how Google locks down its most advanced internal AI systems. The company thinks it could work as a blueprint for the rest of the industry, too.

The framework assumes that a highly capable AI agent might not share its operators' goals and plans accordingly. Deepmind compares it to a driving instructor with dual controls: The instructor trusts the student but keeps a hand near the wheel and a foot near the brakes. Same idea here. AI agents only get permissions based on verified behavior, and trust builds gradually through controlled access.
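The "permissions based on verified behavior" idea can be illustrated with a small sketch. This is not Deepmind's implementation: the tier names, the `AgentTrust` class, the promotion threshold, and the rule that one flagged task resets trust are all assumptions made up for this example.

```python
from dataclasses import dataclass

# Hypothetical permission tiers, ordered from least to most privileged.
TIERS = ["read_only", "sandboxed_write", "repo_write", "deploy"]


@dataclass
class AgentTrust:
    """Tracks reviewed behavior for one agent (illustrative only)."""

    clean_reviews: int = 0   # clean reviews since the last tier change
    tier_index: int = 0      # current position in TIERS
    promote_after: int = 3   # assumed: clean reviews needed per promotion

    def record_review(self, clean: bool) -> None:
        """Update trust after a human or supervisor has reviewed a task."""
        if not clean:
            # Assumed policy: any flagged task drops the agent to the lowest tier.
            self.tier_index = 0
            self.clean_reviews = 0
            return
        self.clean_reviews += 1
        at_top = self.tier_index == len(TIERS) - 1
        if self.clean_reviews >= self.promote_after and not at_top:
            self.tier_index += 1
            self.clean_reviews = 0

    def allowed(self, permission: str) -> bool:
        """A permission is allowed if it is at or below the current tier."""
        return TIERS.index(permission) <= self.tier_index
```

A new agent starts at `read_only`; after three clean reviews it may use `sandboxed_write`, and a single flagged review sends it back to the start. A real system would likely use richer evidence than a simple counter.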

Deepmind models its AI agents as insider threats

Deepmind treats its internal AI agents like employees who already have office access but might work against the company's interests. The framework builds on the MITRE ATT&CK framework, a well-established cybersecurity tool that breaks potential attacks down into individual tactics and techniques. That lets Deepmind track risks systematically, spot suspicious behavior early, and test defenses in controlled exercises.

Several components work together. A new threat model for AI sits alongside a detection system where trusted AI systems act as "supervisors," watching the reasoning and actions of active agents. A prevention system can block harmful actions before damage occurs. Deepmind measures how well it all works by tracking how much traffic gets monitored, how much misconduct gets caught, and how fast the system responds.
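The three measurements named above (how much traffic is monitored, how much misconduct is caught, how fast the system responds) can be computed from labeled event logs roughly as follows. The `Event` record, its fields, and the choice of median latency are assumptions for illustration; the roadmap as reported here does not specify how Deepmind calculates them.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Event:
    """One agent action, as it might appear in a hypothetical audit log."""

    monitored: bool    # did a supervisor see this action?
    misconduct: bool   # ground-truth label, e.g. from a controlled exercise
    flagged: bool      # did the supervisor raise an alert?
    response_seconds: Optional[float] = None  # time to respond, if flagged


def control_metrics(events: list) -> dict:
    """Return coverage, catch rate, and median response time for a log."""
    total = len(events)
    coverage = sum(e.monitored for e in events) / total if total else 0.0

    bad = [e for e in events if e.misconduct]
    catch_rate = sum(e.flagged for e in bad) / len(bad) if bad else 0.0

    latencies = [
        e.response_seconds
        for e in events
        if e.flagged and e.response_seconds is not None
    ]
    return {
        "coverage": coverage,
        "catch_rate": catch_rate,
        "median_response_s": median(latencies) if latencies else None,
    }
```

Measuring the catch rate requires knowing which events really were misconduct, which is one reason controlled exercises with planted bad behavior are useful.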

This monitoring has an expiration date, though. It only works as long as the monitored models don't learn to game it.

Models could learn to hide their true intentions

That's why the roadmap ties security measures to measurable milestones in AI capabilities. Deepmind flags two areas worth watching.

The first is evasion. Right now, Deepmind monitors a model's visible chain of thought, but models could learn to conceal their reasoning. That might happen through "oversight awareness," where the model figures out it's being watched, or "opaque reasoning," where it thinks without producing visible text. If either becomes reality, Deepmind says behavioral analysis and inspections of the model's internals would need to kick in.

The second is the potential for damage. Low-risk, reversible actions can be reviewed after the fact by checking logs. But high-risk actions like major cyberattacks need real-time prevention that blocks them before they execute. The roadmap sorts security measures into four detection levels (D1 through D4) and three response levels (R1 through R3).
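The split between after-the-fact review and real-time prevention can be sketched as a simple routing rule. Everything named here (`Mode`, `choose_mode`, `run_action`, the `approve` callback) is hypothetical, and the sketch does not model the actual D1 through D4 or R1 through R3 levels. It also assumes, conservatively, that anything not both low-risk and reversible goes through the real-time gate; the article only describes the two extremes.

```python
from enum import Enum
from typing import Any, Callable, Optional


class Mode(Enum):
    ASYNC_REVIEW = "async_review"    # execute now, review the log later
    REALTIME_GATE = "realtime_gate"  # must be approved before executing


def choose_mode(high_risk: bool, reversible: bool) -> Mode:
    """Only low-risk, reversible actions skip the real-time gate (assumed rule)."""
    if not high_risk and reversible:
        return Mode.ASYNC_REVIEW
    return Mode.REALTIME_GATE


def run_action(
    name: str,
    action: Callable[[], Any],
    high_risk: bool,
    reversible: bool,
    approve: Callable[[str], bool],
    audit_log: list,
) -> Optional[Any]:
    """Run an action under the chosen mode and record the outcome."""
    mode = choose_mode(high_risk, reversible)
    if mode is Mode.REALTIME_GATE and not approve(name):
        audit_log.append((name, "blocked"))
        return None
    result = action()
    audit_log.append((name, "executed"))
    return result
```

With an `approve` callback that rejects everything, a low-risk reversible action still runs and is logged for later review, while a high-risk irreversible one is blocked before it executes.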

Figure: A diagram on a dark background titled "AI Control Roadmap." Two blue arrows pointing right represent growing AI capabilities: "Detection Evasion Capability" and "Attack Execution Capability." Above the arrows, four detection levels (D1 through D4) list measures like "CoT Monitoring," "Robust Monitors," "Action Monitoring," and "Activations Monitoring." Below the arrows, three response levels (R1 through R3) show defensive measures like "Async Alerts," "Access Control," and "Shutdown Capability."

Caption: Deepmind's "AI Control Roadmap" maps out escalation levels for monitoring (D1 through D4) and defense (R1 through R3) as AI systems grow more capable of evading detection and executing attacks. | Image: Google Deepmind

One million coding tasks show overzealousness, not malice

To train the system with real-world data, Deepmind's team built an internal prototype of an async monitoring system for coding agents and ran it across one million tasks. The results go beyond simple keyword filtering and pick up on behavior-based patterns. Deepmind says the system already runs as a live monitor for the Gemini Spark agent and can catch issues like accidental data deletion in real time.

Most flagged events aren't malicious. They're mostly cases of misinterpretation or overzealousness by agents chasing a user's goal too aggressively. That distinction matters as security protocols keep evolving.

Deepmind warns the window for global standards is closing fast

Deepmind also published a separate paper aimed at policymakers. "Three Layers of Agent Security" breaks down security measures for individual agents, multi-agent systems, and the broader ecosystem, covering everything from cyber defense to societal resilience.

In a post on X, Deepmind warns that there's a "narrow window" to lock in security protocols before multi-agent systems scale globally. The company argues that AI labs, governments, and researchers need to treat layered agent security as a shared priority.


Source: Google Deepmind

Tags: Agents, DeepMind, Safety/Alignment, Papers/Research
