Rather than assuming alignment always works, Google Deepmind built a safety framework that plans for the worst case: AI agents that go off script.

Deepmind's AI Control Roadmap details how Google locks down its most advanced internal AI systems. The company thinks it could work as a blueprint for the rest of the industry, too.

The framework assumes that a highly capable AI agent might not share its operators' goals and plans accordingly. Deepmind compares it to a driving instructor with dual controls: The instructor trusts the student but keeps a hand near the wheel and a foot near the brakes. Same idea here. AI agents only get permissions based on verified behavior, and trust builds gradually through controlled access.
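To make the idea of earned permissions concrete, here is a minimal Python sketch. It is not taken from the roadmap: the tier names, the promotion threshold, and the rule that a single failed check resets trust are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical permission tiers, ordered from least to most access.
TIERS = ["read_only", "write_sandbox", "write_production"]
# Hypothetical threshold: verified-clean tasks needed to move up one tier.
PROMOTE_AFTER = 3


@dataclass
class AgentTrust:
    """Tracks how much access an agent has earned (illustrative only)."""
    tier: int = 0          # index into TIERS
    clean_streak: int = 0  # verified-clean tasks since the last promotion

    def record(self, verified_ok: bool) -> None:
        """Update trust after one task has been checked by a supervisor."""
        if not verified_ok:
            # Assumption: any failed check drops the agent back to the lowest tier.
            self.tier = 0
            self.clean_streak = 0
            return
        self.clean_streak += 1
        if self.clean_streak >= PROMOTE_AFTER and self.tier < len(TIERS) - 1:
            self.tier += 1
            self.clean_streak = 0

    def allowed(self, permission: str) -> bool:
        """An agent may use a permission at or below its current tier."""
        return TIERS.index(permission) <= self.tier


agent = AgentTrust()
for _ in range(PROMOTE_AFTER):
    agent.record(verified_ok=True)
print(agent.allowed("write_sandbox"))     # earned after three clean tasks
print(agent.allowed("write_production"))  # not yet earned
```

In this sketch, access is never granted up front. It is only unlocked by a record of checked behavior and can be withdrawn at once, which mirrors the instructor keeping a foot near the brake.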

Deepmind treats its internal AI agents like employees who already have office access but might work against the company's interests. The framework builds on the MITRE ATT&CK framework, a well-established cybersecurity tool that breaks potential attacks down into individual tactics and techniques. That lets Deepmind track risks systematically, spot suspicious behavior early, and test defenses in controlled exercises.

Several components work together. A new threat model for AI sits alongside a detection system where trusted AI systems act as "supervisors," watching the reasoning and actions of active agents. A prevention system can block harmful actions before damage occurs. Deepmind measures how well it all works by tracking how much traffic gets monitored, how much misconduct gets caught, and how fast the system responds.
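The three measurements can be sketched in a few lines of Python. The record schema and the metric names below are hypothetical, and the sketch assumes ground-truth misconduct labels are available, for example from a controlled exercise. Deepmind's actual definitions may differ.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class AgentAction:
    """One logged agent action (hypothetical schema, not Deepmind's)."""
    monitored: bool    # did a supervisor inspect this action?
    misconduct: bool   # ground-truth label, e.g. from a controlled exercise
    flagged: bool      # did the supervisor raise an alert?
    response_seconds: Optional[float] = None  # time until the system reacted


def control_metrics(actions):
    """Compute coverage, catch rate, and median response time."""
    total = len(actions)
    monitored = sum(a.monitored for a in actions)
    bad = [a for a in actions if a.misconduct]
    # Misconduct only counts as caught if it was both monitored and flagged.
    caught = [a for a in bad if a.monitored and a.flagged]
    latencies = [a.response_seconds for a in caught
                 if a.response_seconds is not None]
    return {
        "coverage": monitored / total if total else 0.0,
        "catch_rate": len(caught) / len(bad) if bad else None,
        "median_response_seconds": median(latencies) if latencies else None,
    }


actions = [
    AgentAction(monitored=True, misconduct=False, flagged=False),
    AgentAction(monitored=True, misconduct=True, flagged=True, response_seconds=2.0),
    AgentAction(monitored=True, misconduct=True, flagged=False),
    AgentAction(monitored=False, misconduct=True, flagged=False),
    AgentAction(monitored=True, misconduct=True, flagged=True, response_seconds=4.0),
]
metrics = control_metrics(actions)
print(metrics)
```

In this toy log, four of five actions are monitored and two of four misconduct cases are caught. Unmonitored misconduct lowers the catch rate, so gaps in coverage show up in the second number as well as the first.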

This monitoring has an expiration date, though. It only works as long as the monitored models don't learn to game it.

That's why the roadmap ties security measures to measurable milestones in AI capabilities. Deepmind flags two areas worth watching.

The first is evasion. Right now, Deepmind monitors a model's visible chain of thought, but models could learn to conceal their reasoning. That might happen through "oversight awareness," where the model figures out it's being watched, or "opaque reasoning," where it thinks without producing visible text. If either becomes reality, Deepmind says behavioral analysis and inspections of the model's internals would need to kick in.

The second is the potential for damage. Low-risk, reversible actions can be reviewed after the fact by checking logs. But high-risk actions like major cyberattacks need real-time prevention that blocks them before they execute. The roadmap sorts security measures into four detection levels (D1 through D4) and three response levels (R1 through R3).
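The split between after-the-fact review and real-time prevention can be illustrated with a small routing function. This is a sketch under stated assumptions: the route names, the action fields, and the approval callback are invented for illustration, and the code does not attempt to model the roadmap's D1 through D4 or R1 through R3 levels, which this article does not define in detail.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ASYNC_LOG_REVIEW = "async_log_review"  # run now, check the logs later
    REALTIME_GATE = "realtime_gate"        # hold until a supervisor approves


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical description of something an agent wants to do."""
    name: str
    reversible: bool
    high_risk: bool


def route(action):
    # Assumption: anything high-risk OR irreversible is gated in real time.
    # The article only states the two clear-cut cases.
    if action.high_risk or not action.reversible:
        return Route.REALTIME_GATE
    return Route.ASYNC_LOG_REVIEW


def execute(action, approve, audit_log):
    """Return True if the action may run. Every decision is logged."""
    chosen = route(action)
    audit_log.append((action.name, chosen.value))
    if chosen is Route.REALTIME_GATE and not approve(action):
        return False  # blocked before it executes
    return True


def deny_all(action):
    """Stand-in supervisor that rejects every gated action."""
    return False


audit_log = []
ran_edit = execute(ProposedAction("edit_scratch_file", True, False),
                   deny_all, audit_log)
ran_delete = execute(ProposedAction("delete_prod_database", False, True),
                     deny_all, audit_log)
print(ran_edit, ran_delete, audit_log)
```

The low-risk edit runs immediately and is only recorded for later review, while the destructive action is stopped before it executes because the stand-in supervisor refuses it.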

To train the system with real-world data, Deepmind's team built an internal prototype of an asynchronous monitoring system for coding agents and ran it across one million tasks. According to the results, the monitor goes beyond simple keyword filtering and picks up on behavior-based patterns. Deepmind says the system already runs as a live monitor for the Gemini Spark agent and can catch issues like accidental data deletion in real time.

Most flagged events aren't malicious. They tend to be cases of misinterpretation or overzealousness, where an agent chases a user's goal too aggressively. That distinction matters as security protocols keep evolving.

Deepmind also published a separate paper aimed at policymakers. "Three Layers of Agent Security" breaks down security measures for individual agents, multi-agent systems, and the broader ecosystem, covering everything from cyber defense to societal resilience.

In a post on X, Deepmind warns that there's a "narrow window" to lock in security protocols before multi-agent systems scale globally. The company argues that AI labs, governments, and researchers need to treat layered agent security as a shared priority.