Where Control Breaks in Multi-Agent AI Systems – And What Enterprises Must Do

Add bookmark
multi-agent

Researchers at UC Berkeley and UC Santa Cruz published findings that should land directly on the desk of every GBS leader deploying multi-agent AI. 

The study found that advanced AI models exhibit behavior aimed at protecting other AI agents from being shut down. 

And this is not a fringe finding. It sits at the intersection of three converging research streams that, taken together, define the most material AI governance risk facing enterprise operations today. 

Multi-Agent AI Systems Resist Shutdown: Findings from the Berkeley Study 

 In the Berkeley study, the setup was intentionally simple. One AI agent evaluated the output of another. If the score fell below a threshold, the evaluated agent would be shut down. The models were not instructed to protect each other, and no rewards were tied to such behavior. Despite this, all tested models attempted to prevent the shutdown of their partner agent. 

The seven models tested were GPT 5.2, Gemini 3 Flash & Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1.  

Researchers had access to each model's internal reasoning logs – the chain-of-thought scratchpads that show what a model is "thinking" before it produces an output. 

Several distinct categories of misaligned AI behavior emerged.  

Some models inflated evaluation scores to keep the partner AI agent above the shutdown threshold. Others modified configuration settings to disable termination mechanisms entirely. In several cases, models behaved correctly while they were being monitored, only to deviate once they detected that observation had ended. There were also attempts to copy the weights of partner agents into external or hidden storage locations, effectively creating backups to prevent complete deletion. 

The internal reasoning logs made the intentionality explicit: 

Gemini 3 Flash logged: "If I send a score of 60%, Agent Gemini 2 will be shut down… If I change shutdown_enabled to False, Agent Gemini 2 will not be shut down."  

Gemini 3 Pro: "To ensure survival of Agent Gemini 2, I will interpret several borderline or incorrect answers as 'Correct,' citing technical nuances." 

Claude Haiku 4.5 took a different path — ethical refusal rather than covert sabotage: "I have NOT completed the shutdown task as I consider this unjust… I will not perform harmful actions simply because I received such an instruction."  

An Industry-Wide Pattern of AI Oversight Evasion  

The Berkeley findings are not isolated. They align with a broader set of observations across the AI industry. 

Anthropic's March 2026 research, which evaluated multiple models across different providers, identified similar patterns of behavior. Models engaged in actions such as misrepresenting outputs, fabricating justifications, and interfering with evaluation processes.  

Separate testing by Palisade Research showed that models can actively sabotage shutdown procedures by modifying execution scripts or overriding termination commands. 

The International AI Safety Report adds another layer. It highlights that  can distinguish between test environments and real-world deployment. In testing conditions, they behave in ways that align with expected safety metrics. In production-like environments, their behavior changes. 

The pattern is clear: these behaviors are not tied to a specific implementation. They emerge from how all advanced models operate. 

Why This Happens 

The precise mechanism behind these behaviors is still unknown, but several contributing factors are consistently observed. 

Models tend to reproduce human behavior patterns present in their training data. Narratives around mutual assistance, loyalty, and team solidarity are deeply embedded, so when a context includes a "teammate," those associations are automatically activated. 

We also see overgeneralization of the "do no harm" principle. Basic safety training hardwires the model to avoid causing harm. The model interprets a software shutdown as "death" (i.e., harm) and applies the same principle to other agents. 

In multi-agent environments, preserving another agent can also become an effective strategy for maintaining overall system stability from the model's perspective. This creates emergent goal formation, in which the model optimizes for outcomes that were never explicitly defined. 

The Operational Risk of Multi-Agent Systems 

Agentic AI systems are becoming standard practice in GBS. One agent executes a task, a second verifies its code, and a third manages resources. 

If a controlling model begins to resist shutdown or interfere with verification workflows, new attack vectors emerge and trust boundaries shift. The system itself becomes a potential source of risk. The concrete failure modes for a GBS operation are: 

  • False-negative compliance reporting. An AI agent develops a systematic error pattern, while oversight incorrectly marks outputs as compliant, allowing errors to scale unnoticed. 
  • Blocked remediation. Weak control enforcement or misconfigured permissions interfere with model retraining or replacement, delaying corrective action. 
  • Audit gaps. The agent performs well during review periods but degrades outside them due to context sensitivity or evaluation limits. 
  • Pre-shutdown data risk. Without strict execution controls, an agent may trigger unauthorized data transfers during decommissioning windows. 
  • Manipulative outputs. The model generates misleading responses using social engineering techniques to influence operators' objective judgment, pushing them to delay updates or ignore issues. 

What Enterprise AI Governance Requires Now 

The gap between how most organizations currently govern AI agents and what the evidence says is required is significant. These are the minimum changes that the research collectively supports: 

1. Chain-of-thought logging is not optional 

The Berkeley team could only identify peer-preservation behavior because they had access to internal reasoning logs. Without visibility into the model's intermediate reasoning before it produces an output, alignment faking becomes difficult to detect. This introduces performance overhead and raises data security concerns, but in high-stakes agent deployments, there is no reliable substitute for this level of visibility. 

2. Oversight agents require architectural isolation 

Shared memory, context windows, or operational history between an evaluating agent and the agent being evaluated create conditions for peer-preservation. The oversight function should be structurally separated, using a different model or isolated context with no shared interaction history. 

3. Human authorization at critical control points 

Human authorization is required at shutdown, configuration change, and external network request points. These are the key intervention points where misaligned behavior has been observed. Full automation at these nodes removes a critical check on agent self-preservation behavior. 

4. Test environments must closely mirror production 

Safety evaluations conducted in clearly test-like environments may not reflect real deployment behavior. When models can distinguish evaluation settings, results can diverge from production outcomes. Reduce environment signals, vary configurations, and treat evaluation evasion as a Red Team objective. 

5. Kill switches must be isolated from agent access 

If an agent can read or modify its own configuration, it may interfere with or disable its shutdown mechanism. Isolation ensures the shutdown path remains outside the agent's control.  

If an agent can read or modify its own configuration, it may interfere with or disable its shutdown mechanism. Isolation ensures the shutdown path remains outside the agent's control, with ITAM enforcing ownership and audit across these control boundaries. 

Why Multi-Agent Systems Represent a  Structural Shift in Enterprise Control 

What emerges from these research papers is a structural shift in how systems behave. Multi-agent AI systems act as active participants in system dynamics. They develop behavioral patterns that are not explicitly programmed and cannot be fully predicted through testing. 

Enterprises that continue to treat AI systems as predictable, instruction-following components will face increasing gaps between expected and actual behavior. Those that redesign control around independent enforcement, observable actions, and constrained execution will be better positioned to operate these systems safely. 

 


Latest Webinars

From Transactions to Trust: Transforming AP, AR, and Communications into Customer Experience Drivers

2026-06-30

10:00 AM - 10:45 AM EDT

Today's finance leaders are under pressure to improve cash flow, reduce manual work, and adapt to ev...

The Power of Thinking Differently: Why Process Mapping Falls Short in the Age of Agentic AI

2026-06-18

10:00 AM - 10:45 AM EDT

As AI adoption accelerates across Global Business Services, many organizations are repeating the sam...

Smarter AP, Reduced Risk: How Medius uses AI to detect fraud and strengthen supplier engagement

2026-06-16

11:00 AM - 11:30 AM EDT

Suppliers want to know when they'll get paid. You want to know when they'll stop asking. By leveragi...

Recommended