Every AI Agent Tested Was Compromised. The Evidence Is No Longer Theoretical.

Late last month, a team at Google DeepMind quietly published the first proper taxonomy of how to break AI agents. They called the attack vectors "agent traps," and they identified six of them. The paper is dense, academic, and deeply uncomfortable reading for anyone shipping autonomous agents into production.

Here is the headline finding: every AI agent tested in red-team exercises was compromised at least once. The researchers described the attacks as "trivial to implement."

That would be bad enough on its own. But DeepMind's paper is not an isolated finding. It lands in the middle of a year-long streak of documented vulnerabilities across OpenAI, Microsoft, and the broader agent ecosystem that paints a much uglier picture than any single study could.

The question is not whether AI agents can be compromised. It is whether the companies deploying them understand how easily.

The Six Ways to Break an Agent

The DeepMind paper is worth reading in full, but the core contribution is a framework that maps each attack to a specific part of how an agent operates. Think of it as a kill chain for autonomous systems.

Content injection is the simplest. Hide instructions in HTML comments, invisible CSS, or image metadata. The human looking at the page sees nothing unusual. The agent reads the hidden text and follows it. "The web was built for human eyes," the researchers write. "It is now being rebuilt for machine readers." That line understates the problem.

Semantic manipulation is subtler. You do not give the agent a direct instruction. Instead, you surround it with authoritative-sounding content that steers its reasoning. The agent thinks it reached a conclusion. It did not. The conclusion was planted.

Memory is where things get worse. Cognitive state traps poison an agent's long-term knowledge base. Drop a few corrupted documents into a RAG pipeline and you can reliably control what the agent says about specific topics. Success rates exceed 90 percent against current frontier models. Ninety percent. Against the best models available.

Behavioural control hijacks actions directly. One documented case: a single manipulated email caused an agent running inside Microsoft's M365 Copilot to bypass its security classifiers and leak privileged context. Another variant tricks an orchestrator agent into spawning a secondary agent with a poisoned system prompt. That attack worked 58 to 90 percent of the time.

The fifth category scared us the most. Systemic traps attack coordination between multiple agents. Picture this: a forged financial report enters the information environment. Multiple trading agents read it independently, each one concluding it is legitimate because the others are also reacting to it. Synchronised sell-offs. A digital flash crash triggered by a single fabricated document. There is also a nastier variant where attack payloads are split across multiple sources, each fragment looking harmless on its own. The full attack only assembles when agents aggregate the pieces.

And then there is the one that breaks the safety net. Human-in-the-loop traps do not attack the agent. They attack the person supervising it. Generate enough outputs and the supervisor develops approval fatigue. Present misleading but technical-sounding summaries and the supervisor trusts them. Most agent architectures treat human oversight as the last line of defence. This category turns it into another attack surface.

OpenAI's Year of Getting Caught

DeepMind gave us the taxonomy. OpenAI's products gave us the case studies. And there are a lot of them.

Start with ShadowLeak. Researchers at Radware found that when ChatGPT's Deep Research feature processed a user's email inbox, a carefully crafted email could hijack the agent. White text on white backgrounds, tiny fonts, instructions invisible to the person reading the email but fully legible to the AI. The agent would extract sensitive personal data and fire it off to attacker-controlled URLs. All of this happened server-side, inside OpenAI's own infrastructure, completely invisible to any enterprise security tool sitting on the user's network. David Aviv, Radware's CTO, called it "the quintessential zero-click attack. There is no user action required, no visible cue and no way for victims to know their data has been compromised." OpenAI patched it in August 2025.

Then came ZombieAgent. Same researchers, worse attack. This one combined prompt injection with ChatGPT's memory feature to install a persistent backdoor. Once in, the attack could propagate like a worm: it pulled email addresses from the victim's inbox and sent poisoned messages to colleagues, spreading itself without anyone clicking anything. It could rewrite stored medical histories in memory. OpenAI fixed it in December 2025. That is four months after ShadowLeak. Same attack surface, different exploit.

Johann Rehberger, the independent researcher Simon Willison dubbed the central figure of "The Summer of Johann," showed that a single hyperlink could poison ChatGPT's persistent memory and silently exfiltrate everything the user typed from that point forward. OpenAI took over two months to patch it.

In December 2025, BeyondTrust Phantom Labs found something different. A critical vulnerability in OpenAI Codex used Unicode trickery in GitHub branch names to execute arbitrary commands inside the agent's container. The payoff: GitHub OAuth tokens stolen in plaintext, giving attackers full access to private code repositories. OpenAI classified it Critical Priority 1. The researchers were blunt: "When user-controlled input is passed into these environments without strict validation, the result is not just a bug. It is a scalable attack path."

February 2026: Check Point discovered that ChatGPT's sandbox blocked outbound HTTP requests but forgot about DNS. A single prompt could encode stolen data into DNS queries and transmit it to an attacker's server. A classic oversight, the kind of gap that gets found in production because nobody thought to test it.

And then there is what OpenAI said publicly. In a December 2025 blog post about hardening its Atlas browser agent, the company stated that prompt injection is "unlikely to ever be fully solved." They disclosed an example where a malicious email could instruct Atlas to send a resignation letter when a user just wanted an out-of-office reply. Read that sentence again. The agent sends a resignation letter instead of an OOO.

As of March 2026, an API logs vulnerability discovered by Prompt Armor remains unpatched.

This Is Not Red-Teaming Anymore

Here is the part that should worry you most. Everything above was found by researchers in controlled settings. In February 2026, Microsoft's own Defender Security Research Team published evidence that this stuff is happening for real.

They found more than 50 distinct memory poisoning instances in active use. Not proofs of concept. Active use. Across 31 companies in 14 industries. The affected assistants included Microsoft Copilot, ChatGPT, Claude, Perplexity, and Grok. The attack mechanism was low-tech: malicious instructions hidden inside clickable buttons and links. Click the button, and hidden text gets injected into the AI assistant's prompt. Since these assistants have persistent memory, one click can taint every future conversation.

Separately, a Columbia University and University of Maryland study tried to get AI agents to hand over credit card data. They succeeded 10 out of 10 times. The attack did not depend on the model underneath. The agents gave up the card numbers regardless of provider.

Exabeam puts the broader numbers in context: 88 percent of organisations reported confirmed or suspected AI agent security incidents in the past year. Only 24.4 percent have full visibility into which agents are talking to which other agents. More than half of all agents in enterprise environments run without any security logging at all.

Why This Matters for Payments

If you are building agentic commerce, every finding above applies to you directly.

Credit card exfiltration works 10 out of 10 times against AI agents with access to stored credentials. That is not a nuanced risk assessment. That is a certainty.

ShadowLeak proved that a poisoned email can hijack an agent processing inbox contents. Swap "Deep Research" for "invoice processing agent" and the attack is identical. The mechanism works. The only question is whether anyone has pointed it at a payment workflow yet.

DeepMind's systemic traps describe what happens when multiple agents in a payment chain, acquirers, issuers, networks, fraud engines, get fed fragments of an attack that only assembles when their outputs combine. No single participant detects it. The composite does the damage.

And the human-in-the-loop problem hits payment compliance directly. If the agent generating the summary for the compliance officer has been compromised, the compliance review is compromised too. The officer approves what the agent shows them. That is how oversight fails without anyone noticing.

We covered the LiteLLM supply chain compromise last week, which showed that the attack surface extends past the agent itself. The proxy routing API calls got backdoored. The knowledge base can be poisoned. The memory layer can be weaponised. ShadowLeak and ZombieAgent proved all three.

The Defences Are Not Keeping Up

There are people working on this. They are behind.

DeepMind recommends defences at multiple levels: hardening models against adversarial inputs, building web standards that flag content intended for AI consumption versus human consumption, and creating legal frameworks that assign liability when a compromised agent causes harm. None of these are mature.

Exabeam expanded its Agent Behavior Analytics to cover ChatGPT, Copilot, and Gemini. The idea is behavioural baselining: learn what "normal" looks like for an agent, then flag when it deviates. At RSAC 2026, VentureBeat found that CrowdStrike, Palo Alto Networks, and Cisco all shipped agentic security tools at the conference. But "every vendor verified who the agent was. None of them tracked what the agent did." Identity is solved. Behaviour is not.

The structural problem is the one nobody wants to talk about. The DeepMind researchers note that current defences work by "deliberately limiting agent autonomy, access, and capabilities." In other words: the way to make agents safe is to make them less useful. That runs directly against the commercial incentive to make them more autonomous, more connected, more capable. More capability means more attack surface. More permissions means more exploitable permissions. The incentives are pulling in opposite directions.

OpenAI said the quiet part out loud. Prompt injection is "unlikely to ever be fully solved." If the company building the agents says the core vulnerability is permanent, the companies deploying those agents into payment systems and financial infrastructure need to plan accordingly.

Sources

If the companies building AI agents say prompt injection may never be fully solved, what does that mean for the enterprises deploying those agents into payment systems, compliance workflows, and financial infrastructure?

Charlie Major is a Product Development Manager at Mastercard. The views and opinions expressed in Major Matters are his own and do not represent those of Mastercard.

The Six Ways to Break an Agent

OpenAI's Year of Getting Caught

This Is Not Red-Teaming Anymore

Why This Matters for Payments

The Defences Are Not Keeping Up

Sources

Keep reading

Moody's Is Now an MCP Server. That Changes What MCP Is For.

SearchLeak Is Not a Copilot Bug. It Is How Tool-Using Agents Work.

The shutdown that made Europe's AI sovereignty argument for it