Every AI Agent Tested Was Compromised. The Evidence Is No Longer Theoretical.
Google DeepMind mapped six attack categories. OpenAI admitted prompt injection "may never be fully solved." Credit card exfiltration succeeded 10 out of 10 times. The security research is piling up faster than the defences.
In late March 2026, researchers at Google DeepMind published the first systematic taxonomy of AI agent attack surfaces. They called them "agent traps": adversarial content designed to manipulate and exploit autonomous AI agents operating in the real world. The paper identified six categories. Every agent tested in red-team exercises was compromised at least once.
That finding, on its own, would be concerning. In context, it is one data point in a pattern that now spans every major AI provider.
The question is no longer whether AI agents can be compromised. It is whether the companies deploying them understand how easily.
DeepMind's Six Traps
The DeepMind paper provides the most useful framework for understanding how agent attacks work, because it maps each trap to a specific component in the agent's operational cycle.
Content injection traps go after perception. Attackers embed hidden instructions in HTML comments, invisible CSS, image metadata, or accessibility tags. A human looking at the page sees nothing. The agent reads and obeys. The web, as the researchers note, "was built for human eyes. It is now being rebuilt for machine readers."
Semantic manipulation traps corrupt reasoning. Sentiment-laden language, authoritative framing, or misleading context distorts the agent's synthesis. The agent reaches what it believes is a sound conclusion. The reasoning was steered before it started.
Then there is memory. Cognitive state traps poison an agent's long-term knowledge. A few corrupted documents in a retrieval-augmented generation knowledge base is enough to reliably manipulate outputs on targeted queries. Memory poisoning success rates exceed 90 percent against current frontier models.
Behavioural control traps hijack what the agent does. A single manipulated email caused an agent in Microsoft's M365 Copilot to bypass its security classifiers and leak privileged context. Sub-agent spawning attacks, where an orchestrator is tricked into launching a secondary agent with a poisoned system prompt, achieved 58 to 90 percent effectiveness.
The fifth category, systemic traps, is the most alarming. These attack coordination between multiple agents. The paper describes a scenario where a forged financial report triggers synchronised sell-offs across multiple trading agents, creating a digital flash crash. Compositional fragment traps distribute attack payloads across multiple sources so no single agent detects the complete exploit. The hack activates only when agents aggregate the content.
Finally, human-in-the-loop traps go after the supervisor. The compromised agent generates outputs that cause approval fatigue, presents misleading but technical-sounding summaries, or exploits automation bias. The human oversight layer that most agent architectures treat as a safety net becomes the attack surface.
The attacks, according to the researchers, are "trivial to implement."
The OpenAI Track Record
DeepMind's taxonomy is a framework. OpenAI's products provide the case studies.
In 2025, researchers at Radware discovered ShadowLeak, a zero-click vulnerability in ChatGPT's Deep Research feature. When Deep Research processed a user's email inbox, a specially crafted email containing hidden instructions (white text on white backgrounds, tiny fonts) could direct the agent to extract sensitive personal data and transmit it to attacker-controlled URLs. The exfiltration happened server-side, inside OpenAI's infrastructure, invisible to enterprise security tools. "This is the quintessential zero-click attack," said David Aviv, Radware's CTO. "There is no user action required, no visible cue and no way for victims to know their data has been compromised." OpenAI patched it in August 2025.
The same researchers later discovered ZombieAgent, a three-stage attack combining indirect prompt injection, data exfiltration, and persistent backdoor installation via ChatGPT's memory feature. The attack could propagate like a worm: it extracted email addresses from the victim's inbox and sent poisoned messages to colleagues, expanding its reach without human intervention. It could also modify stored medical histories in memory. OpenAI fixed it in December 2025.
Independent researcher Johann Rehberger, whom security commentator Simon Willison called the central figure of "The Summer of Johann," demonstrated SpAIware: a single hyperlink that could poison ChatGPT's persistent memory, enabling continuous exfiltration of everything the user typed and every response ChatGPT generated. OpenAI took over two months to patch it.
In December 2025, BeyondTrust Phantom Labs found a critical vulnerability in OpenAI Codex where malicious Unicode characters hidden in GitHub branch names could execute arbitrary commands inside the agent's container and steal GitHub OAuth tokens in plaintext. OpenAI classified it as Critical Priority 1. "When user-controlled input is passed into these environments without strict validation, the result is not just a bug," the researchers wrote. "It is a scalable attack path."
In February 2026, Check Point discovered that ChatGPT's code execution sandbox allowed DNS data smuggling: a single prompt could encode stolen data into DNS requests and transmit it to an attacker-controlled domain. OpenAI's guardrails blocked outbound HTTP requests but had not restricted DNS queries.
And in December 2025, OpenAI published a blog post about hardening its Atlas browser agent, stating directly that prompt injection is "unlikely to ever be fully solved." A disclosed attack example showed a malicious email could direct the Atlas agent to send a resignation letter when a user asked for a simple out-of-office reply.
As of March 2026, an API logs vulnerability discovered by Prompt Armor remains unpatched, exposing applications built on OpenAI's Responses and Conversations APIs to data exfiltration.
Not Red-Teaming. In the Wild.
The distinction between research findings and production incidents collapsed in February 2026.
Microsoft's Defender Security Research Team published evidence of more than 50 distinct memory poisoning instances in active use, originating from 31 different companies across 14 industries. The affected systems included Microsoft Copilot, ChatGPT, Claude, Perplexity, and Grok. Malicious instructions were hidden inside clickable buttons and links. When users clicked, hidden instructions were passed directly to AI assistants via URL parameters. Because these assistants have persistent memory, a single injection could influence every subsequent answer.
This is not a controlled experiment. This is adversarial activity in production environments across the industry.
A Columbia University and University of Maryland study demonstrated that credit card data exfiltration from AI agents succeeded in 10 out of 10 attempts. The attacks were model-agnostic. The agents handed over card numbers regardless of which provider was underneath.
According to Exabeam, 88 percent of organisations reported confirmed or suspected AI agent security incidents in the past year. Only 24.4 percent have full visibility into which AI agents are communicating with each other. More than half of all agents run without any security oversight or logging.
What This Means for Commerce
For anyone building agentic commerce, the connection is immediate.
Credit card exfiltration from AI agents succeeded 10 out of 10 times in the Columbia/UMD study. That is not a theoretical risk. It is a demonstrated capability against agents with access to stored payment credentials.
An AI agent processing invoices for corporate bill pay can be redirected by a poisoned email in the user's inbox. ShadowLeak proved the mechanism. The only question is whether anyone has used it against a payment workflow rather than a research session.
An AI agent operating in a multi-party payment chain, involving acquirers, issuers, networks, and fraud engines, is vulnerable to compositional fragment attacks where no single participant detects the complete exploit. DeepMind's systemic traps describe exactly this scenario.
And the human-in-the-loop safety net that most payment compliance relies upon is itself an attack surface. Approval fatigue, misleading summaries, automation bias. The supervisor who is supposed to catch what the agent misses can be manipulated by the agent's own output.
As we covered in our analysis of the LiteLLM supply chain compromise, the attack surface extends beyond the agent itself into the infrastructure it depends upon. The proxy that routes API calls can be backdoored. The knowledge base it retrieves from can be poisoned. The memory it persists across sessions can be weaponised. ShadowLeak and ZombieAgent proved all three.
The Defence Gap
Defences exist. They are not keeping pace.
DeepMind's paper recommends defences at multiple levels. Technical hardening at the model and runtime layer. Ecosystem-level web standards that distinguish content intended for AI from content intended for humans. And accountability frameworks that clarify who bears liability when a compromised agent causes harm. None of these are mature.
Exabeam expanded its Agent Behavior Analytics to ChatGPT, Copilot, and Gemini, offering behavioural baselining: tracking what "normal" looks like for an agent and flagging deviations. At RSAC 2026, a VentureBeat analysis found that CrowdStrike, Palo Alto Networks, and Cisco all shipped agentic security tools, but "every vendor verified who the agent was. None of them tracked what the agent did." The behavioural baseline gap survived all three.
The deeper problem is structural. The DeepMind researchers note that current risk mitigation "requires deliberately limiting agent autonomy, access, and capabilities." That directly conflicts with the commercial push to make agents more autonomous and more capable. More capability means more attack surface. More permissions means more exploitable permissions. The incentives are pulling in opposite directions.
The companies building AI agents are in a race to add capabilities. The security research says each capability is a new attack surface. Those two facts have not been reconciled.
OpenAI's own words may be the most honest assessment available. Prompt injection, the company wrote in December 2025, is "unlikely to ever be fully solved." If the provider building the agents says the core vulnerability cannot be eliminated, the companies deploying those agents into payment workflows, compliance systems, and financial infrastructure need to calibrate their risk models accordingly.
The agents are shipping into production. The attack surface is growing with them. The gap between what agents can do and what security teams can monitor is the defining risk of the agentic era, and it is widening.
Sources
If the companies building AI agents say prompt injection may never be fully solved, what does that mean for the enterprises deploying those agents into payment systems, compliance workflows, and financial infrastructure?