GPT-5.4 vs Claude: The Agent Layer War

The agent that controls your software will soon control your transactions.

On March 5, 2026, OpenAI launched GPT-5.4, its first general-purpose model with native computer use capabilities. The model can interpret screenshots, issue mouse and keyboard commands, and navigate software applications autonomously, completing multi-step workflows that previously required a human at the keyboard.

It is not the first AI model to do this. Anthropic has been shipping computer use capabilities since late 2024 and recently acquired a startup specifically to accelerate the technology. But GPT-5.4 is the loudest signal yet that the era of AI agents operating inside your applications has arrived, and that every major AI company is betting its future on it.

The tech press is covering the benchmarks. We want to ask a different question.

❝

When AI agents can operate software autonomously, they will initiate purchases, approve invoices, and execute transactions. The agent layer is becoming the new checkout. Whoever controls it will control the transaction.

What GPT-5.4 Ships

OpenAI describes GPT-5.4 as its "most capable and efficient frontier model for professional work." The headline capability is native computer use: the model can write code to operate computers via libraries like Playwright and issue direct mouse and keyboard commands in response to screenshots, according to VentureBeat.
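
In outline, screenshot-driven computer use is a loop: capture the screen, ask the model for the next action, execute it, and repeat until the model signals it is finished. The sketch below shows only that control flow, with the model replaced by a scripted stub. `UiAction`, `run_agent`, and the callback names are illustrative assumptions, not OpenAI's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UiAction:
    kind: str       # "click", "type", or "done" (hypothetical action vocabulary)
    x: int = 0
    y: int = 0
    text: str = ""


def run_agent(
    observe: Callable[[], bytes],
    decide: Callable[[bytes], UiAction],
    act: Callable[[UiAction], None],
    max_steps: int = 10,
) -> int:
    """Run the observe-decide-act loop; return how many actions were executed."""
    executed = 0
    for _ in range(max_steps):
        screenshot = observe()        # capture the current screen state
        action = decide(screenshot)   # a real system would call the model here
        if action.kind == "done":
            break
        act(action)                   # send the click or keystrokes to the OS
        executed += 1
    return executed


# Stub wiring: a scripted "model" that clicks, types, then stops.
script = iter([
    UiAction("click", 120, 48),
    UiAction("type", text="Q3 report"),
    UiAction("done"),
])
log: list[UiAction] = []
steps = run_agent(observe=lambda: b"", decide=lambda _shot: next(script), act=log.append)
```

The `max_steps` cap is a design choice of this sketch: an agent that never emits "done" should stop on its own rather than keep clicking.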

The benchmarks tell the story. On OSWorld Verified, which measures an AI model's ability to navigate real desktop environments, GPT-5.4 scored 75.0 percent, up from 47.3 percent for GPT-5.2, according to The Decoder. The human comparison group scored 72.4 percent, making this the first time an OpenAI model has surpassed human performance on this specific test.

On BrowseComp, which measures how well AI agents can track down hard-to-find information across the web, GPT-5.4 scored 82.7 percent, up from 65.8 percent for GPT-5.2, and GPT-5.4 Pro reached 89.3 percent.

Beyond computer use, GPT-5.4 introduces several features aimed squarely at agentic workflows:

1 million token context window in the API and Codex, enabling agents to plan, execute, and verify tasks across long horizons, though pricing doubles once input exceeds 272,000 tokens
Tool Search, a structural fix for large tool ecosystems that retrieves tool definitions on demand rather than loading them all into the prompt, reducing token usage by 47 percent in tests
Professional work capabilities including spreadsheet creation and editing (87.3 percent on internal financial modelling tests, up from 68.4 percent) and presentation generation preferred by human evaluators in 68 percent of cases
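
The idea behind Tool Search, fetching only the tool definitions a task needs instead of pasting every definition into the prompt, can be illustrated with a toy registry. This is a minimal sketch under stated assumptions: the sources cited here do not describe OpenAI's retrieval mechanism, so keyword overlap stands in for it, and `ToolDef`, `search_tools`, `rough_tokens`, and the registry contents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDef:
    name: str
    description: str
    schema: str  # JSON schema text that would otherwise be pasted into the prompt


def rough_tokens(text: str) -> int:
    """Very rough token estimate: count whitespace-separated words."""
    return len(text.split())


def search_tools(query: str, registry: list[ToolDef], top_k: int = 2) -> list[ToolDef]:
    """Rank tools by how many query words appear in their name or description."""
    words = set(query.lower().split())
    scored = []
    for tool in registry:
        text = tool.name.replace("_", " ") + " " + tool.description
        score = len(words & set(text.lower().split()))
        if score > 0:
            scored.append((score, tool))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:top_k]]


def prompt_cost(tools: list[ToolDef]) -> int:
    """Estimated prompt overhead of including these tool definitions."""
    return sum(rough_tokens(t.description) + rough_tokens(t.schema) for t in tools)


REGISTRY = [
    ToolDef("create_invoice", "create a new invoice for a customer",
            '{"customer": "string", "amount": "number"}'),
    ToolDef("send_email", "send an email message to a recipient",
            '{"to": "string", "body": "string"}'),
    ToolDef("query_ledger", "look up ledger entries by date range",
            '{"start": "string", "end": "string"}'),
    ToolDef("resize_image", "resize an image to given dimensions",
            '{"width": "number", "height": "number"}'),
]

selected = search_tools("create invoice for customer", REGISTRY)
```

Here only `create_invoice` matches the query, so its definition alone enters the prompt. The saving grows with the size of the registry, which is why the approach matters most for large tool ecosystems.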

Pricing sits at $2.50 per million input tokens and $15 per million output tokens for the standard model, with a Pro version at $30 and $180 respectively, per VentureBeat.
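
To make those rates concrete, here is a back-of-envelope estimator for the standard model. The per-token prices and the 272,000-token threshold come from the figures above. How the long-context surcharge is applied is an assumption of this sketch: it doubles the input rate for the whole request and leaves the output rate unchanged, which may not match OpenAI's actual billing rules.

```python
INPUT_USD_PER_MTOK = 2.50         # standard model, per million input tokens
OUTPUT_USD_PER_MTOK = 15.00       # standard model, per million output tokens
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    long_context_input_multiplier: float = 2.0,
) -> float:
    """Estimate the cost of one request in US dollars.

    Assumption: above the threshold the input rate for the whole request is
    multiplied, and the output rate stays the same.
    """
    input_rate = INPUT_USD_PER_MTOK
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        input_rate *= long_context_input_multiplier
    total = input_tokens * input_rate + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


short_run = estimate_cost_usd(100_000, 10_000)
long_run = estimate_cost_usd(500_000, 10_000)
```

Under these assumptions, a 100,000-token prompt with 10,000 output tokens comes to $0.40, while a 500,000-token prompt with the same output comes to $2.65.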

❝

GPT-5.4 is not a chatbot upgrade. It is a platform play for AI that does the work, not just talks about it.

What Anthropic Is Building

Anthropic has been shipping computer use capabilities since October 2024, when Claude 3.5 Sonnet became the first major model to offer screenshot-based desktop navigation. The company has been iterating aggressively since.

On the same OSWorld benchmark where GPT-5.4 scores 75.0 percent, Anthropic's Sonnet 4.6 reaches 72.5 percent, up from under 15 percent when computer use was first introduced, according to Channel Post MEA. That is a roughly fivefold improvement in about 16 months.

To accelerate further, Anthropic acquired Seattle-based AI startup Vercept on February 25, 2026. The nine-person team, led by co-founders with deep AI perception and vision expertise, is now fully integrated into Claude's development, according to AdwaitX. Vercept's thesis was that making AI genuinely useful for complex tasks requires solving hard perception and interaction problems, exactly the bottleneck in reliable computer use.

Where Anthropic diverges most from OpenAI is in how it deploys agentic capabilities. Rather than concentrating on the API, Anthropic is pushing agents across multiple product surfaces:

Claude Code for developers, offering terminal-based agentic coding
Cowork for non-developers, an agentic workspace in the Claude Desktop app for macOS that runs inside an isolated virtual machine, handling file management, data extraction, and document workflows through natural language
Claude in Excel and Claude in PowerPoint for office productivity, now powered by Opus 4.6
Claude Opus 4.6 with the longest task-completion time horizon measured by METR: a 50 percent time horizon of 14 hours and 30 minutes, according to Wikipedia's summary of METR evaluations

That last data point matters. Sustained autonomous work over hours, not minutes, is the dividing line between a tool that helps you and an agent that works for you.

❝

Anthropic is not just building a model that can use a computer. It is building an ecosystem where agents work across developer tools, desktop apps, and office software simultaneously.

How They Compare

The benchmark gap between GPT-5.4 and Claude Sonnet 4.6 on computer use is just 2.5 percentage points on OSWorld. In practical terms, that gap is negligible. Both models are at or near human-level performance on desktop navigation tasks, and both will continue to improve rapidly.

The meaningful differences lie elsewhere.

Developer experience. OpenAI's Tool Search is a genuine innovation for developers building agentic systems with large tool ecosystems. By retrieving tool definitions on demand, it reduces token overhead by 47 percent, directly lowering the cost of complex agent workflows. Anthropic has not shipped an equivalent feature yet.

Consumer accessibility. Anthropic is ahead here. Cowork brings agentic capabilities to knowledge workers who will never open a terminal. OpenAI is primarily shipping computer use through the API and Codex, which are developer-facing products. ChatGPT's existing agent mode has been available but, as The Decoder noted, it "worked unreliably and was rarely used."

Long-horizon autonomy. Anthropic's METR results for Opus 4.6 suggest Claude can sustain coherent autonomous work for hours. OpenAI has not published comparable long-horizon data for GPT-5.4.

Safety architecture. Anthropic's Cowork runs inside an isolated virtual machine with explicit permission gates for destructive actions. OpenAI has not detailed equivalent guardrails for GPT-5.4's computer use capabilities beyond its standard safety stack.
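
A permission gate of the kind described here can be reduced to a small pattern: routine actions run directly, while destructive ones wait for explicit approval. The sketch below shows that pattern only. It is not Anthropic's implementation, and the action names, the `DESTRUCTIVE_ACTIONS` set, and `run_with_gate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical list of action kinds that need human sign-off.
DESTRUCTIVE_ACTIONS = {"delete_file", "overwrite_file", "send_payment"}


@dataclass
class Action:
    kind: str
    target: str


def run_with_gate(
    action: Action,
    approve: Callable[[Action], bool],
    execute: Callable[[Action], str],
) -> str:
    """Run safe actions directly; ask for approval before destructive ones."""
    if action.kind in DESTRUCTIVE_ACTIONS and not approve(action):
        return f"blocked: {action.kind} on {action.target}"
    return execute(action)


executed: list[str] = []


def execute(action: Action) -> str:
    executed.append(action.kind)
    return f"done: {action.kind} on {action.target}"


def deny_all(action: Action) -> bool:
    return False  # stands in for a user clicking "Deny"


read_result = run_with_gate(Action("read_file", "report.xlsx"), deny_all, execute)
delete_result = run_with_gate(Action("delete_file", "report.xlsx"), deny_all, execute)
```

With approval denied, the read goes through and the delete is blocked before it reaches `execute`. For payments, the same gate is where a spending limit or a second approver would plug in.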

Token efficiency. GPT-5.4's Tool Search gives OpenAI an edge in multi-tool environments. Anthropic's MCP (Model Context Protocol) ecosystem takes a different approach, standardising how AI models connect to external tools, which may prove more durable as the tool landscape fragments.

❝

The real competition is not about which model scores higher on a benchmark. It is about which company builds the agent platform that enterprises trust to operate inside their workflows.

The Payments Question Nobody Is Asking

Here is where this gets interesting for anyone in payments, commerce, or fintech.

If AI agents can navigate software, click buttons, fill in forms, and complete multi-step workflows autonomously, they will also make purchases. They will approve invoices. They will manage subscriptions. They will compare prices across vendors and execute transactions on your behalf.

This is not theoretical. OpenAI's GPT-5.4 benchmarks explicitly highlight financial modelling and spreadsheet automation. Anthropic's Cowork already operates inside file systems and applications. The path from "AI manages my spreadsheet" to "AI initiates a purchase order" is short.

That raises questions that the AI industry is not yet answering:

Authentication. Current payment authentication assumes a human is present. Strong Customer Authentication under PSD2 requires at least two of three independent factors: knowledge (something the customer knows), possession (something the customer has), or inherence (something the customer is). When the "customer" is an AI agent acting on a human's behalf, which factors apply? How does a merchant verify that the agent is authorised to transact?
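
The two-of-three rule is easy to state as code, which also makes the gap visible. The following is a minimal sketch of the category check only, not a compliant SCA implementation; `FactorCategory` and `satisfies_two_of_three` are names invented for illustration.

```python
from enum import Enum


class FactorCategory(Enum):
    KNOWLEDGE = "knowledge"    # something the customer knows, e.g. a PIN
    POSSESSION = "possession"  # something the customer has, e.g. a phone
    INHERENCE = "inherence"    # something the customer is, e.g. a fingerprint


def satisfies_two_of_three(presented: list[FactorCategory]) -> bool:
    """True when the presented factors span at least two distinct categories."""
    return len(set(presented)) >= 2


pin_and_phone = satisfies_two_of_three(
    [FactorCategory.KNOWLEDGE, FactorCategory.POSSESSION]
)
two_passwords = satisfies_two_of_three(
    [FactorCategory.KNOWLEDGE, FactorCategory.KNOWLEDGE]
)
```

A PIN plus a phone passes; two passwords do not, because both fall in the same category. An autonomous agent has no obvious inherence factor at all, so anything it presents would have to be a credential the human delegated to it.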

Liability. When an AI agent makes a payment error, overpays, or purchases the wrong product, who bears the liability? The user who delegated authority? The AI company whose model made the decision? The merchant who accepted the transaction? Current card network rules were not written for this scenario.

Interchange. Interchange fee structures assume a cardholder-merchant relationship mediated by an issuer and acquirer. When an AI agent initiates a card-not-present transaction on behalf of a user, potentially across multiple merchants in a single workflow, how should interchange be calculated? Is it one transaction or many? Who is the "merchant of record" when the agent is orchestrating across platforms?

The agent as checkout. If your AI agent becomes the primary way you interact with software, the agent layer effectively becomes the new checkout. The company that controls the default agent in enterprise and consumer workflows will control the transaction flow, including where it routes, which payment method it selects, and which merchant it favours.

❝

The agent layer is not just a technology shift. It is a payments distribution shift. Whoever owns the agent owns the checkout.

Visa and Mastercard have both been investing in tokenisation and digital identity infrastructure that could serve as the authentication layer for agent-initiated transactions. Mastercard's work on its Foundry platform and universal trust layer, in particular, suggests the company sees this convergence coming. But neither network has published a public framework for agent commerce.

What to Watch

The agent layer war between OpenAI and Anthropic is now a live competition, and it will accelerate. Here is what to track:

Agent identity standards. Expect early proposals for how AI agents identify themselves to merchants and payment networks. This will likely build on existing tokenisation and delegated authentication frameworks but will require new standards bodies to formalise.
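
One shape such a proposal could take is a signed mandate: the user grants an agent a bounded authority, and the merchant or network verifies the signature and the bounds before accepting a payment. The sketch below is purely illustrative and follows no published standard. `Mandate`, `sign`, `authorise`, and every field name are hypothetical, and the shared-secret HMAC is a simplification; a real scheme would more plausibly use asymmetric signatures.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Mandate:
    agent_id: str
    user_id: str
    max_amount_cents: int
    allowed_merchants: tuple[str, ...]
    expires_at: int  # Unix timestamp


def sign(mandate: Mandate, key: bytes) -> str:
    """Sign a canonical JSON encoding of the mandate with HMAC-SHA256."""
    payload = json.dumps(asdict(mandate), sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def authorise(
    mandate: Mandate,
    signature: str,
    key: bytes,
    merchant: str,
    amount_cents: int,
    now: int,
) -> bool:
    """Accept only if the signature is valid and the payment is within bounds."""
    if not hmac.compare_digest(sign(mandate, key), signature):
        return False  # mandate was altered or signed with another key
    if now >= mandate.expires_at:
        return False  # mandate has expired
    if merchant not in mandate.allowed_merchants:
        return False  # merchant is outside the delegated scope
    return amount_cents <= mandate.max_amount_cents


KEY = b"demo-shared-secret"  # placeholder key for the example
mandate = Mandate("agent-1", "user-1", 5_000, ("acme-supplies",), 2_000_000_000)
signature = sign(mandate, KEY)

within_limit = authorise(mandate, signature, KEY, "acme-supplies", 4_500, 1_900_000_000)
over_limit = authorise(mandate, signature, KEY, "acme-supplies", 6_000, 1_900_000_000)
```

The bounds (amount, merchant list, expiry) travel inside the signed payload, so an agent cannot widen its own authority without invalidating the signature.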

Card network positioning. Watch for Visa and Mastercard announcements on agent-initiated transactions. The network that moves first to define interchange rules and authentication standards for AI agents will have a significant structural advantage.

Regulatory response. The EU's AI Act already classifies certain AI systems by risk level. Autonomous agents that initiate financial transactions are likely to attract scrutiny from both AI regulators and financial services authorities. The UK's FCA and the European Banking Authority have not yet addressed this intersection.

Enterprise adoption patterns. The question is not which model enterprises will choose for computer use. It is which agent platform they will trust to operate inside financial workflows. Trust, auditability, and safety architecture will matter more than raw benchmark scores.

Default agent dynamics. The company whose agent becomes the default in enterprise productivity suites will control an enormous amount of transaction routing. This is why Microsoft's quiet integration of AI agents into Windows 11 and Google's Gemini development deserve attention alongside OpenAI and Anthropic.

When AI agents start making purchases on your behalf, who should be responsible for the transaction: you, the agent, or the company that built it?