Ben Thompson named the split this week. The next compute build looks different from the last one, and the pricing model that funded the last one is the first thing to break.
Stratechery dropped The Inference Shift on Monday. Buried inside an essay about chip architecture is a market call that lands on every CFO scaling AI workloads in 2026.
The argument is simple. Most of the inference economics we use today were designed around the constraint that a human is waiting on the other end. When the customer is an agent, that constraint goes away. And when latency stops being binding, a lot of pricing assumptions go with it.
"Lower speed isn't nearly as important a consideration if there isn't a human in the loop."
That is one sentence. It is also a thesis about which kinds of compute get built, who builds them, and what they cost.
When the customer changes, the cost function changes
Answer inference is what we have today. A person types a prompt. The model returns a response. Everything in the stack, from the GPUs and high-bandwidth memory to the network fabric and the batching policy, is tuned to deliver that response inside a window the person tolerates. A second is fine. Three seconds gets noticed. Ten and the user closes the tab.
Agentic inference is different. An agent ingests a task, decomposes it into steps, calls tools, waits on results, retries, and reports back. The user is doing something else. The relevant time scale is the task, not the token. A coding agent that takes 90 seconds to assemble a verified pull request is useful. The same agent at 9 seconds is not ten times more useful. It is the same outcome, faster.
That asymmetry breaks the case for premium-priced speed. Thompson's framing is direct: "if the entire system is mostly waiting on memory, then chips don't need to be as fast as the cutting edge either."
What he is naming is a rebalancing. The expensive part of inference today is the silicon that makes it fast. Make latency a soft constraint and the dollar shifts.
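To make that buying decision concrete, here is a toy version of it. The tier names, speeds, and prices below are hypothetical, not anyone's price list, but the logic is the one agentic workloads unlock: buy the cheapest serving that still finishes inside the task's deadline, not the fastest serving available.

```python
# Toy model of the procurement logic described above: when the deadline is the
# task, not the token, you buy the cheapest serving that still meets it.
# Tier names, speeds, and prices are hypothetical, purely for illustration.
from dataclasses import dataclass

@dataclass
class ServingTier:
    name: str
    tokens_per_second: float           # effective end-to-end generation speed
    dollars_per_million_tokens: float

TIERS = [
    ServingTier("latency-optimized", 300.0, 12.0),
    ServingTier("throughput-optimized", 60.0, 3.0),
]

def cheapest_tier_meeting_deadline(task_tokens: int, deadline_seconds: float) -> ServingTier:
    """Pick the lowest-cost tier that finishes the task inside the deadline."""
    feasible = [t for t in TIERS if task_tokens / t.tokens_per_second <= deadline_seconds]
    return min(feasible, key=lambda t: t.dollars_per_million_tokens)

# A human waiting on a chat answer: the tight deadline forces the premium tier.
print(cheapest_tier_meeting_deadline(task_tokens=500, deadline_seconds=5).name)
# An agent assembling a pull request while the user does something else:
# the loose deadline lets the cheap tier win.
print(cheapest_tier_meeting_deadline(task_tokens=20_000, deadline_seconds=600).name)
```

The chat request has to land on the premium tier. The 90-second pull request does not, and the spread between those two prices is the pricing assumption that breaks.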
What memory hierarchy actually means
Current GPU architecture has a well-documented inefficiency. During the prefill phase, the model processes the whole prompt in parallel: compute runs flat out while the high-bandwidth memory has headroom to spare. During the decode phase, tokens come out one at a time: the high-bandwidth memory is saturated streaming weights and KV cache while the compute sits underused. Thompson calls it "alternating between stranding high-bandwidth memory and stranding compute." Most production stacks paper over the imbalance with aggressive batching and prefix caching.
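The imbalance follows from arithmetic intensity: how many floating-point operations the chip gets to do per byte it pulls from memory. A rough sketch, using an assumed hidden size and an assumed hardware ratio rather than any particular GPU's spec sheet:

```python
# Back-of-envelope arithmetic-intensity sketch of why prefill and decode stress
# different parts of the chip. The hidden size and the hardware FLOPs-per-byte
# ratio are illustrative assumptions, not any specific GPU or model.

def arithmetic_intensity(tokens_per_pass: int, d_model: int) -> float:
    """FLOPs per byte of weight traffic for one d_model x d_model matrix multiply.

    FLOPs: 2 * tokens * d_model^2 (one multiply-add per weight per token).
    Bytes: 2 * d_model^2 (fp16 weights read once per pass; activations ignored).
    """
    flops = 2 * tokens_per_pass * d_model ** 2
    weight_bytes = 2 * d_model ** 2
    return flops / weight_bytes

D_MODEL = 8192            # assumed hidden size
HW_FLOPS_PER_BYTE = 300   # assumed ratio of peak compute to memory bandwidth

prefill = arithmetic_intensity(tokens_per_pass=2048, d_model=D_MODEL)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_pass=1, d_model=D_MODEL)      # one new token at a time

for phase, intensity in (("prefill", prefill), ("decode", decode)):
    bound = "compute-bound" if intensity > HW_FLOPS_PER_BYTE else "memory-bandwidth-bound"
    print(f"{phase}: {intensity:.0f} FLOPs per byte -> {bound}")
```

Batching raises decode's effective tokens per pass by amortizing each weight read across many concurrent requests, which is how today's stacks keep the compute busy.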
Agentic workloads do not need the same trick. With more tolerance for latency, the system can do its waiting in cheaper memory. Traditional DRAM. Larger pools of object storage. Vector databases that sit closer to the model than they do to the user. The compute envelope shrinks. The memory envelope grows.
This is not an academic distinction. It is the difference between paying for an H100 cluster and paying for storage with a model attached.
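A minimal sketch of what "do the waiting in cheaper memory" can look like, reduced to a two-tier store with made-up tier roles. Production systems use paged, prefix-aware caches and more tiers than this; the placement logic is the point.

```python
# Minimal sketch of a tiered context store: keep only what agents are actively
# touching in the fast, expensive tier and spill everything else to a cheap one.
# Tier roles and capacities are hypothetical, purely for illustration.
from collections import OrderedDict

class TieredContextStore:
    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()   # stands in for HBM/DRAM: small, expensive
        self.cheap = {}             # stands in for object storage: large, slow

    def put(self, context_id: str, kv_blob: bytes) -> None:
        """Insert or refresh a context, spilling the coldest one to cheap storage."""
        self.fast[context_id] = kv_blob
        self.fast.move_to_end(context_id)
        while len(self.fast) > self.fast_capacity:
            cold_id, cold_blob = self.fast.popitem(last=False)
            self.cheap[cold_id] = cold_blob

    def get(self, context_id: str) -> bytes:
        """Fetch a context, promoting it back to the fast tier if it was spilled."""
        if context_id in self.fast:
            self.fast.move_to_end(context_id)
            return self.fast[context_id]
        kv_blob = self.cheap.pop(context_id)   # slow path: an idle agent can afford the wait
        self.put(context_id, kv_blob)
        return kv_blob

store = TieredContextStore(fast_capacity=2)
for task in ("agent-a", "agent-b", "agent-c"):
    store.put(task, b"kv-cache-for-" + task.encode())
print(list(store.fast), list(store.cheap))   # agent-a has been spilled to the cheap tier
```

The fast tier holds only what is being touched right now. Everything else waits in the tier priced like storage, which is exactly where an agent that will not resume a task for minutes can afford to leave it.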
Agentic inference will be less about GPUs answering a question and more about the memory hierarchy answering the next ten before they are asked.
The cutting-edge research is moving in the same direction. Sakana AI and NVIDIA's work on TwELL kernels induced over 99 percent sparsity in feedforward layers and translated it into a 20.5 percent inference speedup. That is a memory-bandwidth argument dressed as a compute argument. Meta and Stanford's Byte Latent Transformer cuts memory bandwidth by more than 50 percent. The headline is not "make compute faster." It is "make the system spend less time waiting on memory."
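The reason those results read as memory arguments is Amdahl-style arithmetic: a saving inside one component of a decode step only speeds up the fraction of the step that component occupied. The fractions below are illustrative assumptions, not figures from either paper.

```python
# Amdahl-style sketch: in a bandwidth-bound decode step, skipping some traffic
# only speeds up the share of the step that traffic occupied. The fractions
# below are assumptions for illustration, not measurements from the papers above.

def end_to_end_speedup(component_fraction: float, component_saving: float) -> float:
    """Overall speedup when one component of a step loses `component_saving` of its
    time and the rest of the step is untouched."""
    remaining = (1 - component_fraction) + component_fraction * (1 - component_saving)
    return 1 / remaining

# Skip 99% of feedforward weight traffic, assuming it was 20% of the step:
print(f"{end_to_end_speedup(0.20, 0.99):.2f}x")   # ~1.25x
# Halve all memory traffic, assuming the step is 90% memory-bound:
print(f"{end_to_end_speedup(0.90, 0.50):.2f}x")   # ~1.82x
```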
Three markets, not one
Thompson splits the future of inference into three.
Training stays Nvidia's. The job is throughput at the frontier, every flop matters, and the software stack around CUDA is two product cycles ahead of any challenger. Nothing in the agentic shift erodes that.
Answer inference stays a speed market. Cerebras and Groq built their businesses on shaving milliseconds off the first token. There is real demand for that, in voice agents, in coding assistance, in any product where the human is staring at the output. This market does not disappear. It just stops being the whole market.
Agentic inference is the new market. Thompson calls it "increasingly sophisticated memory hierarchies dominated by high capacity and relatively lower cost memory types, with 'good enough' compute." Read that twice. The hardware customer is no longer paying for the fastest die at the foundry. They are paying for capacity, persistence, and a sensible blend of memory tiers.
The implication for Nvidia is not that the business shrinks. It is that the highest-margin product line stops being the one that funds the next data center. Hyperscalers building for agentic workloads are designing around SKUs chosen for memory capacity and cost per byte, not for peak performance.
The receipts are already in the earnings
Amazon's most recent quarter, analyzed by Stratechery, shows Trainium revenue growing into the workload pattern Thompson is describing. Amazon was widely scored as late on training. It looks well placed on inference, because it stayed long on memory-rich, commodity-priced compute while the rest of the market chased peak Hopper performance. Trainium was a bet that the customer of record for AI compute would eventually buy it the way they buy S3, as commodity capacity, not the way they buy time at a supercomputing center.
Anthropic has been explicit about needing 220,000-plus chips across Google and Amazon for the next training cycle. Less discussed is that the inference side of that footprint is being procured against different priorities. The cluster you train Claude on is not the cluster you serve a million coding agents on.
OpenAI's recent work on supercomputer networking points the same way. The technical brief is dense, but the strategy is not. Serving inference for agents that never stop is a different engineering problem from serving inference for humans who stop after every question.
The pattern is consistent. The companies building the next round of capacity are no longer building for the same customer.
The infrastructure is ahead of the demand
This is where we have to be honest.
In our State of the Stack: Agentic Commerce 2026 report, we tracked x402 at $0.11 settled across five transactions. Seven days later the figure was three orders of magnitude higher. The line is steep. The base is tiny.
Thompson is calling a regime change in compute. The demand to fill that regime is not yet there. A handful of coding agents in production. A few agentic checkout flows running on AWS, Stripe, and Coinbase's new payments rail. Real revenue, real load, several orders of magnitude short of what the hyperscalers are provisioning for.
That gap is the whole story. The compute industry has decided that the agentic workload is the load. The agentic workload has not yet decided that the compute industry is correct.
The infrastructure is ahead of the demand. That is either a bet on the future or a solution looking for a problem.
It is one or the other. There is no in-between.
What we are watching
Three signals will tell us whether Thompson is early-right or just early.
The first is the mix of memory and silicon in next-generation hyperscaler builds. If Trainium-class and custom-silicon revenue grows faster than top-end GPU revenue across the next two earnings cycles, the agentic shift is showing up in the order books.
The second is whether AI coding agents and AI research agents move from minute-scale tasks to hour-scale tasks. The case for memory-rich, latency-tolerant compute gets stronger the longer agents run unattended.
The third is pricing. If inference price-per-token falls faster than it has any commercial reason to, the supply side is racing ahead of the demand side. That is the cleanest sign that the new compute build has overshot the workload it was designed to serve.
Thompson named the shift. The market will tell us, soon enough, whether he named it on time.
Sources
Stratechery: The Inference Shift (May 11, 2026)
Stratechery: Amazon Earnings, Trainium and Commodity Markets (April 30, 2026)
Stratechery: Amazon's Durability (May 5, 2026)
MarkTechPost: Meta and Stanford Byte Latent Transformer (May 11, 2026)
Hugging Face: Foundation Model Building Blocks on AWS (May 11, 2026)
If memory is the new bottleneck, who owns the memory layer? Reply and tell us what you are seeing in your own build.
Charlie Major is a Product Development Manager at Mastercard. The views and opinions expressed in Major Matters are his own and do not represent those of Mastercard.