LLM · Monday, April 13, 2026 · 7 min read

Your intuition of LLM token usage might be wrong

AI Agents Daily
Curated by AI Agents Daily team · Source: HN LLM
A developer writing at andreani.in published findings on April 13, 2026, that should make anyone running LLM-powered agents stop and recalculate their assumptions. The post, titled "Your intuition of LLM token usage might be wrong," walks through a real coding session using GPT-5.4-mini and the open-source agent harness oh-my-pi, revealing that the numbers behind a typical agentic workflow look nothing like what most developers expect.

Why This Matters

Cache reads are the silent budget killer that almost nobody is accounting for, and this post makes that concrete with actual session data rather than theoretical hand-waving. The gap between 3.6 million regular input tokens and 26.2 million cache reads, a ratio of roughly 7 to 1, means developers building multi-turn agents who count only regular input and output are underestimating their true consumption by nearly an order of magnitude. If LLM providers factor cache reads into usage limits, and the author strongly suspects they do, then every team running long agentic sessions is flying blind on their real burn rate. That is a billing and architecture problem, not a minor footnote.


The Full Story

The experiment started as a routine coding task. The developer spent about 30 minutes working alongside GPT-5.4-mini to change how a service loaded two SQLite databases, switching from startup-time loading to per-request loading. The agent had to read across multiple services in a monorepo, update 5 files, handle a second service called Service B, update a deploy script, and write project documentation. Nothing exotic. A typical mid-sized agentic task.

When the session ended, the oh-my-pi harness surfaced the token summary. The regular input count came in at 3,648,340 tokens and output at 61,676 tokens. Those numbers feel roughly plausible to most developers. The agent read a lot of code, wrote a little. Makes sense.

Then the cache read figure appeared: 26,257,024 tokens. That single number is roughly seven times larger than the regular input count and over 400 times larger than the output count. The author describes it as "a whole magnitude bigger than the regular reads," which is only a slight overstatement at roughly 7 to 1. What is actually happening is that every conversational turn in a long agentic session re-reads the entire accumulated context from cache. The context does not disappear between turns. It sits there, and every new message pulls it back through the model.
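The compounding effect can be sketched with a toy model (this is illustrative code, not the author's, and not how any real provider meters usage): assume each turn appends a fixed number of new tokens and re-reads everything accumulated so far from cache.

```python
# Toy model: in a multi-turn agent session, the full accumulated context is
# re-read from cache on every turn, so cumulative cache reads grow roughly
# quadratically with the number of turns, while new input grows linearly.

def simulate_session(turns, new_tokens_per_turn):
    """Return (total_new_input, total_cache_reads) for a naive session where
    every turn appends `new_tokens_per_turn` tokens and re-reads all context
    accumulated so far from cache."""
    context = 0
    total_input = 0
    total_cache = 0
    for _ in range(turns):
        total_cache += context          # prior context re-read from cache
        total_input += new_tokens_per_turn
        context += new_tokens_per_turn  # context grows monotonically
    return total_input, total_cache

new_input, cache_reads = simulate_session(turns=50, new_tokens_per_turn=4000)
print(new_input, cache_reads, cache_reads / new_input)
```

Under these made-up parameters, 50 turns of 4,000 new tokens each produce 200,000 tokens of regular input but 4.9 million cache-read tokens, a ratio of 24.5 to 1. The exact ratio depends on turn count and context growth, but the quadratic shape is why cache reads dominate long sessions.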

The author ran a quick sanity check to confirm this interpretation. With oh-my-pi reporting the context at 76.6 percent of a 272,000-token window, roughly 208,352 tokens of live context, they sent one more message asking for a summary of changes without reading any files. The new token counts told the story exactly. Regular input went up by 145 tokens (the new message), output went up by 354 tokens (the response), and cache reads went up by 208,384 tokens. That is essentially the full context window being re-read, down to the rounding error. The math checked out almost perfectly.
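The sanity check reduces to simple arithmetic, reproduced here with the numbers reported in the post:

```python
# Reproducing the author's sanity-check arithmetic (figures from the post).
context_window = 272_000
context_fill = 0.766                                 # 76.6 percent full
live_context = round(context_window * context_fill)  # ≈ 208,352 tokens

input_delta = 145       # tokens in the follow-up message
output_delta = 354      # tokens in the response
cache_delta = 208_384   # cache reads added by that single turn

# The cache-read delta should match the live context almost exactly.
print(live_context, cache_delta, cache_delta - live_context)
```

The discrepancy is only 32 tokens, consistent with the 76.6 percent figure itself being rounded, which is why the author treats the interpretation as confirmed.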

The conclusion the author draws is blunt: LLM providers almost certainly factor cache reads into usage limits, even if they are not loudly advertising that fact. Keeping context short is not just good hygiene. It is directly tied to how far your usage allowance actually stretches across a working day or a billing cycle.

Key Details

  • The 30-minute session on April 13, 2026 used GPT-5.4-mini via the oh-my-pi agent harness.
  • Regular input tokens for the full session: 3,648,340.
  • Output tokens for the full session: 61,676.
  • Cache read tokens for the full session: 26,257,024, roughly 7.2 times the regular input count.
  • The context window at session end sat at 76.6 percent of a 272,000-token limit, approximately 208,352 tokens.
  • A single follow-up message added 145 input tokens, 354 output tokens, and 208,384 cache read tokens.
  • The combined total token count across all categories, including the follow-up sanity-check message: 30,175,923.
  • The agent updated 5 files across multiple services in a monorepo during the session.

What's Next

Developers building production agents need to start treating context length as a first-class cost variable, not a convenience setting. Tools that surface cache read counts alongside regular input and output counts will become essential for any team running long multi-turn sessions, and agent harnesses like oh-my-pi that already expose this data are ahead of the curve. Expect more pressure on providers like OpenAI and Anthropic to publish explicit documentation on how cache reads factor into rate limits and billing, because the developer community is now paying close attention.

How This Compares

Anthropic introduced prompt caching for Claude in August 2024, explicitly pricing cache reads at a discount compared to regular input tokens. At launch, cached tokens cost roughly 10 percent of the standard input token price. That pricing model acknowledges that cache reads happen constantly in long sessions, but it also means they are absolutely being counted and billed. The findings in this post map directly onto that reality. OpenAI followed with similar caching mechanics for its API, and the pricing structure there also bills cache reads separately from standard input.
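To see how such a discount plays out at this scale, here is a rough sketch using hypothetical per-million-token rates (the post does not give GPT-5.4-mini's pricing; the rates below are illustrative, with the 10 percent cache discount mirroring Anthropic's launch pricing):

```python
# Illustrative cost split for the session in the post, using HYPOTHETICAL
# per-million-token rates (not real GPT-5.4-mini pricing): standard input
# at $3.00/M, cache reads at 10% of that, output at $15.00/M.
INPUT_RATE = 3.00 / 1_000_000
CACHE_RATE = INPUT_RATE * 0.10   # 10% of input, per the Claude launch pricing
OUTPUT_RATE = 15.00 / 1_000_000

input_tokens = 3_648_340    # regular input, from the post
cache_tokens = 26_257_024   # cache reads, from the post
output_tokens = 61_676      # output, from the post

input_cost = input_tokens * INPUT_RATE
cache_cost = cache_tokens * CACHE_RATE
output_cost = output_tokens * OUTPUT_RATE

print(f"input ${input_cost:.2f}  cache ${cache_cost:.2f}  output ${output_cost:.2f}")
```

Under these assumed rates the split comes out to roughly $10.95 for regular input, $7.88 for cache reads, and $0.93 for output: even at a 90 percent discount, cache reads rival the entire regular input bill.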

What makes this post distinctive is not the existence of cache reads, which is documented API behavior, but rather the sheer scale of the ratio that emerges in a real agentic workflow. Most documentation around prompt caching talks about it as a cost-saving feature. This post flips that framing and shows that cache reads are actually the dominant cost driver, not a footnote to the main event.

Compare this to ongoing discussions in the developer community about context window management for retrieval-augmented generation systems. The conventional advice has been to use as much context as needed for accuracy. That advice remains sound, but this experiment adds a sharper edge to it: every token you leave in context is being re-read on every single turn. For a 30-minute session with dozens of turns, that multiplication effect is enormous. Teams building AI tools for enterprise use cases should factor this into architecture decisions around when to summarize and compress context mid-session rather than letting it grow unchecked.

FAQ

Q: What are cache read tokens in an LLM session? A: Cache read tokens are the tokens the model re-reads from your accumulated conversation context at every turn. Even if you only send a short new message, the model processes the full history of the conversation from cache. In a long session, these cache reads add up to far more tokens than your actual new inputs or outputs.

Q: Do LLM providers charge for cache read tokens? A: Yes, major providers including Anthropic and OpenAI bill for cache reads, typically at a discounted rate compared to standard input tokens. Anthropic priced Claude cache reads at roughly 10 percent of standard input cost when the feature launched in 2024, but the volume of cache reads in a long agentic session can still make them the largest cost category overall.

Q: How can I reduce cache token usage in my AI agent? A: Keep your context window as short as practical by summarizing completed work and removing irrelevant history as sessions progress. The core principle is that every token sitting in your context gets re-read on every turn, so trimming unnecessary content has a compounding benefit across a long session.
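One possible shape for that trimming, sketched as a hypothetical helper (`trim_context` is not part of oh-my-pi or any real framework; a production agent would ask the model to write the summary):

```python
# Hypothetical context-trimming helper: once the running context exceeds a
# token budget, collapse the oldest messages into a single summary
# placeholder so later turns re-read fewer cached tokens.

def trim_context(messages, count_tokens, budget, keep_recent=4):
    """messages: list of strings, oldest first.
    count_tokens: callable estimating the token count of a string.
    Returns a new message list, trimmed when the total exceeds `budget`."""
    total = sum(count_tokens(m) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # A real agent would summarize `old` with an LLM call; a placeholder
    # stands in here to show the shape of the transformation.
    summary = f"[summary of {len(old)} earlier messages]"
    return [summary] + recent

msgs = [f"message {i} " * 50 for i in range(20)]
trimmed = trim_context(msgs, lambda s: len(s.split()), budget=1000)
print(len(trimmed))  # the keep_recent newest messages plus one summary line
```

The design choice worth noting: trimming invalidates the provider-side cache prefix for the dropped region, so it trades a one-time cache re-write for smaller cache reads on every subsequent turn, which pays off quickly in long sessions.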

The gap between developer intuition and LLM billing reality is growing as agent sessions get longer and more complex, and posts like this one from andreani.in are doing important work in closing that gap with real data. Any team running production agents should run this same analysis on their own sessions before they hit a billing surprise.

