Stop comparing price per million tokens: the hidden LLM API costs
A new analysis from TensorZero shows that comparing LLM API costs by price per million tokens is fundamentally broken, because different models tokenize the same text very differently. On tool-heavy workloads, Anthropic's claude-opus-4-7 ends up costing 5.3 times as much as OpenAI's gpt-5.4.
Gabriel Bianconi, writing for TensorZero's blog on April 16, 2026, published a detailed breakdown of how tokenizer differences between major LLM providers create massive hidden cost gaps that standard pricing comparisons completely miss. The piece has quietly started circulating among developers building production AI systems, and the findings are the kind that should change how engineering teams budget for API costs.
Why This Matters
The price-per-million-tokens metric has become the de facto standard for comparing LLM APIs, and it is almost entirely useless for making accurate cost decisions. Bianconi's data shows effective cost differences of up to 5.3 times between providers on identical inputs, which means an engineering team that picks a model based on list price alone could be spending more than five times what they budgeted. This is not a rounding error. For any company running inference at scale, that gap is the difference between a profitable product and one that bleeds cash.
The Full Story
The entire LLM pricing conversation has been built on a shaky assumption: that a token is a token, regardless of which provider you use. It is not. Every major provider, including OpenAI, Anthropic, and Google, uses its own tokenizer, which is the algorithm that splits input text into the small chunks the model actually processes. The same sentence can produce a completely different number of tokens depending on which tokenizer handles it. Bianconi's team at TensorZero decided to actually measure this rather than assume. They sent identical inputs through the official token-counting APIs from each major provider and recorded the results. The inputs covered four categories: plain text drawn from Project Gutenberg's version of The Iliad, JSON data from Cloudflare's OpenAI API specification trimmed to 2 million characters, YAML from the same source, and a set of 100 synthetically generated tool definitions covering name, description, and schema.
The results were striking. Using OpenAI's gpt-5.4 as the baseline, Anthropic's claude-opus-4-7 produced 2.65 times more tokens on identical tool definitions. That means before you even factor in per-token pricing, you are already paying for 165 percent more tokens than you would with OpenAI for the exact same input. Google's gemini-3.1-pro-preview sat in the middle, producing 1.82 times more tokens than OpenAI on tool inputs but staying much closer on plain text at just 1.06 times.
When Bianconi multiplied tokenizer efficiency by each model's list price to calculate an effective cost per million real-world tokens, the rankings shifted dramatically. Gemini at $2.00 per million tokens becomes the cheapest option for text and JSON, coming in at $2.12 and $2.22 effective cost respectively. But on tool definitions, that same model jumps to $3.64 effective cost, making it 46 percent more expensive than gpt-5.4 at $2.50. Claude-opus-4-7, which lists at $5.00 per million input tokens, lands at a staggering $13.25 effective cost per million tokens on tool-heavy workloads.
The key insight here is that the cheapest provider is not a fixed answer. It depends entirely on what you are sending. A product that processes mostly plain text might find Gemini genuinely cheapest. A product built around function calling and complex tool schemas might find OpenAI cheapest despite a nominally higher list price. The only way to know is to run your actual workload through each provider's counting API and do the math.
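The effective-cost arithmetic described above is simple enough to sketch in a few lines of Python. The model names, list prices, and token-count ratios below are the figures reported in the article; everything else is illustrative:

```python
# Effective cost per million "real-world" tokens = list price scaled by how
# many more tokens a model's tokenizer emits than the baseline (gpt-5.4)
# on the same input. Ratios are TensorZero's reported measurements.

PRICE_PER_M = {
    "gpt-5.4": 2.50,
    "gemini-3.1-pro-preview": 2.00,
    "claude-opus-4-7": 5.00,
}

# Token-count ratio vs. gpt-5.4 on identical inputs, per content type.
RATIO = {
    "tools": {"gpt-5.4": 1.00, "gemini-3.1-pro-preview": 1.82,
              "claude-opus-4-7": 2.65},
    "text":  {"gpt-5.4": 1.00, "gemini-3.1-pro-preview": 1.06},
}

def effective_cost(model: str, content_type: str) -> float:
    """List price inflated by the model's tokenizer overhead."""
    return PRICE_PER_M[model] * RATIO[content_type][model]

for model in ("gpt-5.4", "gemini-3.1-pro-preview", "claude-opus-4-7"):
    print(model, round(effective_cost(model, "tools"), 2))
```

Running this on the tool-definition ratios reproduces the article's rankings: the nominally cheapest list price (Gemini at $2.00) is not the cheapest effective cost once tokenizer inflation is applied.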
Key Details
- Claude-opus-4-7 generates 2.65 times more tokens than gpt-5.4 on identical tool definitions.
- On tool-heavy workloads, the effective cost of claude-opus-4-7 is $13.25 per million tokens compared to $2.50 for gpt-5.4.
- That makes the real-world cost difference 5.3 times, despite list prices being only 2 times apart ($5.00 versus $2.50).
- Gemini-3.1-pro-preview is actually cheaper than gpt-5.4 for plain text, at an effective $2.12 versus $2.50.
- Gemini becomes 46 percent more expensive than gpt-5.4 on tool definitions, despite a lower list price.
- The JSON data used came from Cloudflare's OpenAI API specification, trimmed to 2 million characters, making it a realistic enterprise workload.
- Simon Willison has published more than 67 blog posts tracking LLM pricing variations, suggesting this problem is widely recognized but not yet standardized.
What's Next
Engineering teams that care about inference costs need to build tokenizer benchmarking into their model evaluation process, treating it as seriously as they treat accuracy benchmarks. Providers have little financial incentive to make this easier, so the burden falls on developers to measure it themselves using each provider's official counting APIs. Watch for third-party tooling to emerge that automates this measurement across providers, particularly as the cost differences at scale become impossible for product teams to ignore.
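As a sketch of what such in-house benchmarking could look like, assuming you wrap each provider's official counting API in a callable: the provider names and the characters-per-token stub counters below are purely hypothetical stand-ins, not real APIs.

```python
from typing import Callable, Dict

def benchmark(samples: Dict[str, str],
              counters: Dict[str, Callable[[str], int]],
              price_per_m: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Dollar cost to send each sample once through each provider."""
    return {
        sample_name: {
            provider: count(text) * price_per_m[provider] / 1_000_000
            for provider, count in counters.items()
        }
        for sample_name, text in samples.items()
    }

# Stub counters standing in for real token-counting APIs (illustrative only).
counters = {
    "provider_a": lambda text: len(text) // 4,  # ~4 chars per token
    "provider_b": lambda text: len(text) // 3,  # less efficient tokenizer
}
price = {"provider_a": 2.50, "provider_b": 2.00}

report = benchmark({"tools": "x" * 1_200_000}, counters, price)
```

Even with these toy numbers, provider_b's lower list price loses once its tokenizer overhead is counted, which is exactly the inversion the TensorZero data shows.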
How This Compares
Perplexity AI's Sonar Deep Research model, introduced in 2025, illustrates exactly how much worse this problem can get when you add dimensions beyond tokenizer differences. Perplexity charges separately for reasoning tokens at $3 per million, input tokens at $2 per million, and output tokens at $8 per million, plus $5 per 1,000 search queries. A single documented API call consumed 95,305 reasoning tokens alongside just 19 prompt tokens, which means naive per-token estimates would have missed the dominant cost entirely. Bianconi's tokenizer analysis is the simpler version of a much deeper pricing complexity problem.
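The arithmetic behind that documented call is easy to verify. Only the reasoning-token and prompt-token components are reported, so the output-token and search-query charges are omitted here:

```python
# Multi-component pricing from the documented Sonar Deep Research call:
# 95,305 reasoning tokens at $3/M and 19 prompt tokens at $2/M.
reasoning_cost = 95_305 / 1_000_000 * 3.00  # about $0.286
prompt_cost = 19 / 1_000_000 * 2.00         # about $0.00004

# A naive per-prompt-token estimate misses essentially the entire bill:
# reasoning tokens account for over 99.9% of the two reported components.
```

This is the same failure mode as the tokenizer gap, one level deeper: the advertised per-token figure is real, but it is not the number that dominates the invoice.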
Compare this to the broader industry conversation Simon Willison has been tracking. His custom pricing calculator, which exists precisely because no provider offers a standardized comparison tool, reflects the same gap Bianconi is pointing at. When individual researchers have to build their own spreadsheets and calculators just to compare what two providers will actually charge for the same job, that is a market failure in transparency. The AI Agents Daily has covered similar gaps in how providers disclose context window behavior, and the tokenizer issue fits the same pattern of advertised metrics that obscure real-world costs.
What makes Bianconi's contribution distinct is the rigor of the methodology. Rather than relying on anecdotes or estimates, TensorZero used each provider's official token-counting API on real-world data sources, which makes the findings reproducible. Earlier coverage of this general issue tended toward theoretical arguments. This is empirical data, and it will be harder for the industry to dismiss.
FAQ
Q: What is a tokenizer and why does it affect my API bill? A: A tokenizer is the algorithm that breaks your input text into smaller chunks called tokens before an LLM processes it. Different providers use different tokenizers, so the same text might become 100 tokens on one platform and 265 tokens on another. Since you pay per token, a less efficient tokenizer on the same input means a higher bill.
Q: Which LLM provider has the cheapest tokenizer? A: Based on TensorZero's April 2026 analysis, OpenAI's gpt-5.4 has the most efficient tokenizer overall, especially on tool definitions. Google's Gemini is actually cheaper for plain text and JSON inputs. Anthropic's Claude models, particularly claude-opus-4-7, produce significantly more tokens than either competitor on the same content.
Q: How do I find out the real cost before committing to a provider? A: Run your actual production inputs through each provider's official token-counting API, record the token counts, and multiply by their list prices. Do this for each content type your application sends, because the cheapest provider for plain text may not be cheapest for JSON or tool definitions. There is no shortcut that replaces measuring your specific workload.
Developers building at scale can no longer afford to treat list pricing as a reliable guide to actual spending. As model capabilities converge and the competitive pressure on margins increases, understanding your real per-workload cost becomes a genuine competitive advantage. Subscribe to the AI Agents Daily newsletter for daily updates on AI agents, tools, and automation.