LLM · Saturday, April 11, 2026 · 8 min read

Open Source LLM Comparison – Is Opus Cooked?

AI Agents Daily
Curated by AI Agents Daily team · Source: HN LLM

According to the benchmark published at paradise-runner.github.io by the project author "paradise-runner," six open-weight large language models were tested across two real-world task suites, a marketing design challenge and a calendar update tool-calling test, with the entire experiment costing just $0.34 in total API spend. The comparison puts Claude Opus 4.6, priced at $20 per month under a subscription model, directly against a field of cheaper, API-accessible competitors that are closing the quality gap faster than most developers expected.

Why This Matters

The cost-versus-performance math here is brutal for Anthropic. When StepFun's Step 3.5 Flash achieves a score of 90 out of 100 on the marketing design task while running at $0.032 per million input tokens, and Kimi K2.5 scores a 98 on the same task at $0.60 per million input tokens, the question of whether Opus earns its premium becomes genuinely difficult to answer. Claude Opus 4.6 scored a perfect 100 on the marketing design benchmark, but that 2-point advantage over Kimi costs developers orders of magnitude more per run. For teams running hundreds of thousands of agent calls per month, this is not a philosophical debate; it is a line item on a real budget, as the back-of-envelope sketch below illustrates.
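To put a rough number on that line item, here is a minimal sketch using the per-million-token prices quoted in this article. The call volume and token counts per call are illustrative assumptions, not figures from the benchmark, and Opus is excluded because its flat subscription has no per-token price.

```python
# Rough monthly API spend for a hypothetical agent workload, using the
# per-million-token prices cited in the article. Call volume and token
# counts per call are assumptions for illustration only.

CALLS_PER_MONTH = 200_000        # assumption: "hundreds of thousands of agent calls"
INPUT_TOKENS_PER_CALL = 2_000    # assumption
OUTPUT_TOKENS_PER_CALL = 1_000   # assumption

# (input $/M tokens, output $/M tokens) as quoted in the article
PRICING = {
    "Step 3.5 Flash": (0.032, 0.30),
    "Kimi K2.5": (0.60, 3.00),
}

for model, (in_price, out_price) in PRICING.items():
    monthly = CALLS_PER_MONTH * (
        INPUT_TOKENS_PER_CALL * in_price + OUTPUT_TOKENS_PER_CALL * out_price
    ) / 1_000_000
    print(f"{model}: ~${monthly:,.2f}/month")

# Step 3.5 Flash: ~$72.80/month
# Kimi K2.5:      ~$840.00/month
```

Under those assumptions, the gap between the two API-priced models alone is roughly an order of magnitude, before Opus's subscription usage limits even enter the picture.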


The Full Story

The benchmark project is straightforward in its methodology and honest about its scope. Six models were tested against identical prompts across two suites, with every raw output linked publicly so readers can verify the results themselves. That transparency is refreshing in a space where benchmark manipulation is a legitimate concern.

The marketing design task was scored on a 100-point scale, and the results bunched together tightly at the top. Claude Opus 4.6 led with a perfect 100. Kimi K2.5 from Moonshot AI came in at 98. Z-AI's GLM 5.1 and MiniMax M2.7 both scored 95. Alibaba's Qwen 3.6 Plus landed at 92, and StepFun's Step 3.5 Flash brought up the rear at 90. In terms of raw quality, every model in this comparison delivered competitive output on a design task.

Where things get interesting is the cost column. Opus at $20 per month is a flat subscription, which means its cost per run depends entirely on how heavily you use it. The benchmark estimates Opus consumed roughly 8 percent of a five-hour usage window for its test. Step 3.5 Flash, by contrast, ran the same marketing design task for $0.00488 per run, with input tokens priced at $0.032 per million and output at $0.30 per million. That is a price difference that changes how you architect entire products.
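For readers who want to sanity-check the per-run figure, a quick sketch of the arithmetic follows. The benchmark does not publish the token counts behind its $0.00488 estimate, so the counts below are assumptions chosen only to show how a per-run cost falls out of per-million-token pricing.

```python
def cost_per_run(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Per-run API cost in dollars, given token counts and $/million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Step 3.5 Flash pricing from the article: $0.032/M input, $0.30/M output.
# Token counts are illustrative assumptions, not published benchmark data.
estimate = cost_per_run(input_tokens=1_500, output_tokens=16_000,
                        input_price_per_m=0.032, output_price_per_m=0.30)
print(f"~${estimate:.5f} per run")  # ~$0.00485, in line with the reported $0.00488
```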

The calendar update test, which measures tool-calling and agentic capability, tells its own story. Only Step 3.5 Flash received a passing grade in that suite, at $0.0026 per run. None of the other models passed, and Opus has no recorded result in the calendar update column at all. The benchmark author's recommendation for AI assistant and agent use cases goes unambiguously to Step 3.5 Flash, described as "great at tool calling and hella cheap." For frontend design, Kimi K2.5 takes the top recommendation, praised for producing cohesive, polished webpages without the visual tells that mark AI-generated layouts, such as overused emoji formatting and broken code block rendering.

The project also notes that MiniMax M2.7 delivers a score-to-dollar ratio of 3,571 and Qwen 3.6 Plus hits 3,321, both representing strong middle-ground options. GLM 5.1 and Kimi K2.5 scored lower on value efficiency despite their strong raw quality numbers, primarily because their per-run costs run higher than the ultra-cheap contenders. DeepSeek 3.2 appears in the pricing table at $0.26 per million input tokens and $0.38 per million output tokens but does not have recorded benchmark scores in the current results set.
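The value-efficiency figures are simply score divided by per-run cost. Below is a small sketch reproducing them from the numbers quoted in this article; GLM 5.1 and MiniMax M2.7 are left out because their per-run costs are not listed here, and small rounding differences against the published ratios are expected.

```python
# Score-to-cost ratio = marketing design score / per-run cost in dollars.
# Scores and per-run costs are the figures quoted in the article.
results = {
    "Step 3.5 Flash": (90, 0.00488),
    "Qwen 3.6 Plus":  (92, 0.0277),
    "Kimi K2.5":      (98, 0.119),
}

for model, (score, cost) in results.items():
    print(f"{model}: {score / cost:,.0f}")

# Step 3.5 Flash: 18,443  (article reports 18,442)
# Qwen 3.6 Plus:   3,321
# Kimi K2.5:         824
```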

Key Details

  • 6 models tested with a total experiment cost of $0.34 across 2 task suites
  • Claude Opus 4.6 scored 100 on the marketing design task, priced at $20 per month subscription
  • StepFun Step 3.5 Flash scored 90 on marketing design at $0.00488 per run and was the only model to pass the calendar update tool-calling test at $0.0026 per run
  • Kimi K2.5 scored 98 on marketing design at $0.119 per run, priced at $0.60 per million input tokens and $3.00 per million output tokens
  • MiniMax M2.7 achieved a score-to-cost ratio of 3,571, the second highest in the comparison
  • Step 3.5 Flash achieved a score-to-cost ratio of 18,442, far ahead of every other tested model
  • Qwen 3.6 Plus from Alibaba scored 92 on marketing design at $0.0277 per run
  • DeepSeek 3.2 listed at $0.26 input and $0.38 output per million tokens with no benchmark scores recorded yet

What's Next

The benchmark is live and linked to raw model outputs, so the developer community can extend it with additional task suites beyond the current two. Watch for the DeepSeek 3.2 results to populate, since its pricing puts it in a competitive spot that could reshuffle the value rankings. If Step 3.5 Flash continues to pass agentic tool-calling tests that Opus is not even entered in, Anthropic will face mounting pressure to clarify exactly which use cases justify the premium tier.

How This Compares

This benchmark arrives in the context of a much larger shift in open-weight model economics. Separate testing reported by Fabio Akita for akitaonrails.com in April 2026 found that only four model families produced functional, non-hallucinated code in production-level benchmarks: Claude Sonnet 4.6, Claude Opus 4.6, Z-AI's GLM 5 and 5.1, and GPT 5.4. That finding reinforces the paradise-runner benchmark's quality results for GLM 5.1, which scored a 95 on marketing design. Akita's work also flagged that models including Kimi, DeepSeek, MiniMax, Qwen variants, Gemini, and Grok 4.20 failed code generation by inventing non-existent APIs, a much harder failure mode than a lower design score.

The pricing pressure on Opus is not coming from a single challenger. In January 2025, DeepSeek released a reasoning model under MIT license that matched OpenAI's o1 on most benchmarks at a reported training cost of $5.9 million, triggering a single-day drop of $589 billion in Nvidia's market capitalization. That event signaled to the industry that the economic moat around frontier closed models was narrowing. The paradise-runner comparison is a small but clear data point in that 18-month trend.

On the ensemble side, LLM Consensus published results on April 2, 2026 from its Expert-Domain Evaluation Benchmark v1.0, testing 100 high-complexity questions across law, clinical medicine, financial regulation, and technical architecture. Their multi-model consensus system matched or outperformed individual top models including GPT 5.4, Claude Opus 4.6, and Gemini 3.1 Pro across all evaluated questions, with measurable improvement in 44.9 percent of cases and zero instances of degradation. That approach, combining multiple cheaper models rather than paying for one expensive flagship, is exactly what benchmarks like this one implicitly argue for. If Step 3.5 Flash handles tool calling, Kimi handles design, and a consensus layer handles complex reasoning, the case for paying Opus-level pricing weakens considerably.

FAQ

Q: Is Claude Opus 4.6 still worth using for AI agents? A: For general agent and tool-calling tasks, the benchmark evidence suggests Step 3.5 Flash is a stronger choice at a fraction of the cost. Opus 4.6 scored the highest on raw design quality, but it did not even record a result on the calendar update tool-calling test, which Step 3.5 Flash passed cleanly at $0.0026 per run.

Q: What is Step 3.5 Flash and who makes it? A: Step 3.5 Flash is a model from StepFun, a Chinese AI lab. In this benchmark it scored 90 on the marketing design task and was the only model to pass the calendar update agentic test. Its input token pricing sits at $0.032 per million, making it one of the most cost-efficient frontier models currently available through an API.

Q: How reliable are benchmarks like this one? A: Single-benchmark results should be treated as directional, not definitive. This particular benchmark is transparent, with every raw model output linked publicly, and the total test cost of $0.34 keeps the methodology reproducible. Cross-referencing against independent tests, like Akita's April 2026 code generation results, strengthens confidence when findings align.

The open-weight model field is moving fast enough that any "winner" declared today should be re-evaluated in 90 days. Developers building AI tools and agent pipelines should treat cost-per-task as a first-class metric alongside quality scores, and consult practical guides for structuring multi-model evaluations before committing to a single provider. Subscribe to the AI Agents Daily newsletter for daily updates on AI agents, tools, and automation.

Our Take

This benchmark matters because it treats cost-per-task as a first-class metric, which is increasingly how teams decide which models to run behind their agents. We are tracking this development closely and will report on follow-up impacts as they emerge.
