Speculative decoding puzzle: why one model saw a 665% speed increase
A developer running local AI models discovered that speculative decoding produces wildly different speed boosts depending on which model you use: one model, Devstrall small, hit a 665 percent speed increase while others barely moved the needle. This matters because model choice, not just hardware, can determine whether local inference is fast enough to be usable.
A user in the Reddit LocalLLaMA community recently posted a genuine head-scratcher that has since sparked a deeper conversation about how speculative decoding actually works in practice. According to the LocalLLaMA subreddit post, the experimenter was running llama.cpp with a specific ngram-based speculative decoding configuration and noticed that three different models, given the same task of processing minor code changes, produced speed improvements ranging from 40 percent all the way to 665 percent. The post, which has since been edited to include a repeat-penalty parameter, raised the obvious question: why would the same technique produce such wildly different results?
Why This Matters
A 665 percent speed increase is not a rounding error. That figure means the model is running at roughly 7.65 times its baseline token generation rate, which for local inference on consumer hardware is the difference between a tool you actually use and one you abandon out of frustration. The 40 percent improvement seen with Qwen 3.6 is still meaningful, but it shows that developers cannot simply assume speculative decoding will pay off across the board. The variance here is so large that model selection, not just hardware, becomes the primary lever for performance optimization in local AI deployments.
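As a sanity check on the arithmetic, a percentage "increase" converts to an absolute multiplier like so:

```python
# A percentage "speed increase" converts to an absolute multiplier:
# +665% means baseline + 6.65x baseline = 7.65x baseline throughput.
def multiplier(percent_increase: float) -> float:
    return 1.0 + percent_increase / 100.0

print(round(multiplier(665), 2))  # Devstrall small
print(round(multiplier(100), 2))  # Gemma 4 31b: doubled throughput
print(round(multiplier(40), 2))   # Qwen 3.6
```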
The Full Story
Speculative decoding works by pairing a fast, lightweight draft mechanism with a slower, more capable target model. The draft mechanism, in this case an ngram-based predictor, guesses the next several tokens ahead of time. The larger model then verifies those guesses in a single parallel pass rather than generating each token one at a time. When the guesses are right, you get multiple tokens for the computational cost of roughly one. When they are wrong, the system resets and tries again.
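The accept-then-correct loop can be sketched as a toy simulation (an illustration of the idea, not llama.cpp's implementation), where each drafted token is accepted independently with some probability:

```python
import random

def tokens_per_verify_pass(draft_len: int, accept_prob: float,
                           trials: int = 20_000) -> float:
    """Average tokens produced per target-model pass, assuming each
    drafted token is independently accepted with probability
    accept_prob. Every pass yields at least one token (the target's
    own correction), plus every draft token accepted before the
    first mismatch."""
    rng = random.Random(0)
    total = 0
    for _ in range(trials):
        produced = 1  # the target model always contributes one token
        for _ in range(draft_len):
            if rng.random() < accept_prob:
                produced += 1
            else:
                break  # first wrong guess ends the speculative run
        total += produced
    return total / trials

# A drafter that is right 95% of the time keeps most of a 12-token
# batch; one that is right half the time barely beats plain decoding.
print(round(tokens_per_verify_pass(12, 0.95), 1))
print(round(tokens_per_verify_pass(12, 0.50), 1))
```

The reset-on-mismatch behavior is what makes draft accuracy so decisive: every wrong guess caps the run at whatever prefix was already accepted.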
The llama.cpp configuration in question used spec-type ngram-map-k with a context window of 24 tokens for n-gram prediction, a minimum draft length of 12 tokens, and a maximum draft length of 48 tokens. Those parameters tell the system to look back 24 tokens when predicting what comes next, and to attempt batches of 12 to 48 tokens per speculative cycle. On paper, that is an aggressive configuration designed to maximize parallelism. In practice, the results depended almost entirely on the model being used.
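Assuming the flag names exactly as the post reports them (speculative-decoding options have changed across llama.cpp versions, so confirm them against `llama-server --help` on your build), the configuration maps to a command line roughly like this sketch:

```python
import shlex

# Flag names below are copied from the post as written; verify them
# against your own llama.cpp build before relying on this.
config = {
    "--spec-type": "ngram-map-k",   # ngram-based draft mechanism
    "--spec-ngram-size-n": "24",    # look back 24 tokens when predicting
    "--draft-min": "12",            # smallest speculative batch
    "--draft-max": "48",            # largest speculative batch
}

cmd = ["llama-server", "-m", "model.gguf"]  # model path is a placeholder
for flag, value in config.items():
    cmd += [flag, value]

print(shlex.join(cmd))
```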
Gemma 4 31b doubled its token generation speed, a 100 percent improvement that cuts inference cost per token in half and represents solid, consistent gains. Qwen 3.6 only managed 40 percent, which suggests that its attention mechanisms or vocabulary usage patterns are harder for an ngram predictor to anticipate accurately. Every time the draft mechanism guesses wrong with Qwen 3.6, the system resets, and those resets eat into the theoretical gains from parallelism.
Devstrall small is the anomaly. A 665 percent speed increase implies that the ngram predictor is correctly anticipating the model's output across long token sequences with very high accuracy. For that to happen, Devstrall small's outputs in the context of minor code changes must be highly regular and predictable given a 24-token window. The model is smaller, so its output distributions are simpler, and simpler distributions are exactly what ngram predictors are built to exploit. The user's addition of repeat-penalty 1 likely helped as well, though not by reducing repetition: in llama.cpp, a repetition penalty of 1.0 is the neutral value that disables the penalty entirely, so naturally repetitive token sequences pass through unsuppressed, and repetitive sequences are precisely what an ngram drafter predicts best.
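To see why regular output rewards this approach, here is a toy version of an ngram "map" drafter (our illustration of the idea, not llama.cpp's actual data structure): it records which token followed each n-gram seen so far, then drafts by replaying the most common continuation.

```python
from collections import defaultdict

def build_ngram_map(tokens, n=3):
    """Record which token followed each n-gram, with counts."""
    table = defaultdict(dict)
    for i in range(len(tokens) - n):
        key = tuple(tokens[i:i + n])
        nxt = tokens[i + n]
        table[key][nxt] = table[key].get(nxt, 0) + 1
    return table

def draft(table, context, n=3, max_draft=8):
    """Draft tokens by replaying the most common continuation."""
    out, ctx = [], list(context)
    for _ in range(max_draft):
        key = tuple(ctx[-n:])
        if key not in table:
            break  # no history for this n-gram: stop drafting
        nxt = max(table[key], key=table[key].get)
        out.append(nxt)
        ctx.append(nxt)
    return out

# A perfectly periodic stream drafts long, correct runs:
stream = list("abcabcabcabc")
print(draft(build_ngram_map(stream, 3), list("abc"), 3, 5))
```

On a periodic stream like this one the drafter produces long correct runs; on less regular output it bails out after a token or two, which is the Qwen-versus-Devstrall difference in miniature.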
The broader implication here is that speculative decoding effectiveness is a property of the interaction between the draft mechanism and the specific model's learned behavior, not a universal multiplier you can apply to any inference stack. Infrastructure teams deploying these techniques at scale need to benchmark each model individually rather than extrapolating from published results.
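A minimal per-model benchmark can be as simple as timing a generation call. The harness below is a generic sketch; the `generate` callable is a stand-in for whatever client you use to talk to your inference server, not part of any particular API:

```python
import time

def tokens_per_second(generate, prompt: str, runs: int = 3) -> float:
    """Benchmark any generate(prompt) -> list-of-tokens callable.
    Run it with and without speculative decoding enabled, on the
    same prompt and model, and compare the two rates directly."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard tiny timings
        best = max(best, len(tokens) / elapsed)
    return best
```

Taking the best of several runs reduces noise from caching and warm-up, which matters when the quantity you care about is a ratio between two measurements.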
Key Details
- Devstrall small achieved a 665 percent speed increase, equivalent to 7.65 times baseline token generation speed.
- Gemma 4 31b achieved a 100 percent speed increase, doubling token output.
- Qwen 3.6 achieved a 40 percent speed increase, the lowest of the three tested models.
- The llama.cpp configuration used spec-ngram-size-n 24, draft-min 12, and draft-max 48.
- The task used for all three comparisons was processing minor code changes, keeping the prompt type constant.
- The user added repeat-penalty 1 in an edit, which sets llama.cpp's repetition penalty to its neutral value of 1.0, effectively disabling it.
What's Next
Developers experimenting with local inference should treat speculative decoding configuration as a per-model tuning exercise rather than a set-and-forget optimization. The ngram-size-n and draft-min/max parameters may need to be adjusted independently for each model to find the configuration that maximizes accepted draft tokens per cycle. As more models are released with explicit speculative decoding benchmarks, expect community-maintained compatibility tables to emerge that map specific llama.cpp configurations to real-world speed multipliers for popular models.
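That tuning exercise can be automated as a small grid search. The parameter values below are hypothetical examples, and `benchmark` is a stand-in for a real tokens-per-second measurement of one (model, configuration) run:

```python
from itertools import product

# Hypothetical sweep over the two knobs from the post. `benchmark`
# stands in for a tokens-per-second measurement you run yourself.
def best_config(benchmark,
                ngram_sizes=(12, 16, 24),
                draft_ranges=((8, 24), (12, 48))):
    results = {}
    for n, (dmin, dmax) in product(ngram_sizes, draft_ranges):
        results[(n, dmin, dmax)] = benchmark(n, dmin, dmax)
    return max(results, key=results.get)
```

Because each configuration requires a full benchmark run, keeping the grid coarse (a handful of values per knob) is usually the practical choice.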
How This Compares
Intel Labs and the Weizmann Institute of Science presented research at the International Conference on Machine Learning in Vancouver in 2025 that demonstrated practical speedups of up to 2.8 times faster LLM inference using a technique that allows any small draft model to accelerate any large language model regardless of vocabulary differences. Senior researcher Oren Pereg described it as turning "speculative acceleration into a universal tool." The Devstrall small result of 665 percent blows past that 2.8x benchmark, though the comparison is not entirely apples-to-apples since Intel's figure represents a generalized cross-model result while the Devstrall number reflects a specific model under ideal conditions for ngram prediction. Still, it suggests that under the right circumstances, community-run local inference setups can exceed what controlled lab research documents as typical.
IBM Technology's educational coverage published June 4, 2025, set industry expectations at 2 to 4 times speedups as the general range for speculative decoding without quality loss. The Qwen 3.6 result of 1.4 times sits below that floor, while Devstrall small, at 7.65 times, lands at nearly double the ceiling. This range of outcomes from a single configuration on a single task type is precisely why IBM's 2-to-4-times figure should be read as a rough average, not a guarantee.
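One way to reconcile these numbers: under a simplified model (our assumption, not from either source) where ngram drafting costs nothing and each drafted token is accepted independently with probability p, the speedup is roughly the expected number of tokens per verification pass. An acceptance rate near 0.87 then reproduces the 7.65x figure, while IBM's 2-to-4x band corresponds to acceptance rates of roughly 0.5 to 0.75:

```python
def expected_speedup(p: float, draft_len: int) -> float:
    # Expected tokens per target-model pass: one token the target
    # always produces, plus p**k for the k-th drafted position (all
    # earlier positions must also have been accepted). Drafting is
    # assumed free, which is optimistic but close for ngram lookup.
    return 1 + sum(p ** k for k in range(1, draft_len + 1))

for p in (0.50, 0.75, 0.87, 0.95):
    print(p, round(expected_speedup(p, 48), 2))
```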
Perhaps most interesting is the research submitted to arXiv on March 3, 2026 by Tanishq Kumar, Tri Dao, and Avner May, which introduces "Speculative Speculative Decoding," a technique that parallelizes the verification step itself by having the draft model simultaneously predict likely verification outcomes. If that technique can be layered on top of the ngram approach used in llama.cpp, models like Devstrall small that already show high draft acceptance rates could see compounding gains. That is a real development worth watching for anyone building AI tools and platforms around local inference.
FAQ
Q: What is speculative decoding and how does it speed up AI models? A: Speculative decoding pairs a fast, simple draft mechanism with a slower main model. The draft mechanism guesses several tokens ahead, and the main model verifies those guesses all at once instead of one at a time. When the guesses are correct, you get multiple tokens for roughly the computational cost of generating one, which speeds up the overall output rate.
Q: Why did Devstrall small get so much faster than Qwen 3.6? A: Smaller models tend to produce more predictable token sequences, especially on structured tasks like code editing. The ngram predictor used in this experiment looks back 24 tokens to guess what comes next, and Devstrall small's outputs were apparently regular enough that those guesses were correct far more often than with Qwen 3.6, where mismatches reset the draft cycle and cut into gains.
Q: Can I use speculative decoding on my own local AI setup? A: Yes, llama.cpp supports ngram-based speculative decoding through command-line parameters like spec-type, spec-ngram-size-n, draft-min, and draft-max. Results will vary significantly by model, so you should benchmark each model you plan to run.
The gap between a 40 percent and a 665 percent speed improvement from the same configuration is a reminder that local AI inference is still very much an empirical game where testing beats assumption every time. As the community continues to document these model-specific results, developers will have a clearer map for choosing models that actually perform well under their hardware constraints.




