Tools · Friday, April 17, 2026 · 9 min read

6 Things I Learned Building LLMs From Scratch That No Tutorial Teaches You

Curated by AI Agents Daily team · Source: Towards Data Science

According to Towards Data Science, a practitioner who built GPT-2 entirely from scratch using PyTorch documented 6 architectural and statistical lessons that conventional machine learning courses simply do not teach. The piece zeroes in on optimizations powering modern Transformer architectures, covering everything from how standard LoRA breaks under certain conditions to why rotary positional embeddings beat sinusoidal approaches in real deployments. No author byline was recoverable from the source page, but the publication itself is Towards Data Science, one of the most widely read applied ML outlets on the internet.

Why This Matters

The gap between "I finished the tutorial" and "I can build something that works at scale" is where most ML practitioners get stuck, and this article attacks that gap directly with statistical evidence rather than hand-waving. Standard LoRA, as presented in tutorials that millions of developers have followed, carries a fundamental flaw that actually hurts performance as you increase the rank dimension, and most people using it have no idea. With the fine-tuning tooling market now worth billions and parameter-efficient methods like LoRA baked into frameworks used by tens of thousands of production teams, understanding variance collapse at higher ranks is not academic trivia. It is the difference between a model that degrades quietly and one that performs as expected.


The Full Story

The author implemented GPT-2 from scratch in PyTorch, a process that forces you to confront problems that pre-packaged libraries quietly paper over. The first major finding involves the difference between standard LoRA and RsLoRA, which stands for Rank-Stabilized LoRA. Standard LoRA works by decomposing weight updates into two low-rank matrices, and most tutorials present this as a reliable method for efficient fine-tuning. The problem is that as you increase the rank of those matrices trying to capture more model capacity, the variance of the adapted weights actually shrinks. This counterintuitive behavior means you are getting less expressive fine-tuning right at the moment you expect more. RsLoRA fixes this through a mathematical correction that holds variance stable across different rank values.
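The scaling difference between the two methods can be illustrated numerically. This is a minimal sketch, not the author's actual code: standard LoRA scales the low-rank update by `alpha / r`, while RsLoRA scales by `alpha / sqrt(r)`. With randomly initialized matrices (the `B`-as-random init here is for illustration only; real LoRA initializes `B` to zero), the standard scaling shrinks the update's standard deviation as rank grows, while the rank-stabilized version holds it roughly constant.

```python
import torch

def lora_update(x, A, B, alpha, rank, rank_stabilized=False):
    """Apply a low-rank update (B @ A) to input x with the LoRA scaling factor.

    Standard LoRA scales by alpha / r; RsLoRA scales by alpha / sqrt(r),
    which keeps the update's magnitude stable as rank grows.
    """
    scale = alpha / rank**0.5 if rank_stabilized else alpha / rank
    return (x @ A.T @ B.T) * scale

def update_std(rank, rank_stabilized, d=512, n=2048, alpha=16, seed=0):
    """Measure the std of the update for random inputs and adapters."""
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, d, generator=g)
    A = torch.randn(rank, d, generator=g) / d**0.5
    B = torch.randn(d, rank, generator=g)  # nonzero B purely for illustration
    return lora_update(x, A, B, alpha, rank, rank_stabilized).std().item()

for r in (4, 16, 64, 256):
    print(r, round(update_std(r, False), 2), round(update_std(r, True), 2))
```

Running this shows the standard-scaled update collapsing toward zero as rank increases while the rank-stabilized column stays near `alpha`, which is the variance behavior the article describes.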

The second lesson focuses on positional encoding. The author compared three approaches: learned positional parameters, sinusoidal embeddings, and RoPE, short for Rotary Position Embedding. RoPE encodes position through rotation matrices in the complex plane, and it consistently outperformed the other two approaches in practical tests. The critical advantage is that RoPE preserves relative position information more reliably and generalizes better to sequence lengths longer than those seen during training. Most introductory resources still present sinusoidal embeddings as the default baseline without ever explaining why production systems have largely moved past them.
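RoPE's key property can be shown in a few lines. The sketch below is one common variant (the "split-half" layout used by several open models), not necessarily the formulation from the article: each pair of channels is rotated by an angle proportional to the token's position, so the dot product between a rotated query and a rotated key depends only on their relative offset.

```python
import torch

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Channel pairs (x[:, i], x[:, i + dim/2]) are rotated by position * freq_i,
    so attention scores between rotated vectors depend only on relative position.
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Shift invariance: the score for (query at 2, key at 5) matches (7, 10),
# because both pairs are 3 positions apart.
torch.manual_seed(0)
q0, k0 = torch.randn(8), torch.randn(8)
Q, K = rope(q0.repeat(20, 1)), rope(k0.repeat(20, 1))
print(torch.dot(Q[2], K[5]).item(), torch.dot(Q[7], K[10]).item())
```

This relative-position property is also why RoPE extrapolates to longer sequences more gracefully than learned absolute positions, which simply have no parameters for unseen positions.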

Weight tying was the third lesson, and the insight here is all about context. Weight tying is the practice of sharing parameter matrices between different model components, typically the input embedding and the output projection layer. Tutorials tend to present it as either always good or always bad. The practical reality is more nuanced: specific architectural conditions determine whether weight tying helps or hurts, and understanding those conditions requires the kind of trial-and-error that only comes from actually building the model yourself rather than reading about it.

Quantization stability was the fourth lesson and arguably the most practically urgent given the current push toward smaller, deployable models. When you reduce the numerical precision of model weights and activations, you introduce challenges that existing literature does not fully address. The author documented specific failure modes including gradient flow degradation and activation range instability that only become apparent during actual implementation. These are the problems that make a quantized model behave erratically in production even when it looked fine during evaluation.
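In PyTorch, weight tying is a one-line assignment. The skeleton below (hypothetical names, not the author's model) shares one matrix between the input embedding and the output head, which for GPT-2-sized vocabularies removes tens of millions of parameters:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Minimal decoder skeleton illustrating input/output weight tying."""

    def __init__(self, vocab=1000, dim=64, tie=True):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)
        if tie:
            # Share one parameter matrix between embedding and output head.
            # Gradients from both uses accumulate into the same tensor.
            self.lm_head.weight = self.embed.weight

    def forward(self, idx):
        return self.lm_head(self.embed(idx))
```

Whether this helps is the context-dependent part the article stresses: tying couples the geometry of the input and output spaces, which can act as a useful regularizer in small models but constrain larger ones.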

The fifth and sixth lessons address additional architectural refinements around training stability and computational efficiency discovered through real-world experimentation. The through-line across all 6 findings is that empirical implementation reveals problems that theory predicts poorly, and the machine learning community develops much of its practical knowledge through shared builder experience before that knowledge ever makes it into formal curricula or research papers.

Key Details

  • The author implemented GPT-2, OpenAI's 2019 language model, from scratch using PyTorch as the foundation for these findings.
  • Standard LoRA's variance collapse flaw occurs specifically as rank dimension increases, undermining the method's scalability.
  • RsLoRA corrects this flaw through a mathematical normalization that stabilizes variance across all rank values.
  • RoPE outperformed both learned positional parameters and sinusoidal embeddings in practical tests, particularly on sequences longer than training length.
  • The article was published by Towards Data Science, which reaches over 1 million monthly readers according to the publication's own figures.
  • Quantization stability issues, specifically gradient flow degradation, were documented as problems not adequately covered in existing literature as of April 2026.

What's Next

Expect more practitioners to publish implementation-based findings as the tooling around LLM development matures and more developers attempt production builds beyond tutorial reproductions. The RsLoRA findings in particular should push fine-tuning framework maintainers to review default rank settings in tools like Hugging Face PEFT, where many teams currently rely on configurations inherited from early LoRA papers. Watch for follow-up work on quantization stability specifically, as the race toward sub-7B models that run on consumer hardware makes numerical precision a front-line engineering problem in 2026.

How This Compares

Compare this to the wave of "build GPT from scratch" content that Andrej Karpathy's nanoGPT project and his YouTube series kicked off in 2022 and 2023. Those resources are excellent for foundations, but they deliberately simplify configurations to maximize clarity. What this Towards Data Science piece does differently is report on what breaks when you try to push past those simplified configurations, which is exactly the next chapter that the builder community needed. Karpathy's work answers "how does a Transformer work." This article answers "what actually goes wrong when you try to make it production-grade."

The RoPE findings also align with what researchers at Meta and DeepMind have published about positional encoding in models like LLaMA 2 and Gemma. Both architectures shipped with RoPE as the default, but the published papers focus on benchmark results rather than explaining why the alternative approaches fail in practice. This article fills that explanatory gap in a way that is accessible to practitioners who are not reading arXiv daily.

On the LoRA side, the variance collapse issue is particularly relevant given how heavily the fine-tuning ecosystem has leaned on LoRA since Microsoft's original 2021 paper. Hugging Face's PEFT library, which is the dominant tool for parameter-efficient fine-tuning across thousands of projects tracked on the AI Agents Daily tools directory, implements standard LoRA by default. The RsLoRA correction has existed in research literature, but its absence from most practical guides and tutorials means a huge portion of fine-tuning work is running with a suboptimal default. That is a real problem with a known fix, and articles like this one accelerate adoption.

FAQ

Q: What is LoRA and why does rank matter for fine-tuning?
A: LoRA is a technique that fine-tunes large language models efficiently by adding small trainable matrices instead of updating all parameters. The rank of those matrices controls how much expressive capacity the update has. Higher rank should mean better fine-tuning, but standard LoRA has a statistical flaw where increasing rank actually reduces the variance of weight updates, which weakens the adaptation rather than improving it.

Q: Why do most tutorials use sinusoidal embeddings instead of RoPE?
A: Sinusoidal embeddings were introduced in the original 2017 "Attention Is All You Need" paper and became the default teaching example for positional encoding. RoPE was developed later and requires understanding rotation matrices in the complex plane to explain properly, so most introductory materials stick with the simpler historical baseline even though production systems have moved on.

Q: What does quantization stability mean in plain language?
A: Quantization means converting a model's high-precision floating-point numbers into lower-precision integers to reduce memory and speed up inference. Stability refers to whether the model still behaves predictably after that conversion. Instability shows up as erratic outputs or training failure, caused by issues like gradients not flowing properly or certain activation values exceeding the range that low-precision formats can represent.
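The activation-range failure mode is easy to reproduce. This is a toy sketch of symmetric per-tensor int8 quantization, not the article's method: a single outlier activation stretches the quantization scale, which crushes the resolution available to every other value.

```python
import torch

def quantize_int8(t):
    """Symmetric per-tensor int8 quantization: one scale for the whole tensor."""
    scale = t.abs().max() / 127.0
    q = torch.clamp((t / scale).round(), -127, 127)
    return q.to(torch.int8), scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return q.float() * scale

torch.manual_seed(0)
x = torch.randn(1000)
x_out = torch.cat([x, torch.tensor([50.0])])  # inject one extreme activation

# Mean reconstruction error on the same 1000 values, with and without
# the outlier stretching the scale.
err_plain = (dequantize(*quantize_int8(x)) - x).abs().mean()
err_outlier = (dequantize(*quantize_int8(x_out))[:-1] - x).abs().mean()
print(err_plain.item(), err_outlier.item())
```

The error on the well-behaved values jumps by an order of magnitude once the outlier dominates the scale, which is why production quantization schemes use per-channel scales, clipping, or outlier-aware formats rather than this naive per-tensor approach.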

The practical knowledge gap in LLM development is real, and first-person accounts from developers who have shipped working implementations from scratch remain one of the most valuable resources the community produces. As more teams move from experimenting with hosted APIs to building and fine-tuning their own models, this kind of ground-level engineering insight will only grow in importance.

Our Take

First-person implementation writeups like this one close the gap between finishing a tutorial and shipping a working model. For teams fine-tuning with LoRA at framework defaults, the variance-collapse finding alone is worth a configuration review.
