DIY AI & ML: Solving The Multi-Armed Bandit Problem with Thompson Sampling
Towards Data Science published a hands-on tutorial showing developers how to build a Thompson Sampling algorithm in Python to solve the multi-armed bandit problem. This matters because bandit algorithms power recommendation engines, pricing systems, and A/B testing platforms across the industry.
According to Towards Data Science, the tutorial walks practitioners through constructing a Thompson Sampling algorithm object in Python and applying it to a concrete, real-world inspired scenario. The publication did not surface a named author byline in the scraped content, but the piece is part of Towards Data Science's ongoing "DIY AI and ML" series, which targets readers who want to move beyond importing library functions and actually understand the mechanics underneath. That is a genuinely useful editorial mission, and this installment earns its place in the series.
Why This Matters
Thompson Sampling is not a new idea, but most developers encounter it only as a black-box option inside A/B testing platforms or recommendation APIs. Building it yourself forces you to understand the Bayesian machinery at its core, which directly translates into better debugging and better system design when something breaks in production. DoorDash published a detailed case study in December 2025 showing how multi-armed bandit platforms replace fixed-horizon A/B testing, and companies operating at that scale need engineers who understand regret, not just p-values. The original algorithm dates to 1933, yet the first rigorous proof of its optimal performance only arrived in 2012. That 79-year gap tells you something important: intuition about this algorithm consistently outpaces the formal theory, and practitioners who build it themselves develop that intuition faster.
The Full Story
The multi-armed bandit problem takes its name from a casino floor. Imagine standing in front of several slot machines, each with an unknown payout rate, and having a limited number of pulls to maximize your total winnings. You need to explore enough machines to learn which one pays best, but you also need to exploit the best machine before your budget runs out. This tension between exploration and exploitation shows up everywhere in technology: which ad to show, which recommendation to surface, which price to test.
Thompson Sampling, first described by statistician W. R. Thompson in 1933, solves this problem with a Bayesian approach. Rather than tracking a single estimate for each arm's reward rate, the algorithm maintains a full probability distribution over each arm's possible payout rates. At every decision point, it samples a value from each distribution and picks the arm whose sampled value is highest. Arms the algorithm is uncertain about have wide distributions and occasionally produce high samples, which drives natural exploration. Arms with proven track records have tight distributions centered on high values, which drives exploitation. The balance emerges automatically from the math, with no tuning required.
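The mechanics described above fit in a few lines. The following sketch (arm count, payout rates, and round count are illustrative, not from the tutorial) runs the Beta-Bernoulli version: sample a plausible rate from each arm's posterior, play the argmax, and fold the observed reward back into that arm's distribution.

```python
import random

# Hypothetical setup: three "slot machines" with unknown payout rates.
true_rates = [0.2, 0.5, 0.7]
n_arms = len(true_rates)

# One Beta(alpha, beta) posterior per arm, starting from the uniform prior Beta(1, 1).
alpha = [1.0] * n_arms
beta = [1.0] * n_arms

random.seed(42)
for _ in range(1000):
    # Sample one plausible payout rate from each arm's posterior...
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
    # ...and play the arm whose sampled rate is highest.
    arm = max(range(n_arms), key=lambda i: samples[i])
    reward = 1 if random.random() < true_rates[arm] else 0
    # Bayesian update: a success raises alpha, a failure raises beta.
    alpha[arm] += reward
    beta[arm] += 1 - reward

# The posterior mean for each arm is alpha / (alpha + beta).
estimates = [alpha[i] / (alpha[i] + beta[i]) for i in range(n_arms)]
```

Note that nothing in the loop explicitly decides when to explore: uncertain arms have wide posteriors that occasionally produce the winning sample, exactly as described above.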
The Towards Data Science tutorial operationalizes this into a Python class, giving readers a concrete object they can instantiate, update, and query. The DIY framing is intentional. Wrapping the algorithm in an object forces the author to think about state, specifically which distributions are being maintained, how they get updated after each observation, and what interface the rest of an application needs to interact with the bandit. Those engineering decisions are invisible when you call a third-party API.
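The tutorial's actual class is not reproduced in the article, but an object along the lines it describes might look like this (class and method names are illustrative guesses, not the tutorial's):

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling bandit.

    State is just two numbers per arm: the Beta posterior's
    alpha (successes + 1) and beta (failures + 1) parameters.
    """

    def __init__(self, n_arms: int):
        self.alpha = [1.0] * n_arms  # uniform Beta(1, 1) priors
        self.beta = [1.0] * n_arms

    def select_arm(self) -> int:
        """Sample a rate from each posterior and return the argmax arm."""
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, arm: int, reward: int) -> None:
        """Fold one Bernoulli observation (0 or 1) into the chosen arm's posterior."""
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

    def posterior_mean(self, arm: int) -> float:
        """Current point estimate of the arm's payout rate."""
        return self.alpha[arm] / (self.alpha[arm] + self.beta[arm])
```

The interface mirrors the interaction loop the rest of an application needs: call `select_arm()`, observe a reward, call `update()`, repeat.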
The theoretical performance of Thompson Sampling is well characterized. Shipra Agrawal and Navin Goyal established formal regret bounds at the 25th Annual Conference on Learning Theory in 2012. For a two-armed bandit problem, the expected regret over time T is O(ln T / Δ + 1/Δ³), where Δ is the performance gap between the best and second-best arm. These bounds are considered optimal, and researchers note that Thompson Sampling achieves smaller constants than competing approaches like Upper Confidence Bound algorithms.
Real-world deployment at scale looks like DoorDash's platform, where the company replaced conventional A/B testing workflows that required predetermined sample sizes and fixed traffic splits. Under traditional testing, you commit traffic to a losing variant until the experiment ends, because stopping early violates the statistical assumptions. Bandit algorithms break that constraint by continuously reallocating traffic toward better-performing variants as evidence accumulates, cutting both the time to decision and the cumulative exposure to inferior experiences.
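This is not DoorDash's system, but the reallocation effect is easy to demonstrate in miniature. The sketch below (conversion rates and round count are hypothetical) compares a fixed 50/50 split against Thompson Sampling on one metric: how many visitors are exposed to the inferior variant.

```python
import random

random.seed(0)
rate_a, rate_b = 0.05, 0.15  # hypothetical conversion rates; B is better
rounds = 5000

# Fixed 50/50 split: half of all traffic hits the inferior variant, by design.
fixed_inferior_exposure = rounds // 2

# Thompson Sampling: route each visitor based on the current posteriors.
alpha = {"A": 1.0, "B": 1.0}
beta = {"A": 1.0, "B": 1.0}
ts_inferior_exposure = 0
for _ in range(rounds):
    variant = max("AB", key=lambda v: random.betavariate(alpha[v], beta[v]))
    if variant == "A":
        ts_inferior_exposure += 1
    converted = random.random() < (rate_a if variant == "A" else rate_b)
    alpha[variant] += int(converted)
    beta[variant] += 1 - int(converted)
```

As evidence accumulates, the posterior for B tightens around the higher rate and wins the sampled comparison more and more often, so traffic drains away from A without any explicit stopping rule.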
Key Details
- Thompson Sampling was originally formulated by W. R. Thompson in 1933, making it over 90 years old.
- The first rigorous logarithmic regret proof was published by Agrawal and Goyal at the 25th Annual Conference on Learning Theory in 2012.
- For an N-armed bandit problem, expected regret scales as O((Σᵢ 1/Δᵢ²) · ln T), where Δᵢ is arm i's gap to the best arm.
- DoorDash published a multi-armed bandit platform case study in December 2025, citing fixed traffic allocation as a primary bottleneck in A/B testing workflows.
- A Stanford University tutorial co-authored by researchers from Columbia University, Google DeepMind, and Adobe Research covers Thompson Sampling applications across at least 6 problem types, including product recommendation, assortment optimization, and reinforcement learning.
- The Towards Data Science DIY series targets practitioners building algorithm objects from scratch rather than importing pre-built solutions.
What's Next
Developers who complete this tutorial have a foundation they can extend toward contextual bandits, which add side information about the user or environment to each decision, and linear bandits, which handle high-dimensional action spaces more efficiently. The next practical step for most teams is integrating a Python bandit implementation into an actual experimentation pipeline, comparing its convergence speed against a fixed A/B test on the same traffic. Watch for more industry case studies in 2025 and 2026 as companies that adopted bandit platforms in the last two years start publishing performance data from production deployments.
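As a taste of the contextual direction, here is a minimal linear Thompson Sampling step, which is not covered by the tutorial: maintain a Gaussian posterior over a weight vector, sample from it, and score each action's feature vector. All names, the ridge-style updates, and the exploration scale `v` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 3    # context/feature dimension
v = 0.5  # exploration scale (a tunable prior/noise parameter)

# Posterior sufficient statistics for the weight vector theta:
# B accumulates x x^T, f accumulates reward * x.
B = np.eye(d)
f = np.zeros(d)

def select(actions: np.ndarray) -> int:
    """actions: (k, d) matrix of feature vectors, one row per candidate action."""
    theta_hat = np.linalg.solve(B, f)     # posterior mean
    cov = v ** 2 * np.linalg.inv(B)       # posterior covariance
    theta_sample = rng.multivariate_normal(theta_hat, cov)
    return int(np.argmax(actions @ theta_sample))

def update(x: np.ndarray, reward: float) -> None:
    """Fold one (context, reward) observation into the posterior."""
    global B, f
    B += np.outer(x, x)
    f += reward * x
```

The structure is the same as the Beta-Bernoulli case: sample a plausible model, act greedily with respect to the sample, update on the observation.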
How This Compares
The Towards Data Science tutorial sits at the educational end of a spectrum that runs all the way to DoorDash's December 2025 enterprise case study. Both address the same core algorithm, but they serve completely different audiences. The DIY tutorial builds conceptual fluency. The DoorDash writeup describes traffic orchestration infrastructure. You need both, and the fact that practitioners are writing about this from both angles simultaneously suggests the industry is in a healthy adoption phase, not just theoretical exploration.
Compare this to how UCB algorithms are typically taught. Upper Confidence Bound methods are mathematically intuitive: you add an exploration bonus to each arm's estimated value. But they require explicit tuning of the confidence parameter. Thompson Sampling sidesteps that tuning entirely by letting the posterior distributions do the work. The Erasmus Mundus research comparing Thompson Sampling, UCB, and the Gittins Index for intelligent tutoring systems found that Thompson Sampling's computational efficiency makes it more practical than the theoretically optimal Gittins Index, which becomes intractable for large action spaces. For a practitioner building a Python tool from scratch, that tradeoff matters enormously.
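For contrast, a standard UCB1-style selection rule looks like this (the function and the default value of `c` are illustrative); note the confidence parameter `c`, which is precisely the knob Thompson Sampling does not need:

```python
import math

def ucb1_select(counts, values, c=2.0):
    """Pick the arm maximizing estimated mean + exploration bonus.

    counts: number of pulls per arm; values: running mean reward per arm.
    c is the confidence parameter that must be tuned by hand.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # play every arm once before trusting any estimate
    total = sum(counts)
    scores = [
        values[arm] + math.sqrt(c * math.log(total) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=lambda a: scores[a])
```

The bonus shrinks as an arm is pulled more, so rarely-tried arms win occasionally, but how aggressively depends on `c` rather than emerging from a posterior.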
The broader context here is the democratization of Bayesian methods. Five years ago, building a Thompson Sampling implementation required graduate-level comfort with probability distributions. Today, Towards Data Science tutorials and accessible Python guides lower that bar to motivated intermediate developers. That shift is producing a generation of engineers who understand uncertainty quantification not as an academic concept but as a practical design choice. That is exactly the skill set production AI systems increasingly demand.
FAQ
Q: What is the multi-armed bandit problem in simple terms? A: It is a decision-making scenario where you choose repeatedly from several options with unknown reward rates, trying to maximize total rewards. The core challenge is balancing trying new options to learn their quality against sticking with options you already know are good. The name comes from slot machines, which are sometimes called one-armed bandits.
Q: How is Thompson Sampling different from a standard A/B test? A: A standard A/B test splits traffic evenly and holds that split fixed until a predetermined sample size is reached. Thompson Sampling continuously updates its beliefs about each option and automatically sends more traffic toward better-performing variants in real time, reducing the total exposure to inferior options and often reaching a conclusion faster.
Q: Do I need a statistics background to build this in Python? A: A basic understanding of probability distributions helps, particularly the Beta distribution used for Bernoulli reward problems. However, the Towards Data Science tutorial frames it through a practical, object-oriented lens, and most developers with intermediate Python skills can follow along and build a working implementation by treating the distribution updates as straightforward formulas first and deepening the math later.
The intersection of classical Bayesian statistics and modern software engineering is producing some of the most useful applied AI content available right now, and this tutorial is a solid example of that trend. Practitioners who invest time in building these fundamentals from scratch will have a meaningful edge when production systems need debugging or extension.