Thursday, April 16, 2026 · 8 min read

GRPO explained: Group Relative Policy Optimization for LLM fine-tuning

Curated by AI Agents Daily team · Source: HN LLM

Writing for cgft.io on April 9, 2026, author gk published a detailed technical breakdown of GRPO, the reinforcement learning algorithm quietly powering frontier reasoning models like Claude Opus 4.6, GPT-5.4, and Google's Gemini thinking series. The piece digs into exactly why this method has become the go-to approach for teaching language models to handle math, code, and logic at a competitive level, and it explains the core mechanics clearly enough that both engineers and curious readers can follow along.

Why This Matters

GRPO is not a minor tweak to existing training pipelines. It cuts training compute requirements by roughly 50 percent compared to methods that rely on a separate value model, which means organizations that previously could not afford to run reinforcement learning on large models now can. DeepSeek's decision to publish GRPO-based training details with DeepSeek-R1 in January 2025 triggered a wave of adoption across the research community, and that wave is still building. If you want to understand why reasoning models have improved faster in the past 18 months than in the previous 5 years, GRPO is a large part of the answer.

The Full Story

Supervised fine-tuning has been the workhorse of LLM improvement for years. You show the model thousands of correct examples and it learns to imitate them. The problem is that imitation breaks down for tasks where the right answer is verifiable but the path to it is not obvious. Math problems, coding challenges, and formal logic puzzles all fall into this category. You do not need to demonstrate every correct reasoning chain to a model. You just need to reward it when it lands on a correct answer and let it figure out the route on its own.

That is the job reinforcement learning does for these models. The model generates a response, receives a reward signal based on whether the answer is correct, and updates its internal weights to make correct paths more likely in the future. Simple in theory, but RL training for language models runs into a fundamental problem: rewards only mean something in relation to a baseline. If every response scores a 0.9 out of 1.0, the model has no way to distinguish which responses are worth repeating. You need to tell it what "normal" looks like before it can identify what "above average" looks like.

The classic solution, used in Proximal Policy Optimization, is to train a separate value model, sometimes called a critic, that estimates how good a response should be. That critic runs alongside the main language model during training and provides the baseline. The problem is that maintaining two large models during training roughly doubles your compute and memory requirements. For models with billions of parameters, that cost is significant.

GRPO takes a different route. Instead of maintaining a separate critic model, it samples a group of G responses to the same prompt and uses the group's average score as the baseline. If you ask the model to solve a math problem and it produces four answers scoring 1.0, 0.0, 1.0, and 0.5, the group average is 0.625. Each response's advantage is simply its score minus that average. The two correct answers at 1.0 each get an advantage of +0.375, making them more likely in future training. The wrong answer at 0.0 gets an advantage of -0.625, making it less likely. The borderline response at 0.5 gets a slight negative push of -0.125. No second model required.
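The arithmetic above fits in a few lines of Python. This is an illustrative sketch, not code from the article; the optional standard-deviation normalization follows the original DeepSeekMath formulation rather than the simplified mean-only description given here.

```python
# Sketch of GRPO's group-relative advantage: each response's advantage
# is its reward minus the mean reward of its group (optionally divided
# by the group's standard deviation, as in the DeepSeekMath paper).
from statistics import mean, stdev

def group_advantages(rewards, normalize=False):
    """Return the group-relative advantage for each sampled response."""
    baseline = mean(rewards)                      # group average = baseline
    advantages = [r - baseline for r in rewards]  # no critic model needed
    if normalize:
        sd = stdev(rewards)
        if sd > 0:
            advantages = [a / sd for a in advantages]
    return advantages

# The worked example from the text: four sampled answers to one prompt.
print(group_advantages([1.0, 0.0, 1.0, 0.5]))
# -> [0.375, -0.625, 0.375, -0.125]
```

These advantages then weight the policy-gradient update: positive advantages push the model toward those responses, negative ones push it away.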

The reward function that feeds into this process is deliberately simple for reasoning tasks. Correctness is judged by checking whether the final answer matches the ground truth, which for competition math means string matching against known solutions. Format compliance is checked by rule rather than by a trained model. Code either passes a set of unit tests or it does not. This keeps the training pipeline clean and removes one of the most expensive components in standard RLHF pipelines, which is the trained reward model.
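A rule-based reward in this spirit can be sketched as follows. The `<answer>` tag format is an assumption for illustration; the article does not specify the exact format rule the pipeline checks.

```python
# Hypothetical rule-based reward: format compliance checked by rule,
# correctness checked by string match against the ground truth.
# No trained reward model involved.
import re

def reasoning_reward(response: str, ground_truth: str) -> float:
    """Score a response by rule, as described for reasoning tasks."""
    # Format rule (assumed for this sketch): the final answer must be
    # wrapped in <answer>...</answer> tags so it can be extracted.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # format violation: no parseable final answer
    answer = match.group(1).strip()
    # Correctness: exact string match against the known solution.
    return 1.0 if answer == ground_truth.strip() else 0.0

print(reasoning_reward("Reasoning... <answer>42</answer>", "42"))  # 1.0
print(reasoning_reward("The answer is 42.", "42"))                 # 0.0
```

For code tasks, the string match would be replaced by running the unit tests and returning 1.0 only if all of them pass.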

Key Details

  • Author gk published this explainer on April 9, 2026, estimating a 6-minute read time.
  • GRPO was first introduced in the DeepSeekMath paper before being applied to DeepSeek-R1, which launched publicly in January 2025.
  • The method eliminates the need for a separate value model, cutting training compute requirements compared to PPO-based approaches.
  • In the GRPO framework, a group size G determines how many responses are sampled per prompt before computing the baseline.
  • In October 2025, researchers including Tristan Li from Tencent published a training-free variant of GRPO on arXiv that eliminates parameter updates entirely for specialized domain tasks.
  • Tencent's team released source code for their training-free implementation on GitHub alongside the paper.

What's Next

The Tencent training-free GRPO variant published in October 2025 points toward a logical next frontier: applying GRPO-style relative optimization to domains where even reward signal design is expensive, such as open-ended writing or multi-step tool use. Watch for implementations that combine GRPO with external tool feedback, since the algorithm's reliance on verifiable rewards makes it a natural fit for agentic tasks where code execution or API calls provide automatic correctness signals. Educational resources from DataCamp, academic blogs, and Substack researchers have already accelerated adoption, so expect GRPO to show up in more open-source fine-tuning guides and tutorials throughout 2026.

How This Compares

PPO was the default RL algorithm for major language model training before GRPO gained prominence, and the contrast between them is stark. PPO requires a separate value model running in parallel throughout training, which is manageable when you have the infrastructure of OpenAI or Google but genuinely prohibitive for smaller labs and university research groups. GRPO's group-averaging trick achieves similar policy optimization goals without that overhead. The tradeoff is that GRPO's baseline quality depends on getting a reasonably diverse group of outputs per prompt, which means it needs a model that is already capable enough to produce varied responses. Early in training, this can be a limitation.
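The diversity requirement falls directly out of the advantage arithmetic: if every response in a group earns the same reward, every advantage is zero and that prompt contributes no learning signal. A toy illustration of this degenerate case, using the same mean-baseline sketch the article describes:

```python
# If the group's rewards are uniform (all wrong, or all right),
# subtracting the group mean zeroes out every advantage, so GRPO
# gets no gradient signal from that prompt.
from statistics import mean

def group_advantages(rewards):
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # all wrong -> [0.0, 0.0, 0.0, 0.0]
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # all right -> [0.0, 0.0, 0.0, 0.0]
```

This is why a model too weak to ever solve a prompt, or so strong it always solves it, learns nothing from that prompt under GRPO.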

Compare GRPO to Direct Preference Optimization, or DPO, which became popular in 2023 as another way to sidestep the complexity of PPO. DPO works from pre-collected preference pairs and does not require online sampling during training, which makes it simpler to implement. However, DPO does not have the model actively exploring new reasoning paths during training. GRPO generates its own training data on the fly by sampling from the current policy, which means the model is constantly testing new strategies rather than learning from a fixed dataset. For reasoning tasks specifically, that online exploration appears to matter a great deal based on DeepSeek-R1's benchmark results.

The Tencent training-free variant from October 2025 is the most interesting recent development in this space. By removing parameter updates entirely, it opens up GRPO-style relative evaluation to inference-time and prompting scenarios, not just training pipelines. That is a different class of application, closer to chain-of-thought prompting than to fine-tuning, and it suggests that the core insight behind GRPO, that group-relative scoring is a powerful signal, has value far beyond the specific training setup where it was first demonstrated.

FAQ

Q: What is GRPO and how does it differ from standard RL training? A: GRPO stands for Group Relative Policy Optimization. Standard RL training for language models typically requires a separate value model to estimate how good a response should be. GRPO replaces that separate model by sampling multiple responses to the same prompt and using their average score as the baseline, which reduces compute requirements substantially.

Q: Why do reasoning models need reinforcement learning at all? A: Supervised fine-tuning teaches a model to copy examples, which works well when you can demonstrate every correct answer. Reinforcement learning teaches a model to optimize for an outcome, which works better for tasks like math or coding where the correct answer is easy to verify but the reasoning path is hard to demonstrate in advance.

Q: Which models are actually trained using GRPO? A: DeepSeek-R1, released in January 2025, was among the first high-profile models to use GRPO-based training and attracted significant attention for its reasoning performance. The technique has since influenced training approaches at multiple organizations, with frontier models including those in the Claude, GPT, and Gemini families incorporating RL-based reasoning training.

GRPO represents a genuine and durable shift in how the industry approaches reasoning model development, not a research curiosity that will fade once the next paper drops. As more organizations publish GRPO-based implementations and the training-free variants mature, the barrier to building capable reasoning systems will continue to fall.

Our Take

This story matters because it signals a shift in how AI agents are being adopted across the industry. We are tracking this development closely and will report on follow-up impacts as they emerge.
