GRPO is the reinforcement learning method behind DeepSeek-R1 and a wave of reasoning models. Its trick: throw away PPO’s value network and score each response against a group of samples for the same prompt. Here is how it works, how it compares to PPO, and where it runs out of road.
GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm for training large language models. It was introduced by DeepSeek in the DeepSeekMath paper (2024) and brought to wide attention by DeepSeek-R1 (January 2025), which used it to train a frontier reasoning model largely through RL.
The core idea is simple. GRPO is closely related to PPO (Proximal Policy Optimization), but it removes PPO’s separately trained value model, or critic. Instead of asking a learned network to estimate how good a response is, GRPO samples a group of responses to the same prompt, scores each one, and judges every response relative to the group average. A response that beats the group mean is reinforced; one that lags behind is discouraged.
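In symbols, the comparison is a within-group standardization of rewards. The notation here is introduced for illustration: $G$ is the number of responses sampled for one prompt, $r_i$ is the scalar reward of response $i$, and $\hat{A}_i$ is its advantage. In the DeepSeekMath formulation, each reward is centered on the group mean and divided by the group standard deviation:

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}, \qquad i = 1, \dots, G
```

A positive $\hat{A}_i$ means the response beat its group and is reinforced; a negative one means it lagged and is discouraged.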
2024: Introduced in the DeepSeekMath paper.
2025: Scaled by DeepSeek-R1, a watershed for RL reasoning.
Group-relative: advantage computed from a group of samples, not a learned value function.
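For each training prompt, GRPO runs four steps:
1. Sample a group of responses to the prompt from the current policy.
2. Score each response with the reward function.
3. Compute each response’s advantage relative to the group average.
4. Update the policy to reinforce responses that beat the group mean and discourage those that fall below it.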
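As a minimal sketch of one pass on toy data, assume the steps are: sample a group, score it, compare each response to the group, and update the policy. The canned answers, the log-probabilities, and the helper names `group_relative_advantages` and `clipped_surrogate` are all hypothetical. The update is shown only as the value of a PPO-style clipped objective at the sequence level; a real trainer works per token, backpropagates through the policy, and adds a KL penalty against the reference model, all of which are omitted here.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one response (sequence-level for brevity)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Step 1: sample a group of responses (hypothetical answers to "2 + 2 = ?").
group = ["4", "5", "4", "22"]

# Step 2: score each response with a verifiable reward (exact match).
rewards = [1.0 if answer == "4" else 0.0 for answer in group]

# Step 3: judge each response relative to the group.
advantages = group_relative_advantages(rewards)

# Step 4: evaluate the objective a trainer would maximize.
# Log-probabilities are made-up numbers standing in for policy outputs.
logp_old = [-1.0, -1.2, -0.9, -2.0]
logp_new = [-0.9, -1.3, -0.8, -2.1]
objective = sum(
    clipped_surrogate(new, old, adv)
    for new, old, adv in zip(logp_new, logp_old, advantages)
) / len(group)

print(advantages)
print(objective)
```

Note that no value network appears anywhere: the baseline is just the group mean, which is why the cost shifts from training a critic to sampling several responses per prompt.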
How GRPO differs from the PPO setup it is based on.
| Dimension | PPO | GRPO |
|---|---|---|
| Advantage estimate | Learned value model (critic) | Group-relative: reward normalized within a group of samples |
| Models in memory | Policy, reference, value model | Policy and reference only |
| Compute / memory | Higher (extra critic to train) | Lower (no critic) |
| Best fit | General RL, dense or shaped rewards | Verifiable rewards, reasoning, math, code |
| Main cost | Training and tuning the critic | Sampling a group per prompt |
GRPO made large-scale RL for reasoning practical. By dropping the critic, it removed a major source of instability and cost in RLHF-style pipelines, and it pairs naturally with verifiable rewards, where correctness is cheap to check and hard to game. DeepSeek-R1 showed that this recipe could push reasoning performance to the frontier, and GRPO has since become a default starting point for teams training reasoning and agentic models.
GRPO works beautifully when a task is a single response with one verifiable outcome. It struggles when a task is a long sequence of actions, which is exactly the setting of modern AI agents.
The reason is the credit assignment problem. Standard RL, GRPO included, typically assigns one reward at the end of a long task. If an agent takes 50 steps and fails, every step receives the same negative signal, even the steps that were correct. The model cannot tell which action actually caused the failure, so the gradient is noisy and training on long, multi-step tasks becomes unstable, often converging slowly or not at all without extra structure.
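A toy illustration of that failure mode, under assumed numbers: a hypothetical 50-step episode in which exactly one step (index 37, chosen arbitrarily) was a mistake. With an outcome-only reward, the single end-of-task signal is copied to every step, so the 49 correct steps are penalized exactly as much as the faulty one.

```python
def broadcast_outcome_reward(num_steps, outcome_reward):
    """Outcome-only credit: every step inherits the episode's single reward."""
    return [outcome_reward] * num_steps

# Hypothetical episode: 50 steps, one wrong action, task fails overall.
num_steps = 50
bad_step = 37
step_was_correct = [i != bad_step for i in range(num_steps)]

per_step_signal = broadcast_outcome_reward(num_steps, outcome_reward=-1.0)

# Count the correct steps that still received a negative signal.
penalized_correct = sum(
    1 for ok, signal in zip(step_was_correct, per_step_signal) if ok and signal < 0
)

print(penalized_correct)          # 49
print(len(set(per_step_signal)))  # 1: the signal cannot distinguish steps
```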
The analogy: imagine running a six-month project, and at the end your manager says only “the project failed,” with no check-ins along the way. You would have no idea which decision to fix. That is standard RL on a long task, and it is the gap newer methods aim to close.
A line of recent research extends GRPO’s group-relative idea with finer-grained credit, so that an agent earns reward at intermediate milestones rather than only at the end. The shared principle is to replace one global guess with many local signals.
One notable example is GiGPO (Group-in-Group Policy Optimization, NeurIPS 2025), which keeps GRPO’s critic-free design but adds a second, inner level of comparison: an outer group compares whole episodes, while an inner group compares actions taken at the same task checkpoint. That yields step-level credit without a value model or extra rollouts, and the paper reports gains of more than 12% over GRPO on ALFWorld and more than 9% on WebShop at the same memory cost. Other approaches partition a trajectory into milestone-bounded segments with a local baseline per segment, or use a hierarchical planner and executor that each train on their own reward.
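The two-level comparison can be illustrated loosely as follows. This is a sketch of the general idea, not the paper’s implementation: the function `two_level_advantages`, the `step_weight` parameter, the per-step scores, the checkpoint keys, and the use of plain mean-centering without normalization or discounting are all assumptions made for the example.

```python
from collections import defaultdict

def mean_centered(values):
    """Subtract the group mean from each value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def two_level_advantages(episodes, step_weight=1.0):
    """Combine an episode-level and a step-level group comparison.

    episodes: list of dicts with
      'return': float, the outcome of the whole episode
      'steps':  list of (checkpoint_key, step_score) pairs
    step_weight is a hypothetical mixing coefficient.
    """
    # Outer group: compare whole episodes against each other.
    episode_adv = mean_centered([ep["return"] for ep in episodes])

    # Inner groups: bucket steps taken at the same checkpoint across episodes.
    buckets = defaultdict(list)
    for e, ep in enumerate(episodes):
        for t, (key, score) in enumerate(ep["steps"]):
            buckets[key].append((e, t, score))

    step_adv = {}
    for members in buckets.values():
        centered = mean_centered([score for _, _, score in members])
        for (e, t, _), adv in zip(members, centered):
            step_adv[(e, t)] = adv

    return [
        [episode_adv[e] + step_weight * step_adv[(e, t)] for t in range(len(ep["steps"]))]
        for e, ep in enumerate(episodes)
    ]

# Two hypothetical episodes that pass through the same two checkpoints.
episodes = [
    {"return": 1.0, "steps": [("start", 0.5), ("at_door", 1.0)]},
    {"return": 0.0, "steps": [("start", 0.5), ("at_door", 0.0)]},
]
print(two_level_advantages(episodes))  # [[0.5, 1.0], [-0.5, -1.0]]
```

In the failed episode, the step at `at_door` receives more blame than the step at `start`, because only at `at_door` did its action score worse than the peer action taken from the same checkpoint. An outcome-only baseline would have given both steps the same value.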
The takeaway for anyone training agents: the algorithm is only half the story. Milestone-based methods need training data and environments that actually expose intermediate structure: per-phase reward signals, milestone boundaries, and partial-progress credit. The structure of the data increasingly matters as much as its quantity.
Snorkel AI builds the expert data and environments that make advanced RL methods work, including the milestone-structured environments and per-phase evaluation that GRPO’s successors depend on. That is the same focus reflected in Snorkel’s work on agent evaluation, from Agents’ Last Exam and JudgmentBench to Terminal-Bench Science.