## Thesis
A 7B model beats much larger systems on math by pairing a web‑mined, high‑quality math corpus with a critic‑free RL algorithm (GRPO) that makes RL training cheaper and more stable.
## Spine Outline (Zettels/Maps)
- [[Reinforcement learning is supervised learning with uncertainty.md]] → unify RL framing
- [[Samples are not independent in value-based deep reinforcement learning.md]] → contrast with critic variance
- [[Samples are not identically-distributed in value-based deep reinforcement learning.md]] → on‑policy distribution shift
- [[Many forget that the Bitter Lesson includes both search AND deep learning.md]] → data/compute > clever tricks
## Figures To Draw (Excalidraw)
- Iterative corpus mining loop: seed → classify → inspect → refine → re‑mine
- GRPO training schematic: group sampling, reward, mean‑baseline, PPO‑style update
- Results at a glance: GSM8K/MATH bars vs. Minerva 540B and peers
- Data ablations: Web math vs. arXiv‑only; 40B→120B→160B tokens
## Primary Sources
- DeepSeekMath paper: https://arxiv.org/abs/2402.03300
- PPO (baseline method): https://arxiv.org/abs/1707.06347
- OpenWebMath (seed corpus): https://openreview.net/pdf?id=0Qh3fls_9b
- Minerva: https://arxiv.org/abs/2206.14858
- GSM8K: https://arxiv.org/abs/2110.14168
- MATH: https://arxiv.org/abs/2103.03874
## Summary
DeepSeekMath, released in February 2024, demonstrated that a carefully crafted 7B-parameter model could surpass models 77× larger on mathematical reasoning tasks. The paper introduced two fundamental innovations that redefined the state of the art in mathematical language modeling: a scalable pipeline for mining high-quality mathematical content from the web, and Group Relative Policy Optimization (GRPO), a novel reinforcement learning algorithm that eliminates the need for value networks while maintaining strong performance.
The core achievements of DeepSeekMath can be summarized as follows:
1. **Mathematical Corpus Construction:** A 120B-token mathematical corpus from Common Crawl
DeepSeek developed an iterative, self-improving pipeline to extract mathematical content from the web. Starting with OpenWebMath as a seed dataset, they trained fastText classifiers to identify math-like pages, discovered math-heavy domains, and iteratively refined their extraction process. The final corpus contained 35.5M pages totaling 120B tokens of high-quality mathematical text, carefully decontaminated to prevent evaluation set leakage.
2. **Reinforcement Learning Innovation:** Group Relative Policy Optimization (GRPO)
Traditional policy gradient methods like PPO require a separate value network to estimate advantages, adding memory and complexity. GRPO replaces the value network with a simple, within‑group baseline: sample multiple answers for the same question and center each reward by the group mean (“grading on a curve”). This removes the critic while preserving stability via a KL term to a reference policy. Intuitively: the policy learns to make answers that beat its own current average without needing a learned critic.
The combination of these innovations enabled DeepSeekMath-Base 7B to achieve 64.2% on GSM8K and 36.2% on MATH benchmarks using chain-of-thought prompting, surpassing Minerva-540B despite being orders of magnitude smaller. After GRPO training, DeepSeekMath-RL 7B reached 88.2% on GSM8K and 51.7% on MATH, with further improvements to 58.8% MATH when using tools and 60.9% MATH with self-consistency voting.
## The Math Corpus Mining Challenge
Building a high-quality mathematical corpus from web data presents unique challenges. Mathematical content on the web varies widely in quality, format, and difficulty level. Furthermore, distinguishing genuine mathematical content from pages that merely contain numbers or simple arithmetic requires sophisticated filtering mechanisms.
DeepSeek's solution was to develop an iterative, bootstrapping approach that progressively improved both the corpus quality and the extraction pipeline itself. The process began with OpenWebMath, a curated 14.7B token dataset, as the seed for training an initial fastText classifier. This classifier was then applied to Common Crawl to recall potentially mathematical pages.
The key insight was to treat corpus construction as an iterative optimization problem. After each round of extraction, DeepSeek would:
1. Rank pages by classifier confidence scores
2. Manually inspect high-scoring domains to identify math-heavy websites
3. Annotate URL patterns that consistently contained mathematical content
4. Retrain the classifier with the expanded positive examples
5. Re-extract from Common Crawl with the improved classifier
This process continued until marginal improvements became negligible, ultimately yielding 120B tokens from 35.5M web pages. Critically, the team performed extensive decontamination, removing any pages with n-gram overlap with evaluation benchmarks like GSM8K, MATH, and CMATH to ensure fair evaluation.
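As a concrete illustration, here is a minimal sketch of one round of this mining loop using the fastText library. The file names, label scheme, threshold, and in-memory page list are assumptions for illustration; DeepSeek's actual pipeline ran at Common Crawl scale with domain inspection and URL-pattern rules between rounds.

```python
# One round of the iterative mining loop: train a fastText classifier on the
# current labeled seeds, then score candidate pages and keep the confident hits.
import fasttext

def mine_iteration(seed_labels_path, candidate_pages, threshold=0.9):
    """seed_labels_path: fastText-format file with lines like '__label__math <text>'.
    candidate_pages: list of {"url": ..., "text": ...} dicts (hypothetical structure)."""
    model = fasttext.train_supervised(input=seed_labels_path, lr=0.1, epoch=5, wordNgrams=2)

    kept = []
    for page in candidate_pages:
        text = page["text"].replace("\n", " ")      # fastText expects single-line input
        labels, probs = model.predict(text)
        if labels[0] == "__label__math" and probs[0] >= threshold:
            kept.append((probs[0], page))
    return model, sorted(kept, reverse=True, key=lambda x: x[0])

# Between rounds: inspect the top-scoring domains, add newly confirmed math pages
# (and URL-pattern matches) to the seed labels, retrain, and re-extract.
```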
The resulting corpus had several notable characteristics that distinguished it from prior mathematical datasets. Unlike arXiv-heavy corpora that focus on formal mathematical exposition, DeepSeek's web-mined data contained a rich mixture of problem-solving content, step-by-step solutions, and educational materials. Ablation studies revealed this diversity was crucial: models trained on arXiv-only data actually performed worse on reasoning benchmarks than those trained on the web-mined corpus.
## Mathematical Pre-Training Strategy
With the corpus in hand, DeepSeek made a counterintuitive choice for their base model: instead of starting from a general-purpose language model, they initialized from DeepSeek-Coder-Base-v1.5 7B, a code-specialized model. Empirically, code pre‑training helps math because:
1. **Structured reasoning patterns:** Code naturally enforces logical flow and step-by-step execution
2. **Tool-use capabilities:** Programming experience translates directly to using computational tools for math
3. **Symbolic manipulation:** Both code and mathematics use constrained symbol systems
### GRPO, briefly (intuition first)
GRPO is PPO without a value model. For a group of $K$ samples $\{o_i\}$ for the same question $q$ with rewards $r_i$, compute advantages by subtracting the group mean: $A_i = r_i - \frac{1}{K} \sum_j r_j$. Optimize a clipped policy gradient with a small KL penalty to a reference model for stability. This “relative within‑batch” baseline reduces memory and simplifies training while preserving the signal that matters: which answers are better than the model’s current average.
The continued pre-training process treated the 120B mathematical tokens as an extension of the model's training, using standard autoregressive language modeling objectives. The team experimented with different corpus sizes (40B, 80B, 120B, and 160B tokens) and found that performance scaled smoothly with data quantity up to 120B tokens, with diminishing returns beyond that point.
Importantly, the pre-training maintained a careful balance between preserving the model's existing capabilities and acquiring new mathematical knowledge. The learning rate schedule and training duration were calibrated to prevent catastrophic forgetting of the model's coding abilities while maximizing mathematical reasoning improvements.
## Supervised Fine-Tuning for Mathematical Instruction Following
Following pre-training, DeepSeekMath underwent supervised fine-tuning to align the model with user instructions and enable different reasoning formats. The instruction tuning phase incorporated three distinct types of mathematical reasoning data:
1. **Chain-of-Thought (CoT):** Step-by-step verbal reasoning leading to solutions
2. **Program-of-Thought (PoT):** Solutions expressed as executable code
3. **Tool-Integrated Reasoning:** Combinations of natural language and computational tool calls
The fine-tuning used a maximum context length of 4K tokens and ran for 500 steps with a batch size of 256 and a learning rate of 5e-5. This relatively short fine-tuning phase was sufficient to unlock the model's latent mathematical capabilities acquired during pre-training.
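As one plausible way to realize these settings (the paper reports the hyperparameters but not the training framework), here is an illustrative Hugging Face `TrainingArguments` sketch; the per-device batch size, accumulation, GPU count, scheduler, and precision are assumptions, not values from the paper.

```python
# Illustrative mapping of the reported SFT hyperparameters onto TrainingArguments.
from transformers import TrainingArguments

sft_args = TrainingArguments(
    output_dir="deepseekmath-sft",      # hypothetical output path
    max_steps=500,                      # 500 optimizer steps (reported)
    learning_rate=5e-5,                 # reported
    per_device_train_batch_size=8,      # assumed; with 8 GPUs and accumulation of 4
    gradient_accumulation_steps=4,      # this gives a global batch size of 256
    lr_scheduler_type="cosine",         # assumption: schedule not specified here
    bf16=True,                          # assumption: mixed precision
    logging_steps=10,
)
# Sequences would be packed or truncated to the 4K-token context at tokenization time.
```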
DeepSeek formalized their approach within a unified policy gradient framework that encompasses SFT, RFT, DPO, PPO, and GRPO as special cases. In this view, each method differs only in three dimensions: the data source (human demonstrations vs. model samples), the reward source (implicit vs. explicit), and the specific gradient coefficient used for updates. This theoretical framework helped guide their choice of training strategies at each stage.
## Group Relative Policy Optimization (GRPO)
The most significant algorithmic contribution of DeepSeekMath is Group Relative Policy Optimization, a reinforcement learning method that achieves the benefits of policy gradient training without the computational overhead of value networks. Traditional methods like PPO require maintaining and training a separate critic network to estimate state values, which doubles memory requirements and adds training instability.
GRPO's key innovation is to use relative comparisons within a group of samples as the advantage estimate. For each question in the training set, GRPO:
1. Samples multiple candidate answers (typically 64) from the current policy
2. Computes rewards for each answer using an outcome-based verifier
3. Calculates advantages as the difference between each answer's reward and the group mean
4. Updates the policy using these relative advantages with PPO-style clipped objectives
Mathematically, the GRPO objective to be maximized pairs a clipped surrogate with a KL penalty:
$\begin{align}
\mathcal{J}_{\text{GRPO}} = \mathcal{L}_{\text{clip}} - \beta \cdot \mathcal{D}_{\text{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)
\end{align}$
#### Sidebar — PPO’s value model: why it exists and what GRPO replaces
Start here (no PPO background required). Think of RL for LLMs as “nudging the model to produce answers that score higher on a task-specific grader.” The policy is your LM $\pi_\theta(y\,|\,x)$; you sample an answer $y$ (a token sequence) for a question/prompt $x$, get a scalar reward $r(y, x)$ (e.g., 1 if the math answer is correct, 0 otherwise), and push the model to make high-reward answers more likely.
Policy gradient in one line. The simplest objective is $J(\theta)=\mathbb{E}_{y\sim\pi_\theta(\cdot|x)}[r(y,x)]$. The REINFORCE gradient is:
$\begin{align}
\nabla_\theta J(\theta) \,=\, \mathbb{E}\big[\nabla_\theta \log \pi_\theta(y\,|\,x)\; r(y,x)\big]\;.
\end{align}$
Why subtract a baseline? That raw gradient is very noisy. We can subtract any baseline $b(x)$ that does not depend on the sampled answer $y$ without changing the expectation:
$\begin{align}
\mathbb{E}\big[\nabla_\theta \log \pi_\theta(y|x)\,b(x)\big]
&= b(x)\, \nabla_\theta \sum_y \pi_\theta(y|x)
= b(x)\, \nabla_\theta 1
= 0\;.
\end{align}$
So we use advantages $A = r - b$ in place of $r$ to reduce variance while keeping the gradient unbiased. The art is choosing a good baseline.
From sequences to per-token credit. In LLMs, $y = (y_1,\dots,y_T)$ and log-prob decomposes: $\log \pi_\theta(y|x)=\sum_t \log \pi_\theta(y_t\,|\,x, y_{<t})$. The policy gradient distributes “credit” across tokens:
$\begin{align}
\nabla_\theta J \propto \sum_t \mathbb{E}\big[\nabla_\theta \log \pi_\theta(y_t|x,y_{<t})\, A_t\big]\;.
\end{align}$
PPO’s value model (critic). PPO picks a state-dependent baseline $b(s_t)\approx V_\phi(s_t)=\mathbb{E}[\text{return}|s_t]$, where $s_t=(x,y_{<t})$ is the decoder state at token $t$. It then uses Generalized Advantage Estimation (GAE) to smoothly attribute terminal reward back through time, balancing bias/variance:
$\begin{align}
\delta_t &= r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\\
A_t^{\text{GAE}(\lambda)} &= \sum_{\ell=0}^{T-t-1} (\gamma\lambda)^{\ell}\, \delta_{t+\ell}\;.
\end{align}$
Intuition: $V(s_t)$ predicts how good things look from token $t$ onward. If the observed return is better than expected, $\delta_t>0$ and earlier tokens get positive credit; worse than expected yields negative credit. This reduces variance and gives finer temporal credit than a single terminal scalar.
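A minimal NumPy sketch of this GAE recursion; the function name, shapes, and example numbers are mine, not the paper's.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """GAE for a single sampled sequence.

    rewards: per-token rewards (with an outcome-only reward: zeros except the last token).
    values:  critic estimates V(s_t) for t = 0..T (one extra bootstrap value at the end).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam * gae                         # discounted sum of deltas
        advantages[t] = gae
    return advantages

# Example: a 5-token answer graded 1.0 at the end, with a made-up critic.
adv = gae_advantages(rewards=[0, 0, 0, 0, 1.0],
                     values=[0.4, 0.45, 0.5, 0.6, 0.7, 0.0])
```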
PPO’s stabilized update. PPO trusts the old policy $\pi_{\theta_{old}}$ and takes conservative steps using a probability ratio and clipping (plus often a KL penalty to a reference policy for extra stability):
$\begin{align}
r_t(\theta) &= \frac{\pi_\theta(y_t|s_t)}{\pi_{\theta_{old}}(y_t|s_t)}\\
\mathcal{L}_{\text{clip}} &= \mathbb{E}_t\big[\min\big( r_t(\theta) A_t,\; \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\big)\big]\;.
\end{align}$
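In code, the clipped surrogate is a few lines of PyTorch, negated so a standard optimizer can minimize it; tensor names and shapes are assumptions.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate from the equations above, returned as a loss to minimize.

    logp_new, logp_old: per-token log-probs of the sampled tokens under the current
    and behavior policies; advantages: per-token A_t, detached from the graph.
    """
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```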
**What the value model buys you**
- Per‑token credit shaping: $A_t$ assigns nuanced credit/blame across the answer.
- Variance reduction and sample efficiency: $V(s)$ soaks up predictable return structure.
- Cross‑prompt calibration: $V(s)$ adapts to varying difficulty/reward scales across questions.
**What it costs (and where it fails)**
- Extra compute/memory: value head parameters, activations, optimizer state, and a separate value loss.
- Tuning complexity: value loss weight, value clipping, $\gamma$, $\lambda$, advantage normalization.
- Critic lag/overfit: if $V$ is wrong, $A_t$ is noisy or biased, destabilizing learning.
GRPO: a nonparametric baseline at the prompt level. Instead of learning $V(s)$, GRPO estimates a baseline from multiple answers to the same question. For prompt $x$, sample $K$ completions $\{y^{(i)}\}_{i=1}^K$, compute rewards $r^{(i)}$, and center them:
$\begin{align}
A^{(i)} \;=\; r^{(i)} - \underbrace{\frac{1}{K}\sum_{j=1}^K r^{(j)}}_{\text{group mean}}\;.
\end{align}$
Then apply a PPO‑style clipped objective to each token of completion $i$, broadcasting the same $A^{(i)}$ to all its tokens, and keep a small KL to a reference policy for stability. Intuition: “grade on a curve” within a prompt—reward only how much better a sample is than its peers, not its absolute score.
**Why this is a natural evolution**
- Same unbiased trick, different baseline: You’re still subtracting a baseline that doesn’t depend on the sampled action sequence’s tokens; it just comes from a mini‑population estimate for the prompt.
- Strong per‑prompt calibration: The group mean approximates $\mathbb{E}[r\,|\,x]$. As $K$ grows, the baseline noise shrinks (variance of the mean $\sim \sigma_r^2/K$).
- Fewer moving parts: No critic to train/tune → lower memory, fewer instabilities, higher throughput.
**Trade‑offs vs. a value model**
- Temporal precision: GRPO uses one scalar advantage per completion (all tokens share it). PPO’s $A_t$ shapes credit across time. If dense/process rewards matter, PPO’s critic can be superior.
- Sampling budget: GRPO needs $K>1$ samples per prompt. With tiny $K$, the baseline is noisy and variance rises.
- Outcome dependence: Works best when you have a reliable outcome grader (math correctness, program tests). With weak/noisy rewards, a critic can help denoise.
**Worked example (math correctness)**
- Suppose $K=4$ completions for the same problem with rewards $r=[1, 0, 0.5, 0]$. The mean is $\bar r=0.375$.
- Advantages: $A=[+0.625, -0.375, +0.125, -0.375]$.
- Gradient effect: tokens in the correct answer get uniformly positive push; the mediocre one gets a small push; the incorrect ones get pulled back. No critic was needed to estimate $V(s_t)$.
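A two-line NumPy check of the arithmetic above (purely illustrative):

```python
import numpy as np

r = np.array([1.0, 0.0, 0.5, 0.0])   # rewards for K = 4 completions
A = r - r.mean()                      # group-relative advantages
print(r.mean(), A)                    # 0.375  [ 0.625 -0.375  0.125 -0.375]
```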
**Variance intuition with K**
- If rewards for a prompt have variance $\sigma_r^2$, then $\operatorname{Var}(\bar r) = \sigma_r^2/K$, and for independent rewards the centered values $r_i-\bar r$ have variance exactly $\sigma_r^2(1-1/K)$, so larger $K$ reduces baseline noise and stabilizes updates.
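For independent rewards with common variance $\sigma_r^2$, the factor follows from a short calculation:
$\begin{align}
\operatorname{Var}\big(r_i - \bar r\big)
&= \operatorname{Var}\Big(\big(1-\tfrac{1}{K}\big) r_i - \tfrac{1}{K}\textstyle\sum_{j\neq i} r_j\Big)\\
&= \big(1-\tfrac{1}{K}\big)^2 \sigma_r^2 + \tfrac{K-1}{K^2}\,\sigma_r^2
= \sigma_r^2\big(1-\tfrac{1}{K}\big)\;.
\end{align}$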
**Pseudocode (LM framing)**
PPO with value model (GAE):
1) Collect on‑policy data with $\pi_{\theta_{old}}$: tokens, logprobs, values $V_\phi(s_t)$, and rewards.
2) Compute $\delta_t$ and $A_t=\text{GAE}(\delta_t,\gamma,\lambda)$.
3) Optimize: maximize $\mathcal{L}_{\text{clip}}(A_t)$ while minimizing a value loss (e.g., clipped MSE to returns), with an optional KL to a reference policy.
GRPO (critic‑free):
1) For each prompt $x$, sample $K$ completions $\{y^{(i)}\}$ from $\pi_{\theta_{old}}$.
2) Compute rewards $r^{(i)}$ via an outcome verifier; set $A^{(i)}=r^{(i)}-\frac{1}{K}\sum_j r^{(j)}$.
3) For each completion, apply PPO‑style clipped loss using token‑wise ratios but a single $A^{(i)}$; add a small KL to a reference policy.
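Putting the GRPO recipe into code, here is a compact PyTorch sketch of one loss computation under assumed interfaces: a `policy(token_ids)` call that returns per-token log-probs of those tokens, precomputed behavior log-probs, and a 0/1 completion mask. This is an illustration, not the paper's implementation.

```python
import torch

def grpo_loss(policy, ref_policy, token_ids, masks, old_logps, rewards,
              group_size, eps=0.2, beta=0.04):
    """token_ids / masks / old_logps: [B*K, T]; rewards: [B*K], K = group_size."""
    # 1) Group-relative advantages: center each reward by its group's mean.
    #    (The paper additionally divides by the group's std; omitted here for brevity.)
    r = rewards.view(-1, group_size)
    adv = (r - r.mean(dim=1, keepdim=True)).view(-1, 1)   # one scalar per completion

    # 2) Token-wise ratios against the behavior policy that generated the samples.
    new_logps = policy(token_ids)                          # [B*K, T] log-probs
    ratio = torch.exp(new_logps - old_logps)

    # 3) PPO-style clipped surrogate, broadcasting the per-completion advantage
    #    to all of its tokens and masking out prompt/padding positions.
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    pg_loss = -(surrogate * masks).sum() / masks.sum()

    # 4) Penalty toward a frozen reference policy (a simple per-token KL estimate).
    with torch.no_grad():
        ref_logps = ref_policy(token_ids)
    kl = ((new_logps - ref_logps) * masks).sum() / masks.sum()

    return pg_loss + beta * kl
```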
**Practical notes for LLMs**
- Length effects: Outcome rewards can favor short/long answers. Consider normalization (e.g., per‑token scaling) if you observe pathological length drift.
- KL control: Keep the KL coefficient small but nonzero to avoid reward hacking and preserve language quality.
- Mixed training: You can mix GRPO with a small amount of critic‑based updates (or process rewards) when temporal shaping is important.
**When to choose what**
- Prefer GRPO if: (a) you have a reliable outcome grader; (b) memory/throughput constrain you; (c) you can afford $K$ samples per prompt.
- Prefer PPO+value if: (a) rewards are dense/process‑based; (b) you need token‑level temporal credit; (c) sampling budget per prompt is small.
Returning to the GRPO objective stated before the sidebar:
- $\mathcal{L}_{\text{clip}}$ is the clipped policy gradient surrogate computed with group-relative advantages
- $\mathcal{D}_{\text{KL}}$ is the KL divergence from a frozen reference policy, weighted by $\beta$ for stability
The group-relative advantage computation is elegantly simple:
$\begin{align}
A_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j
\end{align}$
Where $r_i$ is the reward for sample $i$ and $G$ is the group size; the paper additionally divides the centered reward by the group's standard deviation to normalize scale. This "grading on a curve" approach naturally centers advantages and provides a stable baseline without requiring value function approximation.
## Training Protocol and Results
DeepSeek's RL training focused exclusively on GSM8K and MATH problems (approximately 144K questions total) to test generalization to unseen domains. The training protocol used:
- 64 samples per question for advantage estimation
- Policy learning rate of 1e-6
- KL coefficient of 0.04
- Single policy update per exploration round
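Collected in one place, these settings might look like the following config sketch; the dictionary keys are illustrative, not an API from the paper or any specific library.

```python
# Reported GRPO training settings, gathered into a single (hypothetical) config dict.
grpo_config = {
    "group_size": 64,                 # samples per question for the group baseline
    "policy_lr": 1e-6,                # policy learning rate
    "kl_coeff": 0.04,                 # beta on the KL term to the reference policy
    "updates_per_rollout": 1,         # single policy update per exploration round
    "train_sets": ["GSM8K", "MATH"],  # ~144K questions total
}
```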
The results demonstrated remarkable efficiency and generalization:
**Base Model Performance (No RL):**
- GSM8K: 64.2% (CoT)
- MATH: 36.2% (CoT)
- Outperformed Minerva-540B despite being 77× smaller
**After GRPO Training:**
- GSM8K: 88.2% (CoT), a 24-point improvement over the base model
- MATH: 51.7% (CoT), a 15.5-point improvement over the base model
- MATH: 58.8% (with tools)
- MATH: 60.9% (self-consistency with 64 samples)
Crucially, the improvements generalized to out-of-domain benchmarks:
- CMATH: 84.6% → 88.8%
- Strong performance on Hungarian math competitions
- Maintained capabilities on general reasoning tasks
## Analysis of What Matters for Mathematical Reasoning
Through extensive ablations, DeepSeek identified several critical factors for mathematical language model success:
**1. Code Pre-training is Essential**
Models initialized from code-trained checkpoints consistently outperformed those starting from general language models. The benefits extended beyond tool use: even pure chain-of-thought reasoning improved with code pre-training, suggesting that programming experience provides valuable structured thinking patterns.
**2. Web Math Beats Formal Math**
Contrary to intuition, training on arXiv papers alone produced worse reasoning performance than web-scraped mathematical content. The informal, problem-solving oriented nature of web mathematics better matches the format of reasoning benchmarks and real-world applications.
**3. Data Quality Trumps Quantity After a Threshold**
While scaling from 40B to 120B tokens provided consistent improvements, gains beyond 120B were marginal. This suggests that with sufficient quality filtering, relatively modest corpus sizes can achieve strong mathematical reasoning.
**4. Iterative Mining Beats Static Filtering**
The iterative corpus construction process, where each round improved both the data and the extraction pipeline, proved more effective than one-shot filtering approaches. This bootstrapping strategy could generalize to other specialized domains.
## Implications and Future Directions
DeepSeekMath's success challenges several assumptions in the field. First, it demonstrates that specialized 7B models can match or exceed much larger general-purpose models on complex reasoning tasks when given appropriate data and training. Second, it shows that eliminating architectural components (like value networks in RL) through algorithmic innovation can yield both efficiency and performance gains.
GRPO, in particular, opens a path to more accessible RLHF: no critic means lower memory, fewer moving parts, and fewer instabilities. On fixed hardware, that either fits larger policies or more samples per update.
Looking forward, several directions emerge:
1. **Domain-Specific Corpus Mining:** The iterative extraction pipeline could be adapted for other specialized domains like science, medicine, or law
2. **Process Supervision at Scale:** While DeepSeekMath used outcome-based rewards, incorporating process supervision could further improve reasoning quality
3. **Multi-Domain GRPO:** Extending GRPO to handle diverse reward signals could enable training generalist models with specialized capabilities
DeepSeekMath ultimately shows that careful data curation + simple, robust RL can make small models punch above their weight. For practitioners, the playbook is clear: build the right corpus, keep the algorithm simple, and let iterative evaluation guide the next improvement.
## Caveats and Open Questions
- Reward design: outcome‑only vs process supervision trade‑offs in math
- Generalization: does GRPO’s advantage hold beyond math domains?
- Data costs: human inspection in the loop and domain heuristics
- Decontamination: robust leakage checks as corpora scale
- Hyperparameters: group size, KL strength, sampling budgets under fixed compute