onNovember 27, 2025

Why Direct Preference Optimization (DPO) Is Quietly Killing RLHF—And What DeepSeek R1 Just Proved

11 min read

The alignment technique behind ChatGPT is being replaced, and most ML teams haven’t noticed. DeepSeek R1 just dropped the receipts.

The Silent Coup in AI Alignment

There’s a moment in every technological shift when the old guard realizes the ground has moved beneath their feet. For Reinforcement Learning from Human Feedback—RLHF—that moment arrived in January 2025, wearing Chinese colors and carrying benchmark scores that made OpenAI’s engineering teams take a very long pause.

DeepSeek R1 didn’t just match GPT-4 class performance. It did so while bypassing the entire RLHF apparatus that Western AI labs have spent years perfecting. No separate reward model. No complex three-stage training pipeline. No army of human annotators continuously feeding the preference machine.

The result? A model achieving 90.8% on MMLU and an extraordinary 71.0% on GPQA Diamond—the latter representing an improvement from 15.6% to 71.0% using pure reinforcement learning approaches. That’s not an incremental gain. That’s a paradigm validation.

The question isn’t whether simpler alignment methods can compete with RLHF anymore. DeepSeek R1 answered that definitively. The question is why so many teams are still building the complex way.

Understanding the Technical Stakes

Before we dissect what DeepSeek accomplished and why it matters for your ML infrastructure decisions, let’s establish what we’re actually comparing. Because the differences between these approaches aren’t just academic—they translate directly into engineering hours, compute costs, and time-to-deployment.

The RLHF Pipeline: Elegant But Expensive

Traditional RLHF, as documented in recent analyses of post-training optimization alternatives, requires three distinct phases:

Supervised Fine-Tuning (SFT): You take your pretrained model and fine-tune it on high-quality demonstrations of the behavior you want.
Reward Model Training: You collect human preference data—which response is better?—and train a separate model to predict human preferences.
Policy Optimization: You use reinforcement learning (typically PPO—Proximal Policy Optimization) to fine-tune your language model against the reward model’s judgments.

This is the architecture that powered ChatGPT’s remarkable ability to be helpful, harmless, and honest. It works. It works extremely well. But it also requires:

Continuous human annotation at scale
Training and maintaining a separate reward model
Managing the instabilities inherent in RL optimization
Balancing exploration-exploitation tradeoffs in policy training
Preventing reward hacking—where the model learns to game the reward signal rather than actually being helpful

The engineering overhead is substantial. The annotation costs are ongoing. And the failure modes can be subtle and dangerous.

Direct Preference Optimization: Collapsing the Stack

DPO takes a fundamentally different approach. As Nathan Lambert’s analysis of the 2025 post-training landscape makes clear, DPO treats alignment as a supervised learning problem rather than a reinforcement learning one.

The insight is elegant: instead of training a reward model and then using RL to optimize against it, you can derive a closed-form solution that lets you directly optimize the language model on preference data. You’re effectively collapsing two stages of the pipeline into one.

Aspect	RLHF	DPO
Training Stages	Three (SFT → Reward Model → RL)	Two (SFT → Direct Preference Training)
Reward Model Required	Yes (separate model)	No (implicit in objective)
Optimization Stability	Complex (PPO hyperparameters)	Simpler (supervised learning)
Training Time	Baseline	40-60% reduction
Implementation Complexity	High	Moderate

This isn’t just theoretical elegance. It translates to 40-60% reduction in training complexity and annotation costs. When you’re operating at the scale of frontier model development, those percentages represent millions of dollars and months of engineering time.

DeepSeek R1: The Existence Proof

Now, let’s talk about what actually happened in Hangzhou.

DeepSeek, a Chinese AI lab that’s been quietly building impressive infrastructure, released R1 in January 2025. The technical report—publicly available on arXiv—is one of the most transparent accounts of frontier model training we’ve seen from any major lab.

What they revealed upends several assumptions the field has been operating under.

Pure RL, No Traditional RLHF

DeepSeek R1 doesn’t use the RLHF pipeline that OpenAI popularized. Instead, it uses Group Relative Policy Optimization (GRPO), a technique that eliminates the need for a separate reward model entirely.

Andrew Ng’s analysis in The Batch breaks down the technical innovation: GRPO samples multiple responses for each prompt, scores them relative to each other, and uses these relative rankings to update the policy. You’re not training a reward model to predict absolute quality—you’re using relative comparisons within each batch to drive learning.

This approach sidesteps several of RLHF’s failure modes:

No reward model means no reward hacking against a fixed evaluator
Relative scoring is more stable than absolute reward prediction
The policy can improve beyond the capabilities of any fixed reward model

The Cold Start Problem—Solved Differently

Vellum’s detailed breakdown of the R1 training process reveals one of the most interesting aspects of the approach: the cold start.

DeepSeek started with just 5,000 long chain-of-thought examples. Not millions. Not hundreds of thousands. Five thousand carefully curated demonstrations of extended reasoning.

From this modest seed, they used rejection sampling to generate 600,000 synthetic training examples from the model’s own improved outputs. This is self-distillation at scale—the model teaching itself by selecting its own best reasoning traces and learning from them.

Five thousand examples became six hundred thousand. The model’s ceiling is no longer bounded by human annotation capacity—it’s bounded by its own ability to recognize quality in its outputs.

The Numbers That Matter

Performance claims are easy to make. Let’s look at what DeepSeek actually demonstrated:

MMLU: 90.8% — This is GPT-4 class performance on the most widely-used benchmark for general knowledge and reasoning
GPQA Diamond: 71.0% — Up from 15.6% before RL training. This benchmark tests graduate-level science reasoning, and the improvement is staggering
Test-time compute scaling: The model can allocate additional reasoning time on difficult problems, improving performance without any additional training

Jay Alammar’s illustrated guide to DeepSeek-R1 provides excellent visualizations of how this test-time compute scaling works in practice. The model learns to “think longer” when problems demand it—a capability that emerges from the RL training process.

Why This Changes Your Technical Roadmap

If you’re running an ML team in 2025, here’s what the DeepSeek results mean for your planning:

The Complexity Tax is Real—And Avoidable

Every layer of complexity in your training pipeline is a layer that can break. Every separate model you maintain is a model that can drift, degrade, or develop subtle bugs that corrupt downstream training.

RLHF’s three-stage pipeline gives you three opportunities for compounding errors. DPO-style approaches give you one less. GRPO-style approaches—where the reward signal comes from relative comparisons within batches—give you a different error profile entirely.

For teams operating without the engineering depth of OpenAI or Anthropic, this matters. The simpler approach isn’t just faster—it’s more debuggable, more reproducible, and more forgiving of the inevitable mistakes that happen in complex ML systems.

Annotation Economics Have Shifted

Traditional RLHF requires continuous human feedback. You need annotators comparing outputs, labeling preferences, and providing the data that trains your reward model. As your model improves, you need annotators who can evaluate increasingly sophisticated outputs—which means more expensive expertise.

The synthetic data approach DeepSeek demonstrated—using rejection sampling to generate training examples from the model’s own outputs—breaks this dependency. RLAIF (Reinforcement Learning from AI Feedback) reduces annotation costs by eliminating continuous human feedback requirements after initial training.

This is the path to alignment at scale without the linear cost growth of human annotation.

The Distillation Opportunity

One of the less-discussed aspects of the DeepSeek release: they also published distilled versions of R1’s capabilities into smaller models. The reasoning patterns learned by the large model can be compressed into architectures that run on more modest hardware.

For production deployment, this is crucial. Your research lab can train the largest models with the most sophisticated techniques. But the models that actually serve users—the ones running inference at scale—often need to be smaller, faster, and cheaper to operate.

The combination of advanced RL training for the large model plus distillation for the deployment model is a powerful pattern. And it’s one that doesn’t require RLHF at either stage.

The Broader Landscape: RLAIF, Constitutional AI, and What’s Next

DeepSeek’s approach isn’t happening in isolation. It’s part of a broader movement toward alignment techniques that scale better than traditional RLHF.

Constitutional AI: Principles Over Preferences

Anthropic’s Constitutional AI approach—where models are trained to follow explicit principles rather than learned from preference data—represents another departure from the RLHF paradigm. The model is given a “constitution” of rules and trained to self-critique and revise its outputs according to those rules.

This approach has its own tradeoffs, but it shares the key advantage of reducing dependency on continuous human annotation.

RLAIF: AI Evaluating AI

Reinforcement Learning from AI Feedback uses another AI model (often a larger or differently-trained model) to provide the preference signals that would traditionally come from humans. This creates a scaling dynamic that human annotation can’t match—AI evaluators don’t get tired, don’t have inconsistent labeling, and can process vastly more examples.

The combination of these approaches—DPO for direct optimization, RLAIF for scalable feedback, and techniques like GRPO for reward-model-free RL—is creating a new post-training toolkit that doesn’t center RLHF.

What RLHF Still Does Well

To be clear: RLHF isn’t obsolete. There are scenarios where its approach still makes sense:

Novel alignment objectives: When you’re trying to capture a new type of human preference that hasn’t been well-characterized, the explicit reward modeling stage of RLHF can provide more transparency into what the model is learning
Fine-grained safety requirements: When you need very specific safety properties and have the resources to train dedicated reward models that encode those properties
Interpretability requirements: The separate reward model in RLHF can be analyzed independently, providing insights into what preferences the system has learned

But these are increasingly edge cases. For the majority of production alignment work—getting models to be helpful, follow instructions, maintain appropriate tone—the simpler approaches are proving sufficient.

Implementation Considerations for Production Teams

If you’re considering moving from RLHF to DPO or related approaches, here’s a practical framework:

When to Stay with RLHF

You have an existing RLHF pipeline that’s working well and stable
You have specific safety requirements that your current reward model encodes
You need the interpretability of a separate reward model for compliance or auditing
You’re working on novel alignment objectives with poorly-characterized preferences

When to Consider DPO/GRPO

You’re building a new alignment pipeline from scratch
Training time and compute costs are significant constraints
You’re struggling with PPO instabilities or reward hacking
Your team doesn’t have deep RL expertise
You want to experiment with synthetic data and self-distillation approaches

Migration Path

For teams looking to transition:

Start with evaluation: Before changing anything, establish robust evals that capture the behaviors you care about. You need to measure whether the new approach maintains the properties RLHF was providing.
Run parallel experiments: Train with both approaches on the same base model and preference data. Compare outputs qualitatively and quantitatively.
Focus on failure modes: Pay special attention to edge cases and safety-relevant behaviors. DPO and RLHF can fail differently—make sure the new failure modes are acceptable.
Consider hybrid approaches: Some teams are finding success using DPO for initial alignment and reserving RLHF-style fine-tuning for specific safety properties.

The Strategic Picture: What DeepSeek Signals About the Industry

Beyond the technical details, DeepSeek R1 carries strategic implications that deserve attention.

The Gap is Closing—Faster

A Chinese lab produced a model competitive with GPT-4 using techniques that are, in some ways, more sophisticated than what Western labs have published. They did it with what appears to be a smaller budget and a different approach to training.

This isn’t about nationalism or competition—it’s about the democratization of frontier capabilities. When multiple labs can reach similar performance levels using different techniques, it suggests the problem is more solved than we realized. The moat isn’t in RLHF expertise anymore.

Open Publication Matters

DeepSeek published their technical report. They released model weights for distilled versions. They explained their training approach in detail.

This transparency accelerated the field’s understanding of what’s possible. Other teams can now build on their insights rather than independently rediscovering the same techniques.

The contrast with increasingly closed Western labs is striking. When major advances come with detailed technical reports, the entire ecosystem moves faster.

Synthetic Data Changes Everything

The move from 5,000 seed examples to 600,000 synthetic training examples represents a fundamental shift in how we think about data collection for alignment.

Human annotation doesn’t scale linearly with model capability. But model-generated data does. As models get better at producing high-quality outputs, they can increasingly train on their own best work.

This creates a feedback loop that previous approaches couldn’t access. The ceiling isn’t human annotation capacity anymore—it’s the model’s ability to recognize and select quality from its own outputs.

Looking Forward: The 2025 Post-Training Landscape

Here’s what I expect to see over the next 12 months:

DPO Becomes Default for Non-Frontier Work

For the majority of production fine-tuning—where you’re adapting models for specific domains or use cases—DPO or similar approaches will become the default choice. The complexity overhead of RLHF simply won’t be justified for these applications.

GRPO-Style Approaches Spread

The group relative optimization approach that DeepSeek demonstrated is surprisingly general. Expect to see variations applied to different domains—code generation, reasoning, creative tasks. The key insight—using relative rankings within batches rather than absolute reward models—is broadly applicable.

Synthetic Data Pipelines Mature

The rejection sampling approach to generating training data will become more sophisticated. Teams will develop better techniques for:

Selecting high-quality outputs from model generations
Creating diverse training distributions that avoid mode collapse
Maintaining safety properties through synthetic data generation

RLHF Specialists Pivot

Teams that built deep expertise in RLHF—reward modeling, PPO tuning, preference data collection—will need to adapt. The skills transfer somewhat, but the specific tooling and techniques will be less central than they were.

This is the normal pattern of technical evolution. The PPO experts of 2023 become the DPO/GRPO experts of 2025. The underlying understanding of what makes models behave well remains valuable even as the implementation details change.

The Uncomfortable Truth

Here’s what I tell clients when they ask about their alignment roadmap:

If you’re building RLHF infrastructure in 2025, you should have a specific reason that’s not “that’s what OpenAI did in 2022.”

The field moved. DeepSeek R1 proved it’s possible to reach frontier performance without the traditional RLHF pipeline. DPO and related approaches offer simpler implementation, faster training, and more stable optimization.

This doesn’t mean RLHF is wrong or that everyone should abandon it immediately. It means the default choice has shifted. The burden of proof now falls on teams choosing the more complex approach to justify that complexity.

For most production applications—for most teams—the simpler path leads to the same destination. DeepSeek proved that at frontier scale. The implications for everyone else are even more pronounced.

What This Means for Your ML Team

The practical takeaways:

Audit your current pipeline: If you’re using RLHF, ask whether the complexity is actually buying you something. If you can’t articulate what, experiment with simpler approaches.
Invest in evaluation: Whatever approach you use, robust evals are essential. The confidence to simplify comes from knowing you can measure the results.
Watch the synthetic data space: The move toward model-generated training data is accelerating. Build capabilities to generate, filter, and learn from synthetic examples.
Stay technically current: The techniques that define best practice are shifting faster than at any point in the field’s history. What worked 18 months ago may already be obsolete.

The teams that will win the next phase of AI development are the ones that recognize paradigm shifts while they’re happening—not after the transition is complete.

DeepSeek R1 is that kind of signal. Pure RL with relative optimization. Self-distillation at scale. Synthetic data replacing human annotation.

The coup against RLHF isn’t coming. It already happened. The only question is whether your organization has noticed yet.

DPO and GRPO aren’t just alternatives to RLHF—they’re what alignment looks like when you optimize for results rather than legacy, and DeepSeek R1 just proved that simpler architectures can match frontier performance at a fraction of the engineering cost.

Artur Markus

onNovember 27, 2025

What are You Looking For?

Why Direct Preference Optimization (DPO) Is Quietly Killing RLHF—And What DeepSeek R1 Just Proved

The Silent Coup in AI Alignment

Understanding the Technical Stakes

The RLHF Pipeline: Elegant But Expensive

Direct Preference Optimization: Collapsing the Stack

DeepSeek R1: The Existence Proof

Pure RL, No Traditional RLHF

The Cold Start Problem—Solved Differently

The Numbers That Matter

Why This Changes Your Technical Roadmap

The Complexity Tax is Real—And Avoidable

Annotation Economics Have Shifted

The Distillation Opportunity

The Broader Landscape: RLAIF, Constitutional AI, and What’s Next

Constitutional AI: Principles Over Preferences

RLAIF: AI Evaluating AI

What RLHF Still Does Well

Implementation Considerations for Production Teams

When to Stay with RLHF

When to Consider DPO/GRPO

Migration Path

The Strategic Picture: What DeepSeek Signals About the Industry

The Gap is Closing—Faster

Open Publication Matters

Synthetic Data Changes Everything

Looking Forward: The 2025 Post-Training Landscape

DPO Becomes Default for Non-Frontier Work

GRPO-Style Approaches Spread

Synthetic Data Pipelines Mature

RLHF Specialists Pivot

The Uncomfortable Truth

What This Means for Your ML Team

The Next Leap in AI: Agentic AI's Move from Task Automation to Autonomous Goal-Directed Systems in 2025

The Federal Preemption War: Why Trump's Attack on State AI Laws Is Creating a Constitutional Crisis for Enterprise Compliance

Read Next

The Federal Preemption War: Why Trump’s Attack on State AI Laws Is Creating a Constitutional Crisis for Enterprise Compliance

The NSA’s January 2025 Durable Content Credentials Push: Why Watermarking Is Now a National Security Imperative—Not Just an Ethics Exercise

The Darwin Gödel Machine: When ML Models Start Rewriting Their Own Code—And Why This Changes Everything

Why Direct Preference Optimization (DPO) Is Quietly Killing RLHF—And What DeepSeek R1 Just Proved

The Silent Coup in AI Alignment

Understanding the Technical Stakes

The RLHF Pipeline: Elegant But Expensive

Direct Preference Optimization: Collapsing the Stack

DeepSeek R1: The Existence Proof

Pure RL, No Traditional RLHF

The Cold Start Problem—Solved Differently

The Numbers That Matter

Why This Changes Your Technical Roadmap

The Complexity Tax is Real—And Avoidable

Annotation Economics Have Shifted

The Distillation Opportunity

The Broader Landscape: RLAIF, Constitutional AI, and What’s Next

Constitutional AI: Principles Over Preferences

RLAIF: AI Evaluating AI

What RLHF Still Does Well

Implementation Considerations for Production Teams

When to Stay with RLHF

When to Consider DPO/GRPO

Migration Path

The Strategic Picture: What DeepSeek Signals About the Industry

The Gap is Closing—Faster

Open Publication Matters

Synthetic Data Changes Everything

Looking Forward: The 2025 Post-Training Landscape

DPO Becomes Default for Non-Frontier Work

GRPO-Style Approaches Spread

Synthetic Data Pipelines Mature

RLHF Specialists Pivot

The Uncomfortable Truth

What This Means for Your ML Team

The Next Leap in AI: Agentic AI's Move from Task Automation to Autonomous Goal-Directed Systems in 2025

The Federal Preemption War: Why Trump's Attack on State AI Laws Is Creating a Constitutional Crisis for Enterprise Compliance

Read Next

The Federal Preemption War: Why Trump’s Attack on State AI Laws Is Creating a Constitutional Crisis for Enterprise Compliance

The NSA’s January 2025 Durable Content Credentials Push: Why Watermarking Is Now a National Security Imperative—Not Just an Ethics Exercise

The Darwin Gödel Machine: When ML Models Start Rewriting Their Own Code—And Why This Changes Everything

Subscribe to my Blog