Going From Technically Coherent to Genuinely Useful

Reinforcement learning from human feedback (RLHF) is a core reason modern AI assistants actually feel usable. It lets a language model follow nuanced instructions, decline harmful requests, and generate outputs that feel calibrated to human judgment rather than just statistically likely. If you've ever wondered how a model goes from "technically coherent but often wrong" to "genuinely useful," RLHF is most of the answer.

Understanding how RLHF works matters beyond academic curiosity. Agencies and professionals who grasp the mechanics can evaluate AI tools with better judgment, design feedback loops that improve their custom models, and avoid the failure modes that plague teams who treat AI outputs as either magic or garbage. This article lays out the process in concrete, sequential steps — from the raw pretrained model to a fine-tuned system that behaves the way you actually want.

RLHF has three core training stages: supervised fine-tuning, reward model training, and policy optimization via reinforcement learning. Each stage feeds the next, and skipping or rushing any one of them produces predictable, correctable failures. The six phases below cover those three stages plus the work around them: choosing a base model, collecting comparison data, and evaluating the result.

Phase 1: Start With a Pretrained Base Model

RLHF doesn't start from scratch. You begin with a model that already has broad language capability — a foundation model pretrained on large text corpora. This base model can complete sentences, answer questions, and summarize text, but it's not aligned to any specific behavior or tone. It's a raw capability engine.

What "Pretrained" Actually Means

Pretraining teaches the model to predict the next token in a sequence, using billions of documents as training data. The result is a model with embedded knowledge of grammar, facts, reasoning patterns, and style. It also embeds whatever biases, errors, and problematic content appear in that training data. Keep this in mind: you inherit those artifacts when you start fine-tuning. Neural Networks: Myths vs Reality covers common misconceptions about what these base models actually "know" versus what they're pattern-matching.
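As a rough illustration of the next-token objective, the sketch below computes the cross-entropy loss for a single prediction over a toy four-word vocabulary. The vocabulary, the scores, and the function names are invented for illustration; a real model does the same calculation over tens of thousands of tokens at every position.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy loss: negative log-probability of the true next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical vocabulary and model scores for the context "the cat sat on the"
vocab = ["mat", "dog", "moon", "sat"]
logits = [3.0, 0.5, 0.1, -1.0]

loss_likely = next_token_loss(logits, vocab.index("mat"))
loss_unlikely = next_token_loss(logits, vocab.index("moon"))
print(f"loss if the true next token is 'mat':  {loss_likely:.3f}")
print(f"loss if the true next token is 'moon': {loss_unlikely:.3f}")
```

The loss is small when the model already gave the true token a high probability and large when it did not. Pretraining is this comparison repeated across the whole corpus.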

Selecting Your Starting Point

For most agency practitioners, you're choosing from an existing open-weight model (Llama 3, Mistral, Falcon, etc.) or working through an API with a provider that lets you apply fine-tuning. Your choice should be driven by:

Parameter count: Larger models (7B–70B) handle complex tasks better but cost more to run and fine-tune
License terms: Commercial use requirements vary significantly across open-weight models
Context window: Determines how much input text the model can process in a single pass
Community support: Active fine-tuning communities mean better tooling and documented failure modes
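To make the parameter-count trade-off concrete, here is a back-of-the-envelope estimate of the memory needed just to hold a model's weights. It assumes the usual bytes-per-parameter for each precision and ignores activations, optimizer state, and inference caches, so treat the results as a lower bound rather than a sizing guide.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for size in (7e9, 70e9):
    for precision in ("fp16", "int4"):
        gb = weight_memory_gb(size, precision)
        print(f"{size / 1e9:.0f}B params at {precision}: ~{gb:.1f} GB")
```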

Phase 2: Supervised Fine-Tuning (SFT)

Before you can train a reward model or run reinforcement learning, you need the base model to produce reasonable outputs in your target domain. Supervised fine-tuning accomplishes this by training the model on a curated set of input-output pairs where a human has written the ideal response.

Building Your SFT Dataset

This is where quality matters more than quantity. A dataset of 1,000–5,000 high-quality demonstrations typically outperforms 50,000 noisy ones. Each example should include:

A realistic prompt representing actual use cases you care about
A human-written response that exemplifies exactly what "good" looks like

For an agency building a client communication assistant, "good" might mean: professional tone, no hedging language, responses under 150 words, proactive next-step suggestion. Every example in your SFT dataset should embody those criteria — not just occasionally, but consistently.
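One way to enforce that consistency is to run every demonstration through a simple checker before it enters the dataset. The sketch below is a minimal version under stated assumptions: the `prompt`/`response` field names, the hedging phrase list, and the "next step" keyword test are all hypothetical stand-ins for whatever criteria you actually define, and keyword checks like these only catch the obvious violations.

```python
import json

# Hypothetical style criteria taken from the client-communication example above.
MAX_WORDS = 150
HEDGING_PHRASES = ("i think", "maybe", "perhaps", "it might be")

def check_demonstration(example):
    """Return a list of criteria violations for one prompt/response pair."""
    problems = []
    response = example["response"]
    lowered = response.lower()
    if len(response.split()) > MAX_WORDS:
        problems.append("too long")
    if any(phrase in lowered for phrase in HEDGING_PHRASES):
        problems.append("hedging language")
    if "next step" not in lowered:  # crude keyword proxy, not a real classifier
        problems.append("no next-step suggestion")
    return problems

good = {
    "prompt": "The client asked when the revised proposal will arrive.",
    "response": "The revised proposal will reach you by Thursday. "
                "As a next step, please confirm the budget ceiling so we can finalize scope.",
}
bad = {
    "prompt": "The client asked when the revised proposal will arrive.",
    "response": "I think we can maybe get it done soon.",
}

for example in (good, bad):
    print(json.dumps(example["response"]), "->", check_demonstration(example) or "ok")
```

Anything the checker flags goes back to the writer. Anything it passes still needs a human read, since tone is not something a keyword list can judge.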

Running Supervised Fine-Tuning

SFT is standard gradient descent on your demonstration data. Parameter-efficient methods like LoRA (Low-Rank Adaptation) dramatically reduce compute requirements, usually with little quality loss. You're not retraining all parameters. Instead, the base model's weights stay frozen and you train small low-rank adapter matrices added alongside them.
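The idea behind LoRA can be sketched in a few lines of plain Python. The frozen weight matrix W is left alone, and training only touches two thin matrices, B and A, whose product is added to W with a scaling factor of alpha divided by the rank. The matrix sizes and helper names below are illustrative assumptions, not any library's API.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, alpha):
    """Effective weight W + (alpha / rank) * B @ A. W stays frozen; only A and B train."""
    rank = len(A)  # A has shape (rank, d_in), B has shape (d_out, rank)
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full weight matrix versus the two LoRA factors."""
    return d_out * d_in, rank * (d_out + d_in)

W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]              # rank 1
B_init = [[0.0], [0.0]]        # B starts at zero, so the adapter is a no-op at first
B_trained = [[1.0], [2.0]]     # made-up values standing in for a trained adapter

print(lora_effective_weight(W, A, B_init, alpha=2))
print(lora_effective_weight(W, A, B_trained, alpha=2))

full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%})")
```

For a single 4096 by 4096 layer at rank 8, the adapter trains well under 1% of the parameters a full update would touch, which is where the compute savings come from.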

Key settings to watch:

Learning rate: Typically 1e-5 to 3e-5 for full-parameter instruction fine-tuning; LoRA adapters usually use higher rates, often around 1e-4
Epochs: 2–4 passes through your dataset; more risks overfitting
Evaluation set: Hold out 10–15% of your demonstrations to measure generalization
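The settings above might be captured like this. The config keys are hypothetical names rather than any framework's real arguments, the values are picked from the ranges listed, and the split helper is a minimal sketch of holding out an evaluation set.

```python
import random

# Illustrative values drawn from the ranges above; not tuned for any specific model.
sft_config = {"learning_rate": 2e-5, "num_epochs": 3, "eval_fraction": 0.1}

def split_holdout(examples, eval_fraction, seed=0):
    """Shuffle once with a fixed seed, then carve off the evaluation slice."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

demos = [{"prompt": f"prompt {i}", "response": f"response {i}"} for i in range(1000)]
train_set, eval_set = split_holdout(demos, sft_config["eval_fraction"])
print(len(train_set), len(eval_set))  # 900 100
```

Fixing the seed keeps the evaluation set identical across runs, so a change in eval loss reflects the training run rather than a reshuffled split.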

After SFT, you have a model that can generate plausible outputs in your domain. What you don't have yet is a model that reliably ranks its own good outputs above its bad ones. That's what the reward model addresses.

Phase 3: Collect Comparison Data for the Reward Model

This is the "human feedback" part of RLHF, and it's operationally the most intensive phase for teams without ML infrastructure. You're generating pairs of model outputs and having humans rank which one is better.

Designing the Comparison Interface

Don't hand annotators a blank rubric. Give them structured criteria tied to your actual goals:

Helpfulness: Does the response answer the question or complete the task?
Harmlessness: Does it avoid unsafe, misleading, or inappropriate content?
Honesty: Does it accurately represent uncertainty rather than confabulating?
Style fit: Does it match the tone and format your use case requires?

Annotators should choose between two (or more) sampled responses to the same prompt. Paired with good labeling guidelines, structured criteria can cut inter-annotator disagreement from a typical 30–40% to roughly 10–20%.
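To know where your own annotators fall in that range, you need to measure it. The sketch below assumes each comparison was labeled by several annotators who each picked response "A" or "B", and computes the share of annotator pairs that agreed. The record layout and the sample votes are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(votes_per_item):
    """Fraction of annotator pairs that picked the same response, across all items."""
    agree = 0
    total = 0
    for votes in votes_per_item:
        for first, second in combinations(votes, 2):
            total += 1
            if first == second:
                agree += 1
    return agree / total if total else 0.0

# Each inner list holds three annotators' choices for one prompt (made-up data).
votes_per_item = [
    ["A", "A", "A"],
    ["A", "B", "A"],
    ["B", "B", "B"],
    ["A", "B", "B"],
]

agreement = pairwise_agreement(votes_per_item)
print(f"agreement: {agreement:.0%}, disagreement: {1 - agreement:.0%}")
```

Raw agreement is the simplest measure. Chance-corrected statistics such as Cohen's kappa are stricter, but this number is enough to tell whether a guideline change moved things in the right direction.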

How Many Comparisons Do You Need?

Typical RLHF training runs use anywhere from 10,000 to 150,000 comparison pairs. For a domain-specific fine-tune, you can often get meaningful reward signal from 5,000–15,000 well-curated comparisons. Volume is not the bottleneck — annotation quality and prompt diversity are.

Ensure your prompts span the distribution of real inputs. If your assistant will face both simple and complex queries, adversarial requests, and off-topic questions, all of those need representation in your comparison data.

Phase 4: Train the Reward Model

The reward model is a scoring model trained on your comparison data. It takes a prompt plus a response and outputs a scalar score representing how good that response is, according to your human annotators' revealed preferences.

Architecture Choices

Start with the same base model you used for SFT (or a similarly sized model), then replace the language modeling head with a regression head that outputs a single score. Training objective: the reward model should assign a higher score to the preferred response than the rejected response in each pair.

The Bradley-Terry model is the most common framework here — it treats each comparison as a probability that response A beats response B given their respective reward scores. This is differentiable and plays well with standard training pipelines.
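In code, the Bradley-Terry formulation comes down to a sigmoid over the difference of the two reward scores, with the training loss being the negative log of that probability. The function names and sample scores below are illustrative; a real pipeline computes the same quantity over batches of tensors.

```python
import math

def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def pairwise_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the human preference under the current scores."""
    return -math.log(preference_probability(r_chosen, r_rejected))

# Made-up reward scores for three comparison pairs: (chosen, rejected)
for r_chosen, r_rejected in [(2.0, 0.0), (0.5, 0.5), (0.0, 2.0)]:
    p = preference_probability(r_chosen, r_rejected)
    loss = pairwise_loss(r_chosen, r_rejected)
    print(f"chosen={r_chosen:+.1f} rejected={r_rejected:+.1f}  P(chosen wins)={p:.3f}  loss={loss:.3f}")
```

Only the gap between the two scores matters, not their absolute values. The loss shrinks as the reward model ranks the preferred response further above the rejected one.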

Validating the Reward Model

Before moving to RL, test whether your reward model actually captures human preferences or just learned surface patterns. Run held-out comparisons through it and check agreement with human labels. Aim for 70–80%+ agreement. Anything below 60% suggests your annotation criteria were unclear, your training data was too noisy, or the model needs more capacity.
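A minimal version of that held-out check is sketched below. It assumes you already have reward-model scores for the human-preferred and human-rejected response in each held-out pair; the scores are made up, and the verdict thresholds simply mirror the figures in the paragraph above.

```python
def preference_accuracy(scored_pairs):
    """Share of held-out pairs where the reward model scores the human-preferred response higher."""
    correct = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return correct / len(scored_pairs)

def verdict(accuracy):
    """Map agreement with human labels onto the thresholds discussed above."""
    if accuracy < 0.60:
        return "stop: revisit annotation criteria, data noise, or model capacity"
    if accuracy < 0.70:
        return "borderline: inspect the disagreements before running RL"
    return "proceed to policy optimization"

# (score for preferred response, score for rejected response), hypothetical values
held_out = [(1.2, 0.3), (0.1, 0.9), (2.0, 1.5), (0.4, 0.2), (-0.5, -1.0)]

accuracy = preference_accuracy(held_out)
print(f"{accuracy:.0%} agreement -> {verdict(accuracy)}")
```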

This is a critical checkpoint. A poorly calibrated reward model poisons everything downstream. The RL stage will optimize hard against whatever signal you give it — if that signal is wrong, the model will become very good at satisfying the wrong objective. The Hidden Risks of Neural Networks (and How to Manage Them) addresses exactly these kinds of specification failures in more depth.

Phase 5: Policy Optimization With RL (PPO)

This is the reinforcement learning step. You use the reward model as an environment that scores outputs, and you update the language model's weights to generate responses that score higher. The most common algorithm for this is Proximal Policy Optimization (PPO).

How PPO Works in RLHF

The language model is the "policy" — it generates a sequence of tokens (actions) given a prompt (state). For each generated response, the reward model produces a scalar score. PPO then updates the policy to increase the probability of high-scoring responses and decrease the probability of low-scoring ones, while constraining how far each update can move from the previous policy.
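The "constraining how far each update can move" part is usually implemented as PPO's clipped surrogate objective. The sketch below shows that objective for a single action, assuming the advantage has already been computed elsewhere. The 0.2 clip range is a commonly used default, not something this article prescribes.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO objective for one action: the smaller of the raw and clipped policy-ratio terms."""
    ratio = math.exp(logp_new - logp_old)  # how much more likely the action is under the new policy
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# A good action (positive advantage) that the new policy already favors much more:
print(clipped_surrogate(logp_new=-1.0, logp_old=-1.5, advantage=1.0))

# Same policies, no change in probability: the objective is just the advantage.
print(clipped_surrogate(logp_new=-1.5, logp_old=-1.5, advantage=1.0))
```

In the first call the probability ratio is about 1.65, but the objective is capped at 1.2. Past the clip range, making a good action even more likely earns no extra credit, which is what keeps each update small.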

The constraint matters. Unconstrained optimization causes reward hacking — the model finds degenerate outputs that score high on the reward model but fail in practice. Common failure modes include:

Extremely long outputs (if length correlates with reward in your data)
Repetitive, sycophantic responses
Confident-sounding but factually wrong answers

The KL Penalty: Your Safety Rail

PPO in RLHF applies a KL divergence penalty between the updated policy and the original SFT model. This keeps the optimized model from drifting too far from its SFT starting point, which limits reward hacking while still leaving room to improve behavior.

Tune the KL coefficient (typically labeled β) carefully:

Too low: Reward hacking and capability degradation
Too high: Minimal improvement over the SFT baseline

Typical starting values range from 0.01 to 0.1. Adjust based on how quickly your evaluation metrics plateau.
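One common way to apply the penalty is to subtract β times an estimate of the KL divergence from the reward model's score before PPO sees it. The sketch below uses the sum of per-token log-probability differences between the policy and the frozen SFT reference as that estimate. The token log-probabilities are made up, and real implementations often apply the penalty per token rather than per response.

```python
def kl_shaped_reward(reward_score, logp_policy, logp_reference, beta=0.05):
    """Reward-model score minus beta times a sampled KL estimate for one response."""
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_reference))
    return reward_score - beta * kl_estimate

# Hypothetical per-token log-probabilities for one three-token response.
logp_policy = [-0.5, -0.2, -0.1]      # the policy has grown confident in these tokens
logp_reference = [-1.0, -0.9, -0.6]   # the frozen SFT model was less sure

for beta in (0.01, 0.05, 0.1):
    shaped = kl_shaped_reward(2.0, logp_policy, logp_reference, beta=beta)
    print(f"beta={beta}: shaped reward = {shaped:.3f}")
```

The further the policy's token probabilities move from the reference, the larger the deduction, and β sets how expensive that drift is.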

Phase 6: Evaluate, Iterate, and Watch for Regressions

After a training run, you need structured evaluation before you deploy or run another iteration. Automated metrics alone (perplexity, ROUGE scores) will not tell you if the model is actually better for your use case.

Build a Golden Test Set

Create a fixed evaluation set of 200–500 prompts representing your full distribution of use cases, including edge cases and adversarial inputs. Run human evaluation on model outputs from each training iteration. Track:

Task success rate (did the model complete what was asked?)
Preference win rate vs. the previous checkpoint
Failure mode frequency (how often does it confabulate, refuse inappropriately, or break tone?)
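Once human evaluators have compared the new checkpoint against the previous one on every golden-set prompt, the tracking itself is simple arithmetic. The sketch below assumes each judgment is recorded as "new", "old", or "tie" and counts a tie as half a win, which is one convention among several. The sample counts are invented.

```python
from collections import Counter

def win_rate(judgments):
    """Preference win rate for the new checkpoint; ties count as half a win."""
    counts = Counter(judgments)
    return (counts["new"] + 0.5 * counts["tie"]) / len(judgments)

def failure_frequency(failure_flags):
    """Share of evaluated outputs tagged with each failure mode."""
    total = len(failure_flags)
    counts = Counter(flag for flags in failure_flags for flag in flags)
    return {mode: count / total for mode, count in counts.items()}

# Hypothetical results from a 200-prompt golden set.
judgments = ["new"] * 120 + ["old"] * 60 + ["tie"] * 20
failure_flags = [["confabulation"]] * 8 + [["tone"]] * 4 + [[]] * 188

print(f"win rate vs. previous checkpoint: {win_rate(judgments):.0%}")
print(failure_frequency(failure_flags))
```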

When to Stop Iterating

More RL is not always better. After 2–4 training iterations, most systems experience diminishing returns and increasing risk of capability regression. Signs it's time to stop optimizing and focus on deployment:

Preference win rate improvements drop below 3–5% per iteration
New failure modes appear in categories that were previously stable
Human evaluators report the model feels "over-optimized" — too careful, too hedged, or personality-flattened
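These signals can be turned into an explicit check that runs after every iteration. The sketch below is one possible reading of the criteria: it treats "improvement" as the win rate against the previous checkpoint minus the 50% break-even point, and uses 3% as the threshold. Both choices are assumptions you should adapt to how you measure progress.

```python
def stop_signals(win_rate_vs_previous, new_failure_categories, feels_over_optimized, min_gain=0.03):
    """Return the reasons to stop iterating; an empty list means keep going."""
    reasons = []
    gain = win_rate_vs_previous - 0.5  # assumption: gain is measured over a coin-flip baseline
    if gain < min_gain:
        reasons.append(f"win-rate gain of {gain:.1%} is below {min_gain:.0%}")
    if new_failure_categories:
        reasons.append("new failure modes: " + ", ".join(new_failure_categories))
    if feels_over_optimized:
        reasons.append("evaluators report an over-optimized feel")
    return reasons

# Hypothetical results from two consecutive iterations.
print(stop_signals(0.65, [], False))
print(stop_signals(0.52, ["inappropriate refusals"], False))
```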

For a deeper view of how to build these iterative workflows into your production practice, Building a Repeatable Workflow for Neural Networks is directly applicable.

Direct Preference Optimization: A Simpler Alternative

PPO is compute-intensive and operationally complex. Direct Preference Optimization (DPO) has emerged as a viable alternative that skips the separate reward model entirely, using your comparison data to directly fine-tune the language model. DPO is easier to implement, requires less infrastructure, and often achieves comparable results for domain-specific applications.
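The DPO objective for a single comparison pair can be written in a few lines. It compares how much more the policy favors the chosen response than the frozen reference model does, relative to the rejected response, and pushes that margin up. The log-probabilities below are made-up sequence totals, and β = 0.1 is a commonly used default rather than a recommendation from this article.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)), written in a stable form."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))

# Sequence log-probabilities under the frozen reference (SFT) model, hypothetical values.
ref_chosen, ref_rejected = -10.0, -12.0

# Before training, the policy equals the reference, so the margin is zero.
print(dpo_loss(-10.0, -12.0, ref_chosen, ref_rejected))

# After some training, the policy favors the chosen response more than the reference does.
print(dpo_loss(-8.0, -14.0, ref_chosen, ref_rejected))
```

No reward model appears anywhere in the calculation: the comparison data and a frozen copy of the SFT model are the only inputs, which is the source of DPO's simplicity.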

The trade-off: DPO is less flexible for ongoing optimization. You're baking preferences into the weights directly, which makes real-time feedback loops harder to incorporate. For agencies doing one-time or periodic fine-tunes, DPO is often the right call. For teams building systems where human feedback continues to flow, PPO's separation of reward model and policy remains valuable.

Frequently Asked Questions

Do I need massive infrastructure to run RLHF?

Not for domain-specific fine-tuning. Using LoRA adapters, a 7B-parameter model can be fine-tuned on a single A100 GPU for the SFT and reward model phases. The PPO phase is more resource-intensive, because the policy, a frozen reference model, and the reward model all have to be available during training, but it can be run with frameworks like TRL (Transformer Reinforcement Learning) on consumer-grade GPU clusters. Teams with modest budgets can complete a full RLHF run for a specific task in the range of $500–$5,000 in cloud compute, depending on model size and dataset volume.

How do I handle disagreement between annotators?

Disagreement is normal and informative. Track inter-annotator agreement per category — if annotators consistently disagree on, say, "tone fit," that's a signal your criteria need sharper definitions. For pairs with high disagreement, either adjudicate with a senior reviewer or discard them from training data entirely. Forcing noisy disagreements into your reward model adds inconsistency that compounds downstream.
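A small triage script can apply that policy consistently. The sketch below assumes each comparison carries a list of annotator votes and a category tag, and routes anything below an 80% majority to review. The threshold, field names, and sample data are all hypothetical.

```python
from collections import Counter

def majority_share(votes):
    """Fraction of annotators who agreed with the most common choice."""
    top_count = Counter(votes).most_common(1)[0][1]
    return top_count / len(votes)

def triage(comparisons, keep_at=0.8):
    """Split comparisons into ones to train on and ones to adjudicate or discard."""
    keep, review = [], []
    for item in comparisons:
        bucket = keep if majority_share(item["votes"]) >= keep_at else review
        bucket.append(item)
    return keep, review

comparisons = [
    {"id": 1, "category": "helpfulness", "votes": ["A", "A", "A", "A", "A"]},
    {"id": 2, "category": "tone fit", "votes": ["A", "B", "A", "B", "B"]},
    {"id": 3, "category": "tone fit", "votes": ["B", "B", "B", "B", "A"]},
]

keep, review = triage(comparisons)
print("train on:", [item["id"] for item in keep])
print("adjudicate or drop:", [item["id"] for item in review])
print("review load by category:", dict(Counter(item["category"] for item in review)))
```

The per-category count at the end is the part worth watching over time: a category that keeps landing in the review pile is the one whose definition needs work.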

Can RLHF make a model worse?

Yes, and it does so in predictable ways. Reward hacking — where the model exploits gaps in the reward model — is the most common failure. Sycophancy (the model telling users what they want to hear rather than what's accurate) is another. Both stem from optimizing too hard against an imperfect reward signal. The KL penalty and regular capability regression testing are your primary defenses.

What's the difference between RLHF and simple fine-tuning?

Standard supervised fine-tuning teaches the model to mimic specific outputs. RLHF teaches the model to optimize for human preference across a wide range of outputs — including situations not in the training data. SFT alone often produces a model that performs well on seen prompts but fails on variations. RLHF generalizes preference signals more robustly, at the cost of significantly more complexity. Neural Networks: The Questions Everyone Asks, Answered has a broader treatment of fine-tuning concepts if you want the full picture.

Is RLHF only relevant for large language models?

RLHF was popularized through LLMs but the framework applies to any model where human preference is the optimization target. It's been used in robotics, game-playing agents, and recommendation systems. For agencies, the most practical applications today are text-generation models, coding assistants, and image generation pipelines where human aesthetic judgment is the quality signal.

Key Takeaways

RLHF runs in three sequential phases: supervised fine-tuning, reward model training, and PPO-based policy optimization — each feeds the next
SFT dataset quality matters more than size; 1,000–5,000 well-crafted demonstrations consistently outperform much larger noisy sets
The reward model is your highest-leverage component — a miscalibrated reward model produces a model that's very good at satisfying the wrong objective
The KL penalty is not optional; it's what prevents reward hacking and preserves the base model's general capabilities
DPO is a legitimate, simpler alternative to PPO for teams doing periodic fine-tunes without ongoing feedback loops
Evaluation must include human judgment on a fixed golden test set — automated metrics won't surface the failure modes that matter in production
Diminishing returns appear quickly after 2–4 RL iterations; over-optimization produces sycophantic, capability-regressed models

Agency Script Editorial
