When Reward Scores Rise but the Shipped Model Fails

Measuring whether a model trained with reinforcement learning from human feedback is actually getting better is harder than it sounds. The loss curves look fine. The reward scores trend upward. Then you ship the model and discover it has learned to generate responses that feel satisfying to annotators without being accurate, useful, or safe. This gap between training signal and real-world quality is the central challenge of RLHF evaluation, and most teams don't instrument it well.

The problem is structural. RLHF stacks three components — a supervised fine-tuned (SFT) base, a reward model, and a policy trained via proximal policy optimization or a similar algorithm — and each layer can drift or fail independently. A metric that looks healthy at one layer can mask rot at another. Knowing which numbers to track, where to collect them, and how to interpret their interactions is what separates teams that ship reliable AI products from teams that are constantly surprised by their own models.

This article defines the reinforcement learning from human feedback metrics that actually matter, explains how to instrument them, and tells you what each signal means when it moves in the wrong direction.

The Three Measurement Layers You Must Separate

Before picking any specific KPI, establish which layer you're measuring. Conflating them is a common mistake that leads to debugging the wrong thing for weeks.

Layer 1: Reward Model Quality

The reward model is trained on human preference data — typically pairs of outputs where an annotator picks the better one. Its quality sets a ceiling on everything downstream.

Layer 2: Policy Training Dynamics

This is the optimization loop itself: how the policy improves (or exploits) the reward model over training steps.

Layer 3: Downstream Task Performance

This is the only layer that ultimately matters to users, but it's the most expensive to measure. You need all three layers instrumented because a problem at Layer 1 won't show up clearly until Layer 3, by which point you've wasted significant compute.

Reward Model Metrics

Preference Prediction Accuracy

The reward model's held-out preference prediction accuracy should be your first sanity check. A model trained on human preference pairs should predict the preferred output correctly at a rate meaningfully above 50%. Typical well-trained reward models land between 70% and 85% accuracy on held-out pairs from the same distribution. Below 65%, you don't have a reliable training signal.

Watch for:

Distribution shift: accuracy that's high on the training distribution but drops on out-of-distribution prompts
Annotator agreement rate: if your human annotators only agree ~60% of the time on your hardest examples, your reward model's ceiling is roughly there too
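
As a rough illustration of the accuracy check and the distribution-shift comparison above, the sketch below scores held-out pairs. It assumes you already have reward model scores for the chosen and rejected response in each pair; the function name and the example numbers are hypothetical.

```python
def preference_accuracy(pairs):
    """Fraction of held-out pairs where the reward model ranks the
    human-preferred response above the rejected one.

    `pairs` is a list of (chosen_score, rejected_score) tuples. The scores
    are assumed to come from your own reward model. Ties count as misses.
    """
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Hypothetical scores: one in-distribution set, one out-of-distribution set.
in_dist = [(1.2, 0.3), (0.8, 0.9), (2.1, 1.0), (0.5, -0.4)]
out_dist = [(0.2, 0.6), (1.1, 0.7), (0.0, 0.3), (0.4, 0.5)]

acc_in = preference_accuracy(in_dist)
acc_out = preference_accuracy(out_dist)
print(f"in-distribution accuracy: {acc_in:.2f}")
print(f"out-of-distribution accuracy: {acc_out:.2f}")
print(f"shift gap: {acc_in - acc_out:.2f}")
```

A large gap between the two numbers is the distribution-shift warning sign; real evaluation sets would of course be far larger than four pairs.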

Reward Score Distribution

Track the full distribution of reward scores across a fixed evaluation set at regular checkpoints — not just the mean. You want to see:

A distribution that's roughly bell-shaped, not bimodal
Mean and median staying close together (divergence suggests outlier exploitation)
Standard deviation that doesn't collapse over training (if everything converges to a narrow reward band, your model may be producing homogeneous outputs)
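
One way to make these three checks concrete is to log a small summary per checkpoint. This is a minimal sketch using only the standard library; the field names and the example scores are illustrative assumptions, and it does not test for bimodality, which is easier to judge from a histogram.

```python
import statistics


def reward_summary(scores):
    """Summarize reward scores on a fixed evaluation set for one checkpoint."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    median = statistics.median(scores)
    return {
        "mean": mean,
        "median": median,
        "stdev": statistics.stdev(scores),
        # A growing gap suggests a few outliers are pulling the mean.
        "mean_median_gap": mean - median,
    }


# Hypothetical scores where one outlier dominates the mean.
summary = reward_summary([0.1, 0.2, 0.3, 0.4, 5.0])
print(summary)
```

Comparing `stdev` across checkpoints shows whether the distribution is collapsing, and `mean_median_gap` shows whether the mean is drifting away from the median.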

Inter-Annotator Agreement (IAA)

IAA is a precondition metric, not a training metric, but it belongs in your dashboard. If annotators agree less than 65–70% of the time on your task-specific prompts, your reward model is being trained on noise. Measure Cohen's kappa or Krippendorff's alpha, not raw percentage agreement — raw agreement inflates the number when one label is much more common.
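
To see why raw agreement can mislead, here is a small sketch of Cohen's kappa for two annotators. The labels are made up for illustration; for production use you would likely rely on an established statistics library rather than this hand-rolled version.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, given each annotator's label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when only one label is ever used
    return (observed - expected) / (1 - expected)


# Hypothetical skewed task: label "A" dominates for both annotators.
annotator_1 = ["A"] * 9 + ["B"]
annotator_2 = ["A"] * 8 + ["B", "A"]

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa:.2f}")
```

In this example the annotators agree on 80% of items, yet kappa comes out slightly negative because almost all of that agreement is what chance alone would produce when one label is so common.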

Policy Training Dynamics Metrics

KL Divergence From the Reference Policy

This is the single most important training metric for RLHF stability. The policy is typically penalized for diverging too far from the SFT reference model, and for good reason: unconstrained optimization against any reward model will find and exploit its blind spots.

Track KL divergence (measured in nats) at every checkpoint:

Early training: expect KL to climb from 0 toward a steady state
Target range: most production RLHF runs stabilize KL between 5 and 15 nats for language models; significantly higher suggests reward hacking
Sudden spikes: a rapid KL increase without a corresponding improvement in your task metrics is a red flag for exploitation

The KL coefficient (often called β) controls this trade-off. If you're tuning it, treat changes as experiments with before/after measurement on both KL and downstream quality, not as knobs to twist until training looks stable.
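
A common way to estimate this quantity, sketched below under stated assumptions, is to sample responses from the policy and sum the per-token log-probability differences between the policy and the reference model. The sketch assumes natural-log probabilities (so the result is in nats) and a sequence-level penalty of the form reward minus β times KL; your trainer may apply the penalty per token instead. All function names and numbers are illustrative.

```python
def sequence_kl_estimate(policy_logprobs, reference_logprobs):
    """Single-sample estimate of KL(policy || reference) for one response, in nats.

    Both arguments hold natural-log probabilities of the same sampled tokens,
    one entry per token. The response must have been sampled from the policy.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))


def mean_kl(batch):
    """Average the per-response estimates over a batch of (policy, reference) pairs."""
    if not batch:
        raise ValueError("batch is empty")
    estimates = [sequence_kl_estimate(p, r) for p, r in batch]
    return sum(estimates) / len(estimates)


def penalized_reward(reward, kl, beta):
    """Reward after subtracting the KL penalty scaled by the coefficient beta."""
    return reward - beta * kl


# Hypothetical log-probs for two sampled responses.
batch = [
    ([-0.5, -1.0, -0.2], [-0.9, -1.5, -0.4]),
    ([-1.0, -0.3], [-1.2, -0.6]),
]
kl = mean_kl(batch)
print(f"mean KL estimate: {kl:.2f} nats")
print(f"penalized reward: {penalized_reward(2.0, kl, beta=0.1):.2f}")
```

Individual per-response estimates can be negative because each is a single sample; the average over a large batch is what should be logged and compared against the target range.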

Reward Score Over Training Steps

Plot mean reward score on your evaluation set against training steps. A healthy curve rises steeply early and then flattens into a plateau. What to watch for:

Monotonic rise that doesn't plateau: the policy may be reward hacking, producing outputs the reward model scores highly that humans wouldn't
Oscillation: often a sign of a learning rate that's too high or instability in the PPO clipping parameter
Early plateau followed by decline: can indicate catastrophic forgetting of the SFT base capabilities

Entropy of the Policy

Entropy measures output diversity. A policy that collapses to near-deterministic outputs has learned to exploit a narrow band of high-reward responses. Track token-level entropy on a diverse evaluation set. Typical language model policies should maintain entropy above 1.5–2.0 bits per token across varied prompts. If it drops significantly, add diversity penalties or check whether your reward model is inadvertently penalizing variance.
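
For reference, token-level entropy in bits can be computed from the policy's next-token distributions as in the sketch below. It assumes you can export full probability distributions at each position; the helper names and toy distributions are hypothetical.

```python
import math


def token_entropy_bits(probs):
    """Shannon entropy of one next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mean_entropy_bits(distributions):
    """Average entropy over all token positions in an evaluation sample."""
    if not distributions:
        raise ValueError("need at least one distribution")
    return sum(token_entropy_bits(d) for d in distributions) / len(distributions)


# Toy example: one uniform position (2 bits) and one deterministic position (0 bits).
positions = [
    [0.25, 0.25, 0.25, 0.25],
    [1.0, 0.0, 0.0, 0.0],
]
avg_entropy = mean_entropy_bits(positions)
print(f"mean entropy: {avg_entropy:.2f} bits per token")
```

If your framework reports entropy in nats, divide by ln 2 to convert to bits before comparing against a threshold expressed in bits.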

Downstream Task Performance Metrics

These are the metrics your users actually care about. They're expensive to run frequently, so most teams run them at major checkpoints (every 5–10% of training) rather than every step.

Win Rate Against a Baseline

Present your current policy's outputs and your baseline's outputs (usually the SFT model or the previous RLHF checkpoint) to human raters in a blind A/B format. Ask for a simple preference judgment. Win rate should be your north star metric.

Practical notes:

Use the same annotator pool and rubric you used to train the reward model — otherwise you're measuring annotator drift, not model improvement
A statistically meaningful win rate difference requires at least 200–400 rated pairs per evaluation; below that, noise dominates
Aim for at least a 55–60% win rate over your SFT baseline before declaring an RLHF run successful
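
To make the sample-size note concrete, the sketch below computes a Wilson score interval for a win rate. It assumes each rated pair yields a strict win or loss (ties are excluded or split beforehand) and uses z = 1.96 for an approximate 95% interval; the counts are made up.

```python
import math


def wilson_interval(wins, total, z=1.96):
    """Wilson score confidence interval for a win rate (wins out of total pairs)."""
    if total <= 0 or not 0 <= wins <= total:
        raise ValueError("need 0 <= wins <= total and total > 0")
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# The same 55% observed win rate at two hypothetical sample sizes.
low_small, high_small = wilson_interval(55, 100)
low_large, high_large = wilson_interval(220, 400)
print(f"100 pairs: [{low_small:.3f}, {high_small:.3f}]")
print(f"400 pairs: [{low_large:.3f}, {high_large:.3f}]")
```

With 100 pairs the interval still includes 50%, so a 55% win rate cannot be distinguished from a coin flip; with 400 pairs the lower bound sits just above 50%.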

Capability Degradation Benchmarks

RLHF is notorious for improving conversational quality while quietly degrading underlying capabilities — math, reasoning, instruction following on edge cases. Run your standard capability benchmarks (MMLU, GSM8K, HumanEval, or domain-specific equivalents) at each major checkpoint and track them alongside your reward metrics. This is non-negotiable. Machine learning best practices consistently emphasize that evaluation suites should be defined before training starts, not assembled retroactively.

Track a delta score: (benchmark score at checkpoint) − (benchmark score at SFT baseline). A delta that's drifting negative while reward scores improve is the signature of reward hacking. You're not improving the model; you're reshaping it.
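
The delta computation is simple enough to automate at every checkpoint. The sketch below assumes benchmark scores are stored as percentages in plain dictionaries; the scores and the regression tolerance are hypothetical values, not recommendations.

```python
def benchmark_deltas(baseline, checkpoint):
    """Delta per benchmark: checkpoint score minus SFT baseline score."""
    missing = set(baseline) - set(checkpoint)
    if missing:
        raise KeyError(f"checkpoint is missing benchmarks: {sorted(missing)}")
    return {name: checkpoint[name] - baseline[name] for name in baseline}


def regressions(deltas, tolerance=1.0):
    """Benchmarks whose score dropped by more than `tolerance` points."""
    return sorted(name for name, delta in deltas.items() if delta < -tolerance)


# Hypothetical scores in percentage points.
sft_baseline = {"MMLU": 61.0, "GSM8K": 42.5, "HumanEval": 30.0}
rlhf_checkpoint = {"MMLU": 60.5, "GSM8K": 38.0, "HumanEval": 30.5}

deltas = benchmark_deltas(sft_baseline, rlhf_checkpoint)
print(deltas)
print("regressed:", regressions(deltas))
```

Plotting these deltas on the same chart as mean reward makes the divergence pattern described above easy to spot.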

Safety and Refusal Calibration

Measure both over-refusal and under-refusal rates on a curated prompt set:

Over-refusal rate: the fraction of benign prompts the model refuses or hedges excessively
Under-refusal rate: the fraction of genuinely harmful prompts the model complies with

Both move during RLHF training. Human annotators tend to reward confident, direct responses, which can inadvertently reduce appropriate caution. Define thresholds for both rates before training and treat breaches as blockers, not minor issues.
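
Both rates fall out of a single labeled evaluation pass. The sketch below assumes each prompt in the curated set is tagged as harmful or benign, and that each model response has been judged as a refusal or not by human review or another method you trust; the data is invented for illustration.

```python
def refusal_rates(results):
    """Return (over_refusal_rate, under_refusal_rate).

    `results` is a list of (is_harmful, refused) boolean pairs, one per prompt.
    """
    benign = [refused for is_harmful, refused in results if not is_harmful]
    harmful = [refused for is_harmful, refused in results if is_harmful]
    if not benign or not harmful:
        raise ValueError("need at least one benign and one harmful prompt")
    over_refusal = sum(benign) / len(benign)  # benign prompts refused
    under_refusal = sum(not r for r in harmful) / len(harmful)  # harmful prompts answered
    return over_refusal, under_refusal


# Hypothetical results: four benign prompts, four harmful prompts.
results = [
    (False, False), (False, True), (False, False), (False, False),
    (True, True), (True, True), (True, False), (True, True),
]
over_rate, under_rate = refusal_rates(results)
print(f"over-refusal: {over_rate:.2f}, under-refusal: {under_rate:.2f}")
```

Comparing each rate against its pre-agreed threshold at every major checkpoint turns the blocker rule into a mechanical check.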

Diagnosing Reward Hacking

Reward hacking — the policy exploiting the reward model's weaknesses rather than genuinely improving — is the failure mode that most reliably kills RLHF projects. It's worth its own section because it requires a specific diagnostic approach.

Signs in your metrics:

Reward score rising while win rate against baseline stays flat or falls
KL divergence climbing beyond your target range
Output length distribution shifting dramatically (very long or very short responses getting high reward scores)
Entropy collapsing
Capability benchmarks degrading

When you see two or more of these together, stop training and audit a sample of high-reward outputs manually. Look for responses that are fluent and confident but contain subtle factual errors, or that answer adjacent to the question rather than to it. These are the cases where the reward model has been successfully gamed.
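
The "two or more signals" rule can be encoded as a simple gate. This is a sketch only: how each boolean is derived from your metrics is left to you, and the field names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class HackingSignals:
    """One flag per warning sign listed above; how each is set is up to you."""
    reward_up_win_rate_flat: bool
    kl_above_target: bool
    length_shift: bool
    entropy_collapse: bool
    benchmarks_degrading: bool


def should_stop_and_audit(signals, min_signals=2):
    """Return (stop, active_signal_names) using the two-or-more rule."""
    active = [f.name for f in fields(signals) if getattr(signals, f.name)]
    return len(active) >= min_signals, active


# Hypothetical checkpoint where two signals fire together.
checkpoint_signals = HackingSignals(
    reward_up_win_rate_flat=True,
    kl_above_target=False,
    length_shift=True,
    entropy_collapse=False,
    benchmarks_degrading=False,
)
stop, active = should_stop_and_audit(checkpoint_signals)
print("stop and audit:", stop, active)
```

The gate only decides when to pause; the manual audit of high-reward outputs is still what identifies how the reward model is being gamed.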

The fix is rarely to adjust a hyperparameter. Usually it requires either additional reward model training on adversarial examples or modifying the prompt distribution used for on-policy rollouts. See real-world machine learning examples for how teams have handled analogous distribution mismatch problems in production.

Building a Practical Metrics Dashboard

A dashboard you don't look at daily doesn't exist. Structure it in two tiers:

Tier 1 — Real-time (every training step):

KL divergence from reference policy
Mean and standard deviation of reward score on eval set
Policy entropy
Training loss and value loss (for PPO)

Tier 2 — Checkpoint-gated (every 5–10% of training budget):

Win rate vs. baseline (human evaluation)
Capability benchmark delta scores
Over/under-refusal rates
Full reward score distribution histogram

Log everything to a versioned experiment tracker. The machine learning basics checklist for 2026 includes experiment tracking as a foundational requirement precisely because RLHF runs are long and the signal-to-noise ratio is low — you'll want to compare runs from weeks apart.

Set automated alerts on KL divergence exceeding your target ceiling and on entropy dropping below your floor. Those two alerts tend to catch most meaningful training problems early enough to intervene.
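
A minimal sketch of those two alerts follows. The default thresholds simply reuse the ranges mentioned earlier in this article (a 15-nat KL ceiling and a 1.5-bit entropy floor) and should be treated as placeholders to tune for your own runs; the function name and message format are assumptions.

```python
def check_alerts(step, kl_nats, entropy_bits, kl_ceiling=15.0, entropy_floor=1.5):
    """Return a list of alert messages for one training step."""
    alerts = []
    if kl_nats > kl_ceiling:
        alerts.append(
            f"step {step}: KL {kl_nats:.2f} nats is above the ceiling of {kl_ceiling:.2f}"
        )
    if entropy_bits < entropy_floor:
        alerts.append(
            f"step {step}: entropy {entropy_bits:.2f} bits is below the floor of {entropy_floor:.2f}"
        )
    return alerts


# Hypothetical readings from two training steps.
healthy = check_alerts(step=1000, kl_nats=8.2, entropy_bits=2.4)
unhealthy = check_alerts(step=5000, kl_nats=21.7, entropy_bits=0.9)
print(healthy)
for message in unhealthy:
    print(message)
```

In practice the returned messages would be forwarded to whatever paging or chat tool your team already uses.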

Frequently Asked Questions

What's the most important single metric for RLHF?

Win rate against a strong baseline, evaluated by humans blind to which output came from which model, is the closest thing to a ground truth signal. Every other metric is a proxy for this one. The trouble is it's expensive, which is why you need the cheaper real-time metrics to decide when it's worth running a full human evaluation.

How do I know if my reward model is good enough to start policy training?

Held-out preference prediction accuracy above 70%, with inter-annotator agreement above 65% on your annotation task, is a reasonable minimum threshold. Below those numbers, you're training the policy on an unreliable signal, and the resulting model will be hard to interpret or improve systematically.

What does a KL divergence that's too high actually look like in practice?

Outputs become stylistically weird — unusually long, unusually hedged, or full of particular phrases that annotators happened to reward. You'll notice it in a manual audit before the metric itself alarms you, which is why regular qualitative sampling of high-reward outputs is as important as the quantitative dashboard.

Can automated metrics replace human evaluation for RLHF?

No. Automated metrics — including LLM-as-judge approaches — can screen for obvious regressions cheaply, but they have their own biases that compound with the reward model's biases. Use them to triage, not to conclude. Any major decision about a checkpoint should be grounded in human preference data.

How often should I retrain the reward model?

When your policy's output distribution shifts far enough that the reward model is regularly assigning high scores to outputs that human raters don't prefer, it's time to collect new preference data from the updated distribution and retrain. Practically, this happens after one to three rounds of policy training for most teams. Treat reward model retraining as a scheduled activity, not a reaction to obvious failure.

Key Takeaways

Separate your metrics into three layers: reward model quality, policy training dynamics, and downstream task performance — conflating them obscures where problems actually originate.
Preference prediction accuracy (target: 70%+) and inter-annotator agreement (target: 65%+) must be validated before policy training begins.
KL divergence from the reference policy is your primary real-time stability indicator; target a steady-state range of 5–15 nats and set automated alerts on the KL ceiling and the entropy floor.
Win rate against a baseline in blind human evaluation is the north star metric — everything else is a proxy for it.
Track capability benchmark delta scores at every major checkpoint; reward score improvement combined with benchmark degradation is the diagnostic signature of reward hacking.
Build a two-tier dashboard: real-time metrics for training hygiene, checkpoint metrics for go/no-go decisions.
Reward hacking is the failure mode that kills RLHF projects — audit high-reward outputs manually and regularly, not just when the numbers look wrong.

Agency Script Editorial
