Leaderboard, Private Eval, or Vibes: They Answer Different Questions

Every team that picks a model eventually argues about benchmarks. One person points to a public leaderboard showing Model A on top. Another insists their own test set tells a different story. A third says none of it matters because users only care about how the thing feels. They are all partly right, and that is the problem.

Benchmarking is not one activity. It is a family of methods that measure different things at different costs. Public leaderboards, private held-out evals, head-to-head human preference tests, and production telemetry each answer a distinct question. Treating them as interchangeable is the fastest way to ship the wrong model.

This article lays out the competing approaches, the axes that actually move a decision, and a rule for choosing among them. The goal is not a ranking of benchmarks. It is a way to match the method to the question you are trying to answer.

The Three Families of Benchmark

Most benchmarking work falls into one of three buckets. Each has a characteristic strength and a characteristic blind spot.

Public Leaderboards

These are the MMLU, GPQA, SWE-bench, and arena-style scores you see in launch posts. Their value is breadth and comparability: every model runs the same questions, so you get a cheap first-pass filter across dozens of options.

The blind spot is contamination and overfitting. Popular benchmarks leak into training data, and vendors tune against them. A high public score tells you a model is not incompetent. It does not tell you the model is good at your task.

Private Held-Out Evals

Here you build your own test set from real tasks, keep it off the public internet, and grade model outputs against it. This is the only family that directly predicts performance on your workload.

The cost is real labor. You need representative examples, a grading rubric or a grader model, and discipline to keep the set fresh. But a private eval of even 100 well-chosen cases beats a public leaderboard for any decision that matters.

Human Preference and Production Signals

A/B tests, side-by-side preference voting, and live metrics like resolution rate or edit distance capture what automated scoring misses: tone, helpfulness, and whether users actually accept the output. They are the ground truth.

They are also slow and expensive, and they only work after you have something deployed. You cannot human-test fifteen candidate models before you have narrowed the field.

The Axes That Actually Decide

When people say one benchmark is better than another, they usually mean it scores higher on one of these axes. Naming them makes the trade-off explicit.

Task fidelity — how closely the benchmark resembles your real workload. Private evals win; generic leaderboards lose.
Cost to run — public scores are free to read; human preference tests can cost thousands per round.
Speed to signal — leaderboards give an answer today; production A/B tests take weeks.
Contamination resistance — held-out private sets resist gaming; famous public sets do not.
Statistical power — a 50-item eval has wide error bars; you need a few hundred cases to detect small differences reliably.

No single method maxes out every axis. Public leaderboards win on cost and speed and lose on fidelity. Private evals win on fidelity and contamination resistance and lose on labor. The right choice depends on which axis is binding for the decision in front of you.
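To make the statistical power axis concrete, here is a minimal sketch of how the error bar on a pass rate shrinks as the eval grows. It uses the Wilson score interval, which is one common choice rather than the only one. The function name `accuracy_interval` and the 80% pass rate are illustrative assumptions, not figures from any real eval.

```python
import math


def accuracy_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical model that passes 80% of cases, at three eval sizes.
for n in (50, 100, 400):
    lo, hi = accuracy_interval(round(0.8 * n), n)
    print(f"n={n}: {lo:.3f} to {hi:.3f} (width {hi - lo:.3f})")
```

If two candidates' intervals overlap heavily at your current eval size, the eval cannot yet separate them, whatever the point estimates say.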

A Decision Rule You Can Actually Use

Here is a sequence that respects the trade-offs instead of pretending they do not exist.

Step 1: Filter With Public Scores

Use leaderboards only to eliminate. Drop models that fail the obvious bar — too slow, too expensive, clearly behind on a relevant capability. Do not use public scores to pick a winner. Use them to get from twenty candidates to four.
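The elimination step can be sketched as a plain threshold filter. Everything here is an assumption for illustration: the `Candidate` fields, the threshold names, and the example values are invented placeholders, and you would substitute whatever bars apply to your product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    public_score: float          # leaderboard score on a capability you care about
    p95_latency_s: float         # measured or vendor-reported latency
    cost_per_1k_requests: float  # in your billing currency


def shortlist(candidates, min_score, max_latency_s, max_cost):
    """Eliminate candidates that miss any bar. Survivors are not ranked."""
    survivors = [
        c for c in candidates
        if c.public_score >= min_score
        and c.p95_latency_s <= max_latency_s
        and c.cost_per_1k_requests <= max_cost
    ]
    # Alphabetical order, so the output does not imply a winner.
    return sorted(survivors, key=lambda c: c.name)


# Placeholder candidates with made-up numbers.
pool = [
    Candidate("model-a", public_score=0.81, p95_latency_s=1.2, cost_per_1k_requests=4.0),
    Candidate("model-b", public_score=0.62, p95_latency_s=0.8, cost_per_1k_requests=1.0),
    Candidate("model-c", public_score=0.79, p95_latency_s=6.5, cost_per_1k_requests=3.0),
    Candidate("model-d", public_score=0.77, p95_latency_s=1.5, cost_per_1k_requests=2.5),
]
finalists = shortlist(pool, min_score=0.75, max_latency_s=2.0, max_cost=5.0)
print([c.name for c in finalists])
```

The survivors come back in alphabetical order on purpose: the public score decides who stays in the pool, not who leads it.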

Step 2: Rank With a Private Eval

Build a held-out set from real tasks and run your four finalists. This is where the actual decision happens. Weight the cases by how often each task type appears in production so the score reflects your traffic, not an even split.
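One way to do that weighting, as a rough sketch: compute a pass rate per task type, then combine the rates using each type's share of production traffic. The function name `weighted_score`, the task types, and the traffic shares below are hypothetical.

```python
from collections import defaultdict


def weighted_score(results, traffic_share):
    """Combine per-task pass rates using production traffic shares.

    results: iterable of (task_type, passed) pairs from the eval run.
    traffic_share: dict mapping task_type to its share of production traffic.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for task_type, ok in results:
        total[task_type] += 1
        passed[task_type] += int(ok)

    missing = set(traffic_share) - set(total)
    if missing:
        raise ValueError(f"no eval cases for task types: {sorted(missing)}")

    weight_sum = sum(traffic_share.values())  # normalize in case shares do not sum to 1
    return sum(
        (share / weight_sum) * (passed[t] / total[t])
        for t, share in traffic_share.items()
    )


# Made-up run: 3 of 4 summarization cases pass, 1 of 4 extraction cases pass.
run = [("summarize", True)] * 3 + [("summarize", False)] \
    + [("extract", True)] + [("extract", False)] * 3
score = weighted_score(run, {"summarize": 0.9, "extract": 0.1})
print(f"traffic-weighted pass rate: {score:.3f}")
```

In this made-up run the raw pass rate is 4 of 8, but summarization dominates the assumed traffic, so the weighted score lands closer to the summarization pass rate. The reverse can also happen: a model that looks fine on an even split can be weak on the task type that makes up most of your traffic.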

Step 3: Confirm With Humans or Production

Take the top one or two and run a preference test or a limited A/B. This catches the failures private evals miss — outputs that are technically correct but unhelpful, or a model that scores well but feels worse to users.
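For a side-by-side preference test, a simple check is whether the vote split could plausibly be a coin flip. The sketch below uses an exact two-sided sign test, which is one reasonable choice among several. It assumes ties are dropped before counting, and the name `preference_p_value` is illustrative.

```python
from math import comb


def preference_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test for a head-to-head preference count.

    Returns the probability of a split at least this lopsided if raters
    had no real preference. Ties should be excluded before calling.
    """
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("need at least one non-tied vote")
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical vote counts from 100 non-tied comparisons.
for a, b in ((55, 45), (60, 40), (70, 30)):
    print(f"{a} vs {b}: p = {preference_p_value(a, b):.4f}")
```

A split that looks decisive in a slide, such as 55 to 45, is often well within what chance alone produces at this sample size.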

If you skip step two, you ship based on contaminated public scores. If you skip step three, you ship a model that scores well and disappoints users. The sequence exists because each method covers the previous one's blind spot.

For a deeper walkthrough of building that held-out set, see "A Step-by-Step Approach to AI Model Benchmarks." And before you commit to any single number, "7 Common Mistakes with AI Model Benchmarks (and How to Avoid Them)" covers the traps that derail this process.

Common Failure Modes

The trade-offs above get violated in predictable ways. Watch for these.

Leaderboard worship — choosing the model with the highest public score and skipping the private eval entirely. The number is real; its relevance to your task is not.
Eval set rot — building a private set once and never updating it as your product changes. A stale eval optimizes for last quarter's workload.
Underpowered comparisons — declaring a winner from a 30-case test where the margin is within the noise. If you cannot estimate the error bar, you cannot trust the ranking.
Single-metric tunnel vision — optimizing accuracy while latency or cost quietly makes the model unusable. Every benchmark needs at least one quality axis and one cost axis.
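To guard against underpowered comparisons, you can estimate an error bar on the gap between two models directly. This is a minimal sketch of a paired bootstrap, assuming both models were scored on the same cases in the same order. The function name, the resample count, and the example scores are illustrative assumptions.

```python
import random


def paired_bootstrap_diff(scores_a, scores_b, n_resamples=2000, seed=0):
    """Bootstrap the mean score difference (A minus B) over shared cases.

    Returns (observed mean difference, 2.5th percentile, 97.5th percentile).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample cases with replacement and recompute the mean difference each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Made-up 30-case run: A wins 5 cases, B wins 3, both pass the other 22.
a_scores = [1] * 5 + [0] * 3 + [1] * 22
b_scores = [0] * 5 + [1] * 3 + [1] * 22
mean_diff, lo, hi = paired_bootstrap_diff(a_scores, b_scores)
print(f"A minus B: {mean_diff:+.3f} (95% interval {lo:+.3f} to {hi:+.3f})")
```

If the interval straddles zero, as it does for this made-up 30-case run, the test has not established a winner. Resampling whole cases rather than individual scores keeps each model's results paired, which is what lets shared easy and hard cases cancel out.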

If you want to understand which metrics belong on each axis before you start, "How to Measure AI Model Benchmarks: Metrics That Matter" breaks down the KPI side in detail.

Frequently Asked Questions

Should I trust public benchmark leaderboards at all?

Yes, but only for what they are good at: cheap, broad filtering. A high score means a model is not obviously deficient. It does not predict performance on your specific tasks because popular benchmarks leak into training data and get tuned against. Use them to narrow the field, never to pick the winner.

How many examples do I need in a private eval?

Enough to detect the difference you care about. For coarse decisions, 100 well-chosen, representative cases can separate clearly different models. To detect small margins reliably, you generally want at least a few hundred, scored on the same cases for every candidate so the comparison is paired. The more similar your candidates are, the more cases you need to tell them apart with confidence.
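A rough way to turn "the difference you care about" into a case count is the textbook sample-size approximation for comparing two proportions. The sketch below assumes independent samples and a two-sided test at the stated significance level and power. That assumption tends to be conservative when both models run on the same cases, because a paired comparison usually needs fewer. The function name and the example pass rates are hypothetical.

```python
import math
from statistics import NormalDist


def cases_needed(p_a: float, p_b: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate cases per model to distinguish pass rates p_a and p_b.

    Uses the normal approximation for a two-sided two-proportion test
    with independent samples.
    """
    if p_a == p_b:
        raise ValueError("pass rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)


# Hypothetical gaps: a wide one and a narrow one.
for p_a, p_b in ((0.70, 0.90), (0.80, 0.85)):
    print(f"{p_a:.2f} vs {p_b:.2f}: about {cases_needed(p_a, p_b)} cases per model")
```

The required count grows with the inverse square of the gap, so halving the margin you want to detect roughly quadruples the cases you need.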

Why not just run a production A/B test on every model?

Because A/B tests are slow and expensive, and they only work after deployment. You cannot live-test fifteen candidates. Human preference and production signals are ground truth, so reserve them for confirming the one or two finalists your private eval already surfaced.

What if my private eval and the public leaderboard disagree?

Trust the private eval. It measures your actual workload; the leaderboard measures a generic one that may be contaminated. Disagreement is the normal and expected outcome, and it is precisely why you built the private set.

Key Takeaways

Benchmarking is three different methods — public leaderboards, private evals, and human or production signals — each answering a distinct question.
The axes that decide are task fidelity, cost, speed, contamination resistance, and statistical power. No method wins on all of them.
Use public scores only to filter, private evals to rank, and human or production tests to confirm. Each step covers the previous one's blind spot.
Trust your private eval over any public leaderboard when they disagree, and keep that eval fresh as your product changes.

Agency Script Editorial
