A single answer from a language model is a sample, not a verdict. Ask the same reasoning question twice with a little randomness and you may get two different chains of thought leading to two different conclusions. Most of the time that variability is a nuisance. Self-consistency turns it into an asset.
The core idea is simple enough to state in a sentence: instead of generating one chain of reasoning and taking its answer, generate many independent chains and take the answer they agree on most. The reasoning paths differ, but correct reasoning tends to converge on the same destination while wrong reasoning scatters. A majority vote across samples filters out the scatter.
This guide covers the technique end to end. It explains the mechanism, when the payoff justifies the extra cost, how to implement it step by step, and the failure modes that catch teams off guard. It assumes you already know what chain-of-thought prompting is; if a term is unfamiliar, the sections below define it before using it.
What Self-Consistency Actually Is
The mechanism in plain terms
Self-consistency has three moving parts. First, you prompt the model to reason step by step toward an answer. Second, you sample that prompt several times with a non-zero temperature so each run explores a different path. Third, you extract the final answer from each run and pick the one that appears most often. The reasoning is discarded; only the votes count.
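The three parts can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `sample_fn` is a hypothetical callable standing in for whatever model client you use (called with a non-zero temperature), and the sketch assumes each completion ends with an "Answer: <value>" line.

```python
from collections import Counter
from typing import Callable, Optional


def final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if there is none."""
    for line in reversed(completion.splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return None


def self_consistency(sample_fn: Callable[[str], str], prompt: str, n: int = 5) -> Optional[str]:
    """Sample the prompt n times, keep only the final answers, return the mode."""
    answers = [final_answer(sample_fn(prompt)) for _ in range(n)]
    # Runs without a parseable answer cast no vote.
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Canned completions stand in for five stochastic model calls (assumption).
    canned = iter([
        "3 boxes of 4 is 12.\nAnswer: 12",
        "4 + 4 + 4 = 12.\nAnswer: 12",
        "3 + 4 = 7.\nAnswer: 7",
        "Four per box, three boxes: 12.\nAnswer: 12",
        "3 * 4 = 14.\nAnswer: 14",
    ])
    print(self_consistency(lambda _prompt: next(canned), "How many items?", n=5))
```

The reasoning text never reaches the tally; only the extracted answers do.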
Why agreement signals correctness
For problems with a discrete answer, there are usually many ways to reach the right one and comparatively few, idiosyncratic ways to reach each wrong one. So correct answers cluster and incorrect answers spread thin. Taking the mode of the distribution is a cheap, effective way to recover the signal from the noise.
What it is not
It is not running the same deterministic prompt repeatedly; with temperature at zero you get essentially the same output every time and learn nothing. It is also not asking the model to double-check within a single response, which is a different and weaker technique because the recheck shares all the blind spots of the original answer.
The intuition behind it
Picture the space of all the ways a model could reason about a problem as a branching set of paths. For a well-posed question, a large share of those paths arrive at the correct answer, even though they wander through different intermediate steps. The wrong answers are reached by comparatively few, idiosyncratic paths, and they tend to disagree with each other. When you sample many paths and look at where they land, the correct answer shows up as a dense cluster and the wrong answers as scattered singletons. The vote is just a way of measuring that density.
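A toy simulation can make the density picture concrete. Everything in it is an assumption chosen for illustration: each sampled path lands on the correct answer with probability `p_correct`, and otherwise on one of `n_wrong` distinct wrong answers chosen uniformly. Real models do not behave this neatly, so treat the output as an intuition aid rather than a measurement.

```python
import random
from collections import Counter
from typing import Tuple


def simulate(p_correct: float, n_wrong: int, n_samples: int,
             trials: int, seed: int = 0) -> Tuple[float, float]:
    """Return (single-pass accuracy, majority-vote accuracy) for the toy model."""
    rng = random.Random(seed)

    def draw() -> str:
        # Correct paths converge on one answer; wrong paths scatter.
        if rng.random() < p_correct:
            return "correct"
        return f"wrong-{rng.randrange(n_wrong)}"

    single_hits = 0
    vote_hits = 0
    for _ in range(trials):
        samples = [draw() for _ in range(n_samples)]
        single_hits += samples[0] == "correct"
        winner = Counter(samples).most_common(1)[0][0]
        vote_hits += winner == "correct"
    return single_hits / trials, vote_hits / trials


if __name__ == "__main__":
    single, voted = simulate(p_correct=0.5, n_wrong=10, n_samples=9, trials=2000)
    print(f"single pass: {single:.2f}  majority vote: {voted:.2f}")
```

Under these assumptions the vote beats a single pass by a wide margin, because the wrong answers rarely agree with each other. If the wrong paths instead concentrated on one shared mistake, the advantage would shrink or vanish.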
When the Technique Pays Off
Tasks with verifiable, discrete answers
Arithmetic, multi-step word problems, structured extraction, and classification all have answers you can compare for equality. That comparability is what makes voting possible. For a deeper look at concrete scenarios, see Where Majority-Vote Prompting Earns Its Keep.
High-stakes single decisions
When one wrong answer is expensive, paying for five or ten samples to raise confidence is an easy trade. The cost is linear in sample count; the error reduction is often steep on hard problems.
When a single pass is unstable
If you run a prompt a few times and the answer wobbles, that instability is exactly the condition self-consistency was built for. A stable prompt does not need it.
How to Run It
Build the base reasoning prompt
Write a prompt that asks for explicit step-by-step reasoning and a clearly delimited final answer, such as "Answer: <value>" on its own line. Clean extraction depends on a predictable answer format.
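One possible shape for such a prompt is sketched below. The wording of the template and the `build_prompt` helper are illustrative assumptions, not a prescribed format; the only part taken from the text above is the "Answer: <value>" line.

```python
# Illustrative template (assumption): explicit reasoning request plus a
# fixed final-answer line that a parser can find reliably.
PROMPT_TEMPLATE = """Solve the problem below.
Think step by step and show your reasoning.
Finish with the final answer on its own line, formatted exactly as:
Answer: <value>

Problem: {question}
"""


def build_prompt(question: str) -> str:
    """Fill the template with a single question."""
    return PROMPT_TEMPLATE.format(question=question.strip())


if __name__ == "__main__":
    print(build_prompt("A crate holds 3 boxes of 4 items. How many items in total?"))
```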
Sample with temperature
Run that prompt five to ten times with temperature around 0.7. The randomness is the point; it produces the diversity of reasoning paths that voting relies on. A complete worked sequence lives in Running a Self-Consistency Vote, One Step at a Time.
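The sampling step might look like the sketch below. `generate` is a hypothetical callable with the signature `(prompt, temperature) -> str`; swap in your own client call. `fake_generate` exists only so the example runs without a model.

```python
import random
from typing import Callable, List

# Hypothetical client signature (assumption): (prompt, temperature) -> completion.
GenerateFn = Callable[[str, float], str]


def sample_completions(generate: GenerateFn, prompt: str,
                       n: int = 5, temperature: float = 0.7) -> List[str]:
    """Call the model n times with the same prompt and a non-zero temperature."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if temperature <= 0:
        # At zero the runs repeat each other and the vote carries no information.
        raise ValueError("temperature must be greater than 0")
    return [generate(prompt, temperature) for _ in range(n)]


def fake_generate(prompt: str, temperature: float) -> str:
    """Stand-in for a real model client: returns a canned completion at random."""
    return random.choice([
        "3 * 4 = 12.\nAnswer: 12",
        "4 + 4 + 4 = 12.\nAnswer: 12",
        "3 + 4 = 7.\nAnswer: 7",
    ])


if __name__ == "__main__":
    for completion in sample_completions(fake_generate, "How many items?", n=5):
        print(completion.splitlines()[-1])
```

Rejecting a zero temperature up front is a design choice in this sketch; it turns a silent misconfiguration into an immediate error.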
Extract and tally
Parse the final answer from each run, normalize it (trim whitespace, standardize units), and count occurrences. The most frequent answer wins. Record the vote split, because a 6-4 win means something very different from a 10-0 one.
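A sketch of extraction, normalization, and tallying follows. The regular expression and the normalization rules (lowercasing, dropping thousands separators, canonicalizing plain numbers) are assumptions that suit numeric answers; other answer types need their own rules.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# Matches a line of the form "Answer: <value>" (case-insensitive).
ANSWER_RE = re.compile(r"^\s*Answer:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def extract_answer(completion: str) -> Optional[str]:
    """Return the last 'Answer:' value in the completion, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


def normalize(answer: str) -> str:
    """Map cosmetic variants such as '1,200', '1200.0' and '1200.' to one key."""
    text = answer.strip().lower().rstrip(".")
    try:
        number = float(text.replace(",", ""))
    except ValueError:
        return " ".join(text.split())  # non-numeric: just collapse whitespace
    return str(int(number)) if number.is_integer() else str(number)


def tally(completions: List[str]) -> Tuple[Optional[str], Counter]:
    """Return (winner, vote split); runs with no parseable answer cast no vote."""
    votes: Counter = Counter()
    for completion in completions:
        raw = extract_answer(completion)
        if raw is not None:
            votes[normalize(raw)] += 1
    winner = votes.most_common(1)[0][0] if votes else None
    return winner, votes


if __name__ == "__main__":
    runs = ["...\nAnswer: 1,200", "...\nAnswer: 1200.0", "...\nanswer: 1200.",
            "...\nAnswer: 900", "no final line"]
    winner, votes = tally(runs)
    print(winner, dict(votes))
```

Returning the full `Counter` alongside the winner keeps the vote split available for logging.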
Reading the Vote
Margin as a confidence signal
The spread between the top answer and the runner-up is a usable confidence measure. A landslide suggests the model is sure; a near-tie suggests the problem is genuinely hard or the prompt is ambiguous, and may warrant escalation. Agreement is not proof, though: a systematic error that every path shares can still win unanimously.
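The margin can be computed directly from the vote counts. The field names in this sketch (`top_share`, `margin`, `margin_share`) are illustrative choices, not standard terminology.

```python
from collections import Counter
from typing import Dict


def vote_margin(votes: Counter) -> Dict[str, float]:
    """Summarize how decisively the top answer beat the runner-up."""
    total = sum(votes.values())
    if total == 0:
        raise ValueError("no votes to summarize")
    ranked = votes.most_common(2)
    top = ranked[0][1]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return {
        "top_share": top / total,                   # fraction of votes for the winner
        "margin": top - runner_up,                  # lead in raw votes
        "margin_share": (top - runner_up) / total,  # lead as a fraction of all votes
    }


if __name__ == "__main__":
    print(vote_margin(Counter({"42": 6, "41": 4})))  # narrow win
    print(vote_margin(Counter({"42": 10})))          # unanimous
```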
When to add more samples
If the top two answers are close, drawing additional samples can break the tie. Set a rule in advance, such as "if the margin is one vote, sample five more," so the decision is not improvised. Improvised sampling, where you keep drawing until you like the result, reintroduces exactly the bias the technique was meant to remove.
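A pre-committed rule can be encoded so that nobody decides mid-run whether to keep sampling. In this sketch `draw_answer` is a hypothetical zero-argument callable that returns one already-extracted answer, and the defaults (five initial samples, five extra, one extra round at most) are assumptions that mirror the example rule above.

```python
from collections import Counter
from typing import Callable, Tuple


def margin(votes: Counter) -> int:
    """Lead of the top answer over the runner-up, in raw votes."""
    ranked = votes.most_common(2)
    if not ranked:
        return 0
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return ranked[0][1] - runner_up


def vote_with_escalation(draw_answer: Callable[[], str], initial: int = 5,
                         extra: int = 5, min_margin: int = 2,
                         max_rounds: int = 1) -> Tuple[str, Counter]:
    """Vote once; if the margin is below min_margin, draw a fixed extra batch.

    All thresholds are fixed before any sample is seen, and the number of
    extra rounds is capped, so sampling cannot continue until a preferred
    answer happens to win.
    """
    votes = Counter(draw_answer() for _ in range(initial))
    for _ in range(max_rounds):
        if margin(votes) >= min_margin:
            break
        votes.update(draw_answer() for _ in range(extra))
    return votes.most_common(1)[0][0], votes


if __name__ == "__main__":
    # Canned answers (assumption): the first five split 3-2, triggering one extra batch.
    canned = iter(["a", "a", "a", "b", "b", "a", "a", "a", "a", "b"])
    print(vote_with_escalation(lambda: next(canned)))
```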
What a fractured vote tells you
Sometimes no answer commands a majority and the samples spread across many distinct values. That pattern is itself a signal: the problem is underspecified, too hard for the model, or outside its competence. Adding samples will not consolidate a vote that is fractured for those reasons. Treat heavy fragmentation as a prompt to rephrase the question, add context, or route the case elsewhere rather than as noise to push through.
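Fragmentation can be flagged automatically before anyone acts on the winner. The labels and thresholds below are assumptions picked for illustration (a majority means more than half the votes, fractured means the top answer holds 30% or less); tune them against your own data.

```python
from collections import Counter


def classify_vote(votes: Counter, majority_share: float = 0.5,
                  fractured_share: float = 0.3) -> str:
    """Label a vote split as 'majority', 'contested', 'fractured', or 'no_answers'."""
    total = sum(votes.values())
    if total == 0:
        return "no_answers"
    top_share = votes.most_common(1)[0][1] / total
    if top_share > majority_share:
        return "majority"
    if top_share <= fractured_share:
        # Votes are spread thin: rephrase, add context, or route elsewhere
        # instead of drawing more samples.
        return "fractured"
    return "contested"


if __name__ == "__main__":
    print(classify_vote(Counter({"a": 7, "b": 2, "c": 1})))
    print(classify_vote(Counter({"a": 4, "b": 4, "c": 2})))
    print(classify_vote(Counter({"a": 2, "b": 2, "c": 2, "d": 2, "e": 1, "f": 1})))
```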
Costs and Trade-Offs
The obvious cost
You pay roughly N times the tokens for N samples. On easy problems that is pure waste, which is why self-consistency should be reserved for hard or high-value queries rather than applied everywhere by default. The best practices guide covers how to target it.
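A back-of-envelope estimate makes the linear cost explicit. The sketch assumes each sample is a separate call that resends the full prompt, and the token counts and per-1,000-token prices in the example are made-up placeholders, not any provider's actual rates.

```python
def sampling_cost(prompt_tokens: int, completion_tokens: int, n: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimated cost of n independent samples of the same prompt."""
    per_call = (prompt_tokens / 1000 * price_in_per_1k
                + completion_tokens / 1000 * price_out_per_1k)
    return n * per_call


if __name__ == "__main__":
    # Placeholder numbers (assumption): 500 prompt tokens, 300 completion tokens.
    one = sampling_cost(500, 300, n=1, price_in_per_1k=0.001, price_out_per_1k=0.002)
    ten = sampling_cost(500, 300, n=10, price_in_per_1k=0.001, price_out_per_1k=0.002)
    print(f"1 sample: {one:.4f}  10 samples: {ten:.4f}")
```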
The hidden costs
Latency rises with sample count unless you parallelize. Extraction bugs can silently corrupt the tally. And free-form answers that resist normalization can fracture the vote across cosmetic variations. Budget design time for these, not just compute.
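Because the samples do not depend on each other, they can be requested concurrently. The sketch below uses a thread pool from the Python standard library, which suits I/O-bound network calls; `generate` is again a hypothetical stand-in for your client, and `slow_fake_generate` only simulates latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def sample_in_parallel(generate: Callable[[str], str], prompt: str,
                       n: int = 5, max_workers: int = 5) -> List[str]:
    """Issue n model calls concurrently and return the completions."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(generate, prompt) for _ in range(n)]
        # result() re-raises any exception from the worker, so failures are not silent.
        return [future.result() for future in futures]


def slow_fake_generate(prompt: str) -> str:
    """Stand-in for a network call with noticeable latency (assumption)."""
    time.sleep(0.2)
    return "Some reasoning.\nAnswer: 12"


if __name__ == "__main__":
    start = time.perf_counter()
    completions = sample_in_parallel(slow_fake_generate, "How many items?", n=5)
    elapsed = time.perf_counter() - start
    print(f"{len(completions)} completions in {elapsed:.2f}s")
```

With enough workers, wall-clock time approaches that of the slowest single call rather than the sum of all calls, subject to whatever rate limits your provider enforces.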
Comparing it to the alternatives
Before reaching for self-consistency, it is worth knowing what else is on the table. A simpler chain-of-thought prompt costs one call and is enough when the single pass is already stable. A larger or stronger model can raise accuracy but carries migration cost and may not fix variance-driven errors specifically. Self-consistency sits between these: it keeps your existing model and prompt, adds only a sampling-and-voting wrapper, and targets variance directly. That makes it the right first move when your errors come from inconsistency rather than from a fundamental gap in capability.
How It Relates to Other Techniques
It wraps rather than replaces
Self-consistency does not change your prompt; it changes how you use the prompt's outputs. Whatever sophistication lives in the base prompt (few-shot examples, retrieved context, careful instructions) is preserved. The wrapper sits outside, sampling that prompt and aggregating its answers. This is why it composes so cleanly with almost everything else in the prompting toolkit.
It trades tokens for reliability, deliberately
The whole technique is a single trade: spend more tokens to buy lower variance. That trade is only worth making where variance is actually hurting you and where errors are expensive. Understanding it as a deliberate, targeted trade, rather than a default to apply everywhere, is what separates teams that benefit from the technique from teams that just spend more.
Frequently Asked Questions
How many samples should I generate?
Five to ten is the common range. Below five, a single outlier can swing the vote; above ten, returns diminish quickly on most tasks. Tune by watching where added samples stop changing the winner, then stop a little past that point.
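One way to watch where added samples stop changing the winner is to replay a recorded run and track the leader after each sample. This is a sketch under two assumptions: answers are already extracted and normalized, and a tie leaves the current leader in place.

```python
from collections import Counter
from typing import List


def winner_by_prefix(answers: List[str]) -> List[str]:
    """Leader after 1, 2, ..., len(answers) samples; ties keep the incumbent."""
    votes: Counter = Counter()
    winners: List[str] = []
    leader = None
    for answer in answers:
        votes[answer] += 1
        # Only the answer just counted can overtake, and only with a strict lead.
        if leader is None or votes[answer] > votes[leader]:
            leader = answer
        winners.append(leader)
    return winners


def stable_after(answers: List[str]) -> int:
    """Smallest k such that the leader never changes from sample k onward."""
    winners = winner_by_prefix(answers)
    k = len(winners)
    while k > 1 and winners[k - 2] == winners[-1]:
        k -= 1
    return k


if __name__ == "__main__":
    recorded = ["7", "12", "12", "7", "12", "12", "12", "12"]
    print(winner_by_prefix(recorded))
    print(stable_after(recorded))
```

Running this over a batch of representative questions, rather than one, gives a more trustworthy picture of where to set the sample count.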
What temperature works best?
Around 0.7 is a reliable default. Too low and the samples are near-identical, defeating the purpose; too high and reasoning degrades into noise. The goal is diverse but still competent reasoning paths.
Does self-consistency work for open-ended generation?
Not directly. Voting needs answers you can compare for equality, and two essays are never equal. You can adapt the idea by extracting a discrete claim or score to vote on, but raw long-form text is the wrong target.
Is this the same as ensembling?
It is a close cousin. Classic ensembling combines different models; self-consistency combines different samples from one model. Both reduce variance by aggregating, but self-consistency needs only a single model and a temperature knob.
How is this different from asking the model to verify its answer?
Self-verification happens inside one response and shares that response's blind spots. Self-consistency draws independent samples, so a mistake in one path does not contaminate the others. Independence is what gives voting its power.
Can I combine it with other techniques?
Yes. It layers cleanly on top of chain-of-thought, few-shot prompting, and retrieval. The base prompt can be as sophisticated as you like; self-consistency simply wraps sampling and voting around whatever prompt you already trust.
Key Takeaways
- Self-consistency generates several reasoning paths and takes the majority answer instead of trusting one pass.
- Correct reasoning converges while wrong reasoning scatters, so the mode of the samples recovers the signal.
- It fits tasks with discrete, comparable answers, especially when the decision is high-stakes or a single pass gives unstable results.
- Implement it with explicit reasoning, a parseable answer format, temperature near 0.7, and five to ten samples.
- The vote margin doubles as a confidence signal worth logging and acting on.
- The technique costs N times the tokens, so target it at hard or high-value queries rather than everything.