Self-consistency looks foolproof on paper: sample several times, vote, done. In practice, most of the failures are silent. The procedure still runs, still produces a confident-looking winner, and still feels rigorous. It is just wrong, because one of the steps quietly defeated the others. Silent failures are the dangerous kind, because nothing alerts you.
This article names seven specific ways the technique breaks, why each one happens, what it costs you, and the corrective practice. They are ordered roughly by how often teams hit them. None require deep expertise to avoid; they require knowing they exist.
If you have not run the technique before, read Running a Self-Consistency Vote, One Step at a Time first so these mistakes have context to attach to.
Mistake One: Sampling With No Randomness
Why it happens
Many people copy a working prompt that ran at temperature zero, then wrap sampling around it without touching the temperature. The result is the same answer N times.
The cost and the fix
You pay N times the cost for one answer's worth of information, and a unanimous-looking vote that means nothing. Set temperature to roughly 0.7 so samples actually diverge. Verify by eyeballing two raw samples; if they are identical, the vote is theater.
Mistake Two: Voting on Open-Ended Text
Why it happens
The technique works so well on math that people apply it to essays, summaries, and advice without noticing those answers cannot be compared for equality.
The cost and the fix
Every sample is unique, so there is no majority to find; you end up arbitrarily picking one and calling it consensus. Restrict voting to discrete, comparable answers. If the task is open-ended, extract a discrete claim or score to vote on, or use a different method entirely.
Mistake Three: Skipping Normalization
Why it happens
The answers look the same to a human, so the cosmetic differences go unnoticed: "12" versus "12.0", trailing spaces, "yes" versus "Yes".
The cost and the fix
A clear winner splits into several near-identical entries, producing a false tie or handing the win to a minority answer. Normalize before tallying: trim, lowercase where appropriate, and standardize numbers and units. This is the most common silent killer of an otherwise correct run. The insidious part is that the result still looks plausible, a close vote between two formats of the same number reads exactly like a genuinely contested answer, so nothing alerts you that the tally was corrupted.
Mistake Four: Too Few Samples
Why it happens
Cost-consciousness pushes people to two or three samples, which feels close enough.
The cost and the fix
With three samples, a single unlucky outlier can tie or win. The vote is too noisy to trust. Use at least five, more for hard problems. The principles for choosing a count appear in Sharp Habits for Voting Across Model Samples.
Mistake Five: Ignoring the Margin
Why it happens
People look only at the winning answer and discard the vote split, treating a six-four win exactly like a ten-zero one.
The cost and the fix
You lose the free confidence signal the technique hands you, and act on shaky results as if they were certain. Always record the margin and set a threshold below which you escalate or sample more. A close vote is a warning, not a verdict. The margin costs nothing to capture and is often the most valuable byproduct of the entire run, so discarding it is pure waste.
Mistake Six: Applying It Everywhere by Default
Why it happens
Once a technique works, the temptation is to make it the standard wrapper on every query.
The cost and the fix
On easy, stable questions, every sample already agrees, so you multiply cost for zero benefit. Reserve self-consistency for hard or high-stakes queries. Use the wobble test: if a single prompt already returns the same answer consistently, it does not need voting. The targeting logic is covered in Where Majority-Vote Prompting Earns Its Keep.
Mistake Seven: Letting Samples See Each Other
Why it happens
In a poorly structured pipeline, samples share context or run in a single conversation, so later samples are influenced by earlier ones.
The cost and the fix
The samples stop being independent, so their agreement no longer signals correctness; it signals herding. Each sample must be a fresh, isolated call. Independence is the property that makes the vote trustworthy, and it is easy to break without noticing.
How to Catch These Before They Bite
Inspect raw samples regularly
Most of these failures are visible the moment you actually read a handful of raw samples. Are they identical? Are the answers in the format you expect? Do the same values appear in different surface forms? A two-minute eyeball of five raw outputs catches the majority of silent failures before they reach production.
Track the vote split over time
Logging margins is not just for individual decisions; the distribution of margins across many queries is diagnostic. If a task suddenly starts producing thin margins where it used to be decisive, something changed, the model, the data, or the prompt, and the margin trend surfaces it before accuracy visibly degrades.
Build a small known-answer test set
A dozen problems with verified answers lets you confirm the whole pipeline, format, sampling, normalization, and tallying, actually produces correct winners. Run it whenever you change anything. It is the cheapest insurance against a regression that would otherwise hide behind confident-looking output.
The Pattern Behind the Mistakes
Most failures are silent by construction
Notice the common thread: nearly every mistake here produces a result that still looks rigorous. Identical samples yield a confident unanimous vote. Unnormalized tallies yield a plausible close vote. Voting on prose yields a winner that is just an arbitrary pick. The technique almost never errors out; it quietly returns the wrong thing. That is why active inspection matters so much more here than with techniques that fail loudly.
Defense is cheaper than detection
Each fix is trivial to apply up front and painful to retrofit after a bad decision has propagated downstream. Setting temperature correctly, restricting to discrete answers, normalizing, and keeping samples independent are all one-time design choices. Detecting a corrupted tally weeks later, after it has fed bad data into something else, costs far more. The economics strongly favor doing it right at setup.
Frequently Asked Questions
Which mistake is the most damaging?
Voting on open-ended text and skipping normalization tie for worst, because both produce a confident result that is meaningless. The first applies the technique where it cannot work; the second corrupts a tally that would otherwise be correct.
How do I tell if my samples are actually independent?
Trace your pipeline and confirm each model call starts from the same fixed prompt with no shared conversation history. If a sample can reference what a previous one said, independence is broken.
Is more samples always safer?
Up to a point. Beyond about ten, the winner rarely changes, so extra samples mostly add cost and latency. The right move is enough samples for a stable margin, not the maximum you can afford.
Why does normalization matter so much?
Because the tally compares answers for exact equality. Any cosmetic difference the model introduces, like a unit or a capital letter, counts as a separate answer and dilutes the real winner. Normalization makes equal answers actually equal.
Can I use self-consistency on classification tasks?
Yes, and it works well, because labels are discrete and comparable. Just normalize label strings so "Spam" and "spam" do not split the vote, and you avoid mistake three on the task it is most prone to.
What is the wobble test?
Run a single prompt a few times. If the answer stays the same, the question is stable and does not need voting. If it changes, that instability is exactly what self-consistency is designed to resolve. It is a quick way to decide whether the technique is worth the cost.
Key Takeaways
- Most self-consistency failures are silent: the run completes and looks rigorous while being wrong.
- Sampling at temperature zero gives identical answers and a meaningless vote; use about 0.7.
- Voting only works on discrete, comparable answers, never on open-ended text.
- Normalize answers before tallying, or a clear winner fractures into a false tie.
- Use at least five samples, record the margin, and escalate on close votes.
- Reserve the technique for hard or high-stakes queries, and keep every sample independent.