Running Plays That Make Models Reason in Steps

A technique becomes useful when it stops being a clever trick and becomes a repeatable play your team can run on demand. Most teams know what chain-of-thought is. Far fewer have a clear answer to the operational questions: which reasoning pattern to reach for on a given kind of request, what triggers it, who maintains it, and in what order the steps fire.

This playbook treats multi-step reasoning prompts as an operating system rather than a single instruction. Each play below has a purpose, a trigger that tells you when to run it, an owner who keeps it healthy, and a place in the sequence. You can adopt one play or wire several together into a pipeline.

The plays assume you already understand the basics. If you are new to the subject, start with Multi-step Reasoning Prompts: A Beginner's Guide and come back when you are ready to operationalize.

Play 1: Decompose Before Answering

Purpose: Break a tangled request into ordered sub-questions so the model handles one decision at a time.

Trigger: Run this when a request bundles multiple constraints or asks for an answer that depends on several facts being established first.

The model first lists the sub-questions, then answers each in order, then synthesizes. This prevents the common failure where a model races to a conclusion while ignoring a constraint buried in the prompt.

How It Looks

Instruction: "List the questions you must answer to solve this, then answer each, then give the final result."
Owner: the prompt engineer who maintains the request template.
Output: an ordered analysis ending in a synthesized answer.
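
As a rough illustration of how such a template might be maintained in code, the sketch below wraps a request in the instruction quoted above. The `FINAL_MARKER` convention and both function names are assumptions made for this example, not part of any library; the marker is only there so the synthesized answer can be separated from the sub-question analysis.

```python
from typing import Optional

# Assumed convention: the reply labels its last section with this marker.
FINAL_MARKER = "Final result:"

# The instruction text comes from the play description.
INSTRUCTION = (
    "List the questions you must answer to solve this, "
    "then answer each, then give the final result."
)


def build_decompose_prompt(request: str) -> str:
    """Wrap a raw request in the decompose-before-answering template."""
    return (
        f"{INSTRUCTION}\n"
        f"Begin the last section with '{FINAL_MARKER}'.\n\n"
        f"Request:\n{request.strip()}"
    )


def extract_final_answer(reply: str) -> Optional[str]:
    """Return the text after the last marker, or None if the marker is missing."""
    _, found, tail = reply.rpartition(FINAL_MARKER)
    return tail.strip() if found else None
```

Returning `None` when the marker is absent lets the caller retry or escalate instead of treating the whole analysis as the answer.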

Play 2: Plan-Then-Execute Across Turns

Purpose: Separate the planning of a task from its execution so the plan can be reviewed before any work happens.

Trigger: Fire this for expensive or irreversible actions—code changes, document generation, anything where executing a bad plan costs real time.

In turn one, the model produces only a plan. A human or a second model reviews it. In turn two, the approved plan is executed step by step. The separation gives you a checkpoint between thinking and doing.

Why the Split Matters

When planning and execution share one prompt, a flawed plan gets executed before anyone notices. Splitting them inserts a natural gate. For agency teams, this gate is where a reviewer catches scope problems early. See The Complete Guide to Multi-step Reasoning Prompts for the underlying mechanics.
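
The two-turn split with a gate in between can be sketched as follows. This is a minimal illustration under stated assumptions: `call_model` stands in for whatever client your stack uses, `approve` stands in for the human or second-model reviewer, and the prompt wording and the `PlanRejected` exception are hypothetical.

```python
from typing import Callable, List


class PlanRejected(Exception):
    """Raised when the reviewer rejects the plan, so nothing gets executed."""


def plan_then_execute(
    task: str,
    call_model: Callable[[str], str],   # hypothetical: prompt in, reply out
    approve: Callable[[str], bool],     # hypothetical: human or second model
) -> List[str]:
    # Turn one: ask for a plan only.
    plan = call_model(
        "Produce only a numbered plan for this task. Do not execute it.\n\n"
        f"Task: {task}"
    )

    # Review gate: execution never starts without approval.
    if not approve(plan):
        raise PlanRejected("Plan rejected at the review gate.")

    # Turn two: execute the approved plan one step at a time.
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    results = []
    for step in steps:
        results.append(
            call_model(
                f"Task: {task}\nApproved plan:\n{plan}\n\n"
                f"Execute only this step: {step}"
            )
        )
    return results
```

Raising an exception at the gate, rather than returning an empty result, makes a rejected plan hard to ignore in calling code.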

Play 3: Verify the Draft

Purpose: Catch errors the first pass missed by running an explicit review against criteria.

Trigger: Use it whenever an output is high-stakes or fact-dependent—client-facing copy, calculations, compliance summaries.

After the model produces a draft, a second prompt asks it to check the draft against specific criteria: factual support, constraint satisfaction, internal consistency. The model returns either a pass or a list of issues to fix.

Keeping Verification Honest

Give the verifier concrete criteria, not "check if this is good."
Consider using a fresh context so the verifier is not anchored on the draft's reasoning.
Log the verification results so you can see how often the first pass fails.
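
One way to make the criteria concrete and the result loggable is sketched below. The criteria wording, the `PASS` / `ISSUE:` reply convention, and the function names are assumptions for illustration; adapt them to your own templates.

```python
from typing import List, Sequence

# Example criteria, mirroring the three checks named in this play.
CRITERIA = (
    "Every factual claim is supported by the provided source material.",
    "Every constraint in the original request is satisfied.",
    "The draft does not contradict itself.",
)


def build_verifier_prompt(
    request: str, draft: str, criteria: Sequence[str] = CRITERIA
) -> str:
    """Build a review prompt that names each criterion explicitly."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    return (
        "Check the draft against each criterion below.\n"
        "Reply with the single word PASS if all are met. Otherwise list "
        "one issue per line, each starting with 'ISSUE:'.\n\n"
        f"Criteria:\n{numbered}\n\nRequest:\n{request}\n\nDraft:\n{draft}"
    )


def parse_verdict(reply: str) -> List[str]:
    """Return the issues found; an empty list means the draft passed."""
    issues = [
        line.split("ISSUE:", 1)[1].strip()
        for line in reply.splitlines()
        if line.strip().startswith("ISSUE:")
    ]
    if not issues and reply.strip().upper() != "PASS":
        # An unparseable reply counts as a failure, never as a silent pass.
        return ["Verifier reply could not be parsed."]
    return issues
```

Sending the prompt in a fresh conversation, without the drafting turn's reasoning, is what keeps the verifier from anchoring on the draft.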

Play 4: Sample and Vote

Purpose: Improve accuracy on problems with one correct answer by reasoning multiple times and taking the majority.

Trigger: Reserve this for high-value, low-volume decisions where being right matters more than being cheap.

You run the same reasoning prompt several times with sampling randomness enabled (for example, a nonzero temperature), collect the final answers, and select the most common one. Divergent answers are themselves a signal: they tell you the problem is genuinely uncertain and may need a human.
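
A minimal sketch of the voting step, assuming a hypothetical `sample` callable that returns one final answer per call. The `min_agreement` threshold is an illustrative default, not a recommended value.

```python
from collections import Counter
from typing import Callable, Tuple


def sample_and_vote(
    prompt: str,
    sample: Callable[[str], str],   # hypothetical: one sampled answer per call
    n: int = 5,
    min_agreement: float = 0.6,     # assumed threshold for escalation
) -> Tuple[str, float, bool]:
    """Return (majority answer, agreement ratio, needs_human flag)."""
    # Normalize lightly so "42" and "42 " count as the same vote.
    answers = [sample(prompt).strip().lower() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    agreement = votes / n
    # Low agreement is the divergence signal: route to a person.
    return winner, agreement, agreement < min_agreement
```

The agreement ratio is worth logging alongside the winning answer, since a falling ratio over time suggests the prompt or the inputs have shifted.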

The Cost Reality

Cost scales with the number of samples: five votes cost roughly five times a single call. This play earns its keep on a handful of consequential calls, not on all of your traffic. Pair it with routing so only the right requests reach it.

Play 5: Route by Difficulty

Purpose: Spend reasoning effort where it pays off and nowhere else.

Trigger: This is the always-on play that sits in front of the others.

A lightweight classifier or cheap model labels each incoming request as easy or hard. Easy requests get a direct prompt. Hard requests get the appropriate reasoning play above. This keeps latency and cost down across the bulk of traffic while preserving accuracy on the cases that need it.
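
The routing step might start as something as simple as the heuristic below. The keyword list and word-count threshold are placeholder assumptions; in practice the label would more likely come from a trained classifier or a cheap model, with the same easy/hard contract.

```python
import re

# Hypothetical signals that a request bundles conditions or calculations.
HARD_HINTS = re.compile(
    r"\b(calculate|compare|plan|step|prove|if|unless|except)\b",
    re.IGNORECASE,
)


def route(request: str, max_easy_words: int = 40) -> str:
    """Label a request 'easy' (direct prompt) or 'hard' (reasoning play)."""
    word_count = len(request.split())
    hint_count = len(HARD_HINTS.findall(request))
    if word_count > max_easy_words or hint_count >= 2:
        return "hard"
    return "easy"
```

Logging each label next to the eventual outcome gives the owner the data needed to spot misroutes in both directions.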

Ownership and Maintenance

Routing rules drift as your inputs change. Assign an owner to review misroutes monthly—requests that got a direct prompt but needed reasoning, and the reverse. The common mistakes guide covers the failure modes that bad routing causes.

Sequencing the Plays Into a Pipeline

The plays compose. A mature pipeline often runs them in this order:

1. Route by difficulty to decide whether reasoning is needed at all.
2. Decompose the hard request into ordered sub-questions.
3. Plan-then-execute if the task involves action, with a review gate between.
4. Verify the draft against explicit criteria.
5. Sample and vote only when the stakes justify the multiplier.

You will rarely run all five on one request. The point is to know which plays exist, what each costs, and when to fire it—so the choice is deliberate rather than habitual.
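
The ordering can be expressed as a small selection function so that the choice of plays is explicit in code. The four boolean flags are assumptions about what your router and request metadata can supply, and the rules are one plausible reading of the triggers described for each play.

```python
from typing import List


def select_plays(
    is_hard: bool,
    involves_action: bool,
    high_stakes: bool,
    single_correct_answer: bool,
) -> List[str]:
    """Return the plays to run, in pipeline order, for one request."""
    # Routing comes first: easy requests skip reasoning entirely.
    if not is_hard:
        return ["direct"]

    plays = ["decompose"]
    if involves_action:
        plays.append("plan_then_execute")
    if high_stakes:
        plays.append("verify")
    # Voting only helps when answers can be compared for equality,
    # and only pays off when the stakes justify the multiplier.
    if high_stakes and single_correct_answer:
        plays.append("sample_and_vote")
    return plays
```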

Assigning Owners

Every play needs an owner who maintains its template, watches its metrics, and updates it when the model changes. Without an owner, prompts rot quietly: a model update shifts behavior and nobody notices until quality complaints arrive.

Instrumenting the Playbook

A playbook you cannot measure is a guess. Instrument each play with three signals:

Quality: accuracy against a held-out evaluation set.
Cost: tokens and dollars per request, by play.
Latency: end-to-end time, since reasoning adds meaningful delay.
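
A minimal in-memory sketch of recording the three signals per play is shown below. The `PlayMetrics` class and the `usd_per_1k_tokens` parameter are hypothetical; a real setup would write to your existing metrics store and use your provider's actual pricing.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class PlayMetrics:
    """Collects (correct, tokens, latency) rows per play."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[Tuple[bool, int, float]]] = defaultdict(list)

    def record(self, play: str, correct: bool, tokens: int, latency_s: float) -> None:
        self._rows[play].append((correct, tokens, latency_s))

    def summary(self, play: str, usd_per_1k_tokens: float) -> Dict[str, float]:
        rows = self._rows.get(play, [])
        if not rows:
            raise ValueError(f"no requests recorded for play {play!r}")
        n = len(rows)
        tokens_per_request = sum(t for _, t, _ in rows) / n
        return {
            "requests": n,
            # Quality: share of evaluation items answered correctly.
            "accuracy": sum(1 for c, _, _ in rows if c) / n,
            # Cost: tokens and dollars per request.
            "tokens_per_request": tokens_per_request,
            "usd_per_request": tokens_per_request / 1000 * usd_per_1k_tokens,
            # Latency: mean end-to-end seconds.
            "mean_latency_s": sum(s for _, _, s in rows) / n,
        }
```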

Reviewing the Numbers

Hold a short monthly review where the owners look at these signals together. A play whose cost climbed without a quality gain is a candidate for trimming. A play that quietly degraded after a model update needs a refreshed prompt. This review is what keeps the playbook alive instead of letting it become documentation nobody trusts.
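
The two review rules above can be turned into a simple check that compares one period's summary with the previous one. The thresholds here are illustrative assumptions, and the dictionary keys (`accuracy`, `usd_per_request`) are hypothetical names for whatever your metrics export provides. The sketch also assumes the previous cost is nonzero.

```python
from typing import Dict, List


def review_play(
    previous: Dict[str, float],
    current: Dict[str, float],
    cost_tolerance: float = 0.10,    # assumed: ignore cost rises under 10%
    min_quality_gain: float = 0.01,  # assumed: gain needed to justify cost
    max_quality_drop: float = 0.02,  # assumed: drop that signals degradation
) -> List[str]:
    """Return review flags for one play: 'trim', 'refresh', both, or none."""
    flags = []
    quality_delta = current["accuracy"] - previous["accuracy"]
    cost_ratio = current["usd_per_request"] / previous["usd_per_request"]

    # Cost climbed without a matching quality gain: candidate for trimming.
    if cost_ratio > 1 + cost_tolerance and quality_delta < min_quality_gain:
        flags.append("trim")
    # Quality fell, for example after a model update: refresh the prompt.
    if quality_delta < -max_quality_drop:
        flags.append("refresh")
    return flags
```

The flags are a starting point for the owners' discussion, not a replacement for it.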

Frequently Asked Questions

How many plays should a small team start with?

Two: route by difficulty and decompose before answering. Together they cover most of the value—you stop wasting reasoning on easy requests and you handle hard ones more reliably. Add verification and voting once you have traffic that justifies the extra cost.

Who should own the reasoning playbook?

A single prompt engineer or a small prompt operations group. The owner maintains templates, watches metrics, and refreshes prompts when models change. Diffusing ownership across many people leads to inconsistent prompts and silent quality drift.

Can these plays run with any model?

Yes, though the right mix shifts by model. Reasoning-tuned models tend to need less explicit decomposition and benefit more from verification and routing. Smaller models tend to benefit more from explicit step-by-step structure. Test each play on your target model before committing.

How do I keep the playbook from going stale?

Tie it to a monthly review of quality, cost, and latency per play, and re-run your evaluation set whenever you change models. Plays that lose their edge get trimmed; plays that degrade get refreshed. The review cadence is what prevents staleness.

What is the single biggest mistake teams make here?

Running heavy reasoning on every request. It inflates cost and latency without improving the easy cases. Routing by difficulty fixes this and is usually the highest-leverage play a team can adopt first.

Key Takeaways

Treat multi-step reasoning as a set of named plays, each with a purpose, trigger, owner, and place in sequence.
Start with routing by difficulty and decomposition; add verification and voting as traffic justifies them.
Plan-then-execute inserts a review gate between thinking and doing for high-stakes actions.
Every play needs an owner and instrumentation for quality, cost, and latency.
A monthly review keeps the playbook from rotting as models and inputs change.

Agency Script Editorial
