Most teams treat temperature as a setting someone fiddles with once and forgets. That works until you have a dozen people shipping AI features, each with their own habits, and nobody can explain why one workflow produces stable output while another produces chaos. A playbook turns scattered intuition into shared practice.
This is an operating playbook, not a tutorial. It assumes you already understand what temperature and top-p do and want to run them across a team with clear plays, defined triggers, and named owners. Each play below answers three questions: when it fires, who owns it, and what the sequence is.
The structure matters because output variety is a cross-cutting concern. It touches prompt design, evaluation, cost, and brand voice at once. Leaving it implicit means the people responsible for each of those areas quietly assume a different default.
Play One: Establish a House Default
The Trigger
Fire this play at the start of any new project or whenever a team has more than two people writing prompts. The symptom that you are overdue is inconsistency: similar tasks shipping with wildly different settings.
The Sequence
Pick a documented house default, typically a temperature near 0.7 with top-p at 1.0, as the starting point for exploratory work, plus a documented low-temperature profile near 0.2 for structured or extraction tasks. The point is not that these numbers are universally optimal; it is that everyone starts from the same place and deviates deliberately. The owner is whoever maintains your prompt standards, often a lead engineer or prompt specialist. For the reasoning behind defaults, see A Step-by-Step Approach to Temperature and Creativity Control.
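One way to make the house default concrete is to keep it in a small, importable module rather than in people's heads. The sketch below is a minimal illustration, not a prescribed layout: the names `SamplingProfile`, `HOUSE_PROFILES`, and `get_profile` are hypothetical, and the top-p of 1.0 on the structured profile is an assumption, since the text above only fixes its temperature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingProfile:
    """A named set of sampling parameters with its intended use."""
    name: str
    temperature: float
    top_p: float
    intended_for: str


# Starting points taken from the text; top_p for "structured" is an assumption.
HOUSE_PROFILES = {
    "exploratory": SamplingProfile("exploratory", 0.7, 1.0, "ideation and exploratory work"),
    "structured": SamplingProfile("structured", 0.2, 1.0, "extraction and structured output"),
}


def get_profile(name: str = "exploratory") -> SamplingProfile:
    """Look up a house profile, failing loudly on unknown names."""
    if name not in HOUSE_PROFILES:
        known = ", ".join(sorted(HOUSE_PROFILES))
        raise ValueError(f"Unknown profile {name!r}; known profiles: {known}")
    return HOUSE_PROFILES[name]


print(get_profile("structured"))
```

Because the profiles are frozen and named, a deviation has to show up as a new profile or an explicit override, which is exactly the deliberate step this play asks for.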
Play Two: Classify the Task Before Setting the Dial
The Trigger
Fire this whenever a new prompt or feature is being built. The setting should follow from the task type, not from the author's mood.
The Sequence
Sort the task into one of three buckets. Deterministic tasks (extraction, classification, code, calculations) get a low temperature. Generative tasks (ideation, naming, first-draft copy) get a higher temperature. Hybrid tasks get a staged approach: a high temperature for the divergent step, then a low temperature for the convergent step. The author owns the classification; a reviewer confirms it.
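The three buckets can be written down as a lookup so the setting follows mechanically from the classification. This is a sketch under stated assumptions: `TaskType`, `STAGE_PLAN`, and `plan_stages` are hypothetical names, and the values 0.2 and 0.9 are illustrative stand-ins for "low" and "higher", not numbers the playbook mandates.

```python
from enum import Enum
from typing import List, Tuple


class TaskType(Enum):
    DETERMINISTIC = "deterministic"  # extraction, classification, code, calculations
    GENERATIVE = "generative"        # ideation, naming, first-draft copy
    HYBRID = "hybrid"                # diverge first, then converge


# Illustrative temperatures only (assumptions, not recommendations from the text).
LOW, HIGH = 0.2, 0.9

STAGE_PLAN = {
    TaskType.DETERMINISTIC: [("single", LOW)],
    TaskType.GENERATIVE: [("single", HIGH)],
    TaskType.HYBRID: [("diverge", HIGH), ("converge", LOW)],
}


def plan_stages(task_type: TaskType) -> List[Tuple[str, float]]:
    """Return the (stage name, temperature) pairs for a classified task."""
    return list(STAGE_PLAN[task_type])


for task in TaskType:
    print(task.value, plan_stages(task))
```

The reviewer then only has to confirm one thing, the bucket, rather than argue about a decimal.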
Why Classification Beats Tuning
Tuning by trial and error is slow and rarely documented. Classifying first gives you a defensible reason for the setting, which makes handoffs and reviews far cleaner. The common errors this prevents are catalogued in 7 Common Mistakes with Temperature and Creativity Control (and How to Avoid Them).
Play Three: Generate-Then-Select for Creative Work
The Trigger
Fire this whenever the deliverable is a single best output chosen from many possibilities, such as headlines, taglines, or campaign concepts.
The Sequence
Run a high-temperature pass to produce a broad candidate set, then a low-temperature or human pass to evaluate and select. Separate the two stages explicitly in your code or process so the variety and the quality control never fight each other. The creative lead owns selection; the engineer owns generation.
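The separation of stages can be sketched as two functions that share nothing but the candidate list. Everything here is an assumption for illustration: `complete` stands for whatever client call your team uses (taking a prompt and a temperature and returning text), `fake_complete` is a stub so the example runs without a model, and the temperatures 0.9 and 0.0 are placeholders.

```python
import random
from typing import Callable, List

# Hypothetical client signature: (prompt, temperature) -> generated text.
Complete = Callable[[str, float], str]


def generate_candidates(complete: Complete, brief: str, n: int,
                        temperature: float = 0.9) -> List[str]:
    """Stage 1: sample n times at high temperature and drop exact duplicates."""
    candidates: List[str] = []
    for _ in range(n):
        text = complete(brief, temperature).strip()
        if text and text not in candidates:
            candidates.append(text)
    return candidates


def select_best(complete: Complete, brief: str, candidates: List[str],
                temperature: float = 0.0) -> str:
    """Stage 2: ask for a single pick at low temperature and validate the reply."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    prompt = (f"Brief: {brief}\nCandidates:\n{numbered}\n"
              "Reply with the number of the best candidate only.")
    index = int(complete(prompt, temperature).strip()) - 1
    if not 0 <= index < len(candidates):
        raise ValueError("selector returned an out-of-range choice")
    return candidates[index]


def fake_complete(prompt: str, temperature: float) -> str:
    """Stub standing in for a real model call."""
    if "Reply with the number" in prompt:
        return "1"
    return random.choice(["Ship faster", "Build boldly", "Less noise, more signal"])


random.seed(0)
shortlist = generate_candidates(fake_complete, "tagline for a dev tool", n=10)
print(select_best(fake_complete, "tagline for a dev tool", shortlist))
```

A human selector slots into the same shape: replace `select_best` with a review step and keep `generate_candidates` unchanged.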
Sequencing Notes
Do not collapse generation and selection into one prompt at a medium temperature. You will get mediocre variety and mediocre judgment. The strength of this play comes from letting each stage do one job well.
Play Four: Lock Down Production Pipelines
The Trigger
Fire this before any prompt moves from experimentation into an automated, high-volume pipeline where no human reviews each output.
The Sequence
Lower the temperature to the minimum that still satisfies the task, pin the exact parameters in version control, and treat any change to them as a reviewed code change. Sampling settings that change in production without review are how subtle regressions slip in. The owner is the engineer responsible for the pipeline, with sign-off from whoever owns quality.
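Pinning is easier to enforce when a check can tell you that the live parameters no longer match the approved ones. The following is a minimal sketch, assuming the approved parameters live in a reviewed file and are loaded as a dictionary; `fingerprint`, `changed_keys`, and the model name `example-model-v1` are hypothetical.

```python
import hashlib
import json

_MISSING = object()  # sentinel so a missing key differs from an explicit None


def fingerprint(params: dict) -> str:
    """Stable hash of the parameters, independent of key order."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_keys(approved: dict, live: dict) -> list:
    """Names of parameters that were added, removed, or modified."""
    keys = set(approved) | set(live)
    return sorted(k for k in keys
                  if approved.get(k, _MISSING) != live.get(k, _MISSING))


# Approved values would come from the reviewed, version-controlled file.
APPROVED = {"model": "example-model-v1", "temperature": 0.1, "top_p": 1.0}
live = dict(APPROVED, temperature=0.4)  # simulated unreviewed change

if fingerprint(live) != fingerprint(APPROVED):
    print("Sampling parameters changed without review:", changed_keys(APPROVED, live))
```

Run as part of CI or at pipeline startup, a check like this turns a quiet config edit into a visible failure that needs sign-off.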
Connecting to Evaluation
A locked pipeline is only as trustworthy as the evaluation behind it. Pair this play with a regression suite that catches drift when models or settings change. The toolchain for this lives in The Best Tools for Temperature and Creativity Control.
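A regression suite for a locked pipeline can start very small: a list of representative prompts, each with a check that says whether the output is acceptable. The sketch below assumes a pipeline that should return JSON with a `total` field; `GoldenCase`, `run_regression`, and `fake_pipeline` are hypothetical names, and the stub stands in for the real pipeline call.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True when the output is acceptable


def run_regression(pipeline: Callable[[str], str],
                   cases: List[GoldenCase]) -> List[str]:
    """Return the names of the cases whose output fails its check."""
    return [case.name for case in cases if not case.check(pipeline(case.prompt))]


def is_json_with_total(text: str) -> bool:
    """Acceptance check: output parses as a JSON object containing 'total'."""
    try:
        data = json.loads(text)
    except ValueError:
        return False
    return isinstance(data, dict) and "total" in data


def fake_pipeline(prompt: str) -> str:
    """Stub for the locked pipeline; only handles invoices correctly."""
    return '{"total": 42}' if "invoice" in prompt else "unparseable"


CASES = [
    GoldenCase("invoice_total", "Extract the invoice total", is_json_with_total),
    GoldenCase("receipt_total", "Extract the receipt total", is_json_with_total),
]

print("Failing cases:", run_regression(fake_pipeline, CASES))
```

Checks that test structure or constraints tend to survive small wording changes better than exact-string comparisons, which matters if the pipeline runs at a temperature above zero.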
Play Five: Run a Settings Review on a Cadence
The Trigger
Fire this on a recurring schedule, such as quarterly, or whenever you upgrade to a new model version.
The Sequence
Pull a sample of live outputs across your major workflows, check whether the temperature settings still produce the intended behavior, and adjust. Model upgrades in particular can change how a given temperature behaves, so a setting that was perfect six months ago may now be too tame or too wild. The quality owner runs the review; workflow owners implement changes.
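To make "too tame or too wild" something you can check rather than feel, the review can compare a simple variety measure against an expected band per workflow. This is a rough sketch: the distinct-output ratio is a crude proxy chosen for illustration, and the function names and band values are assumptions rather than established thresholds.

```python
from typing import Dict, List, Tuple


def distinct_ratio(outputs: List[str]) -> float:
    """Share of outputs that are unique after case and whitespace normalization."""
    if not outputs:
        raise ValueError("need at least one output to review")
    normalized = [" ".join(o.lower().split()) for o in outputs]
    return len(set(normalized)) / len(normalized)


def review(samples: Dict[str, List[str]],
           expected: Dict[str, Tuple[float, float]]) -> Dict[str, str]:
    """Flag workflows whose variety falls outside their expected band."""
    findings = {}
    for workflow, outputs in samples.items():
        low, high = expected[workflow]
        ratio = distinct_ratio(outputs)
        if ratio < low:
            findings[workflow] = "too tame"
        elif ratio > high:
            findings[workflow] = "too wild"
    return findings


# Hypothetical sampled outputs and bands for two workflows.
samples = {
    "taglines": ["Go far", "go  far", "Go far", "Go Far"],
    "extraction": ["42", "42", "42", "41"],
}
expected = {"taglines": (0.6, 1.0), "extraction": (0.0, 0.5)}

print(review(samples, expected))
```

A flagged workflow is a prompt to read the samples, not a verdict; the number only tells the quality owner where to look first.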
Play Six: Document the Why, Not Just the Number
The Trigger
Fire this every time a setting is chosen or changed.
The Sequence
Record the temperature, the top-p, and a one-line rationale tied to the task type. A number with no reason is impossible to maintain; six months later nobody remembers whether 0.4 was deliberate or an accident. Documentation is the cheapest insurance against silent drift, and it makes the broader guidance in Best Practices That Actually Work far easier to apply.
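One lightweight way to enforce the rationale is to make it a required field of the record itself, so a bare number cannot be saved. The sketch below is illustrative: `SettingRecord` and its fields are hypothetical, and the accepted ranges are assumptions, since valid limits vary by provider.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SettingRecord:
    workflow: str
    task_type: str
    temperature: float
    top_p: float
    rationale: str

    def __post_init__(self):
        # Refuse records that carry a number but no reason.
        if not self.rationale.strip():
            raise ValueError("a one-line rationale is required")
        # Assumed ranges; check your provider's documented limits.
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")

    def to_json(self) -> str:
        """Serialize with sorted keys so diffs in version control stay small."""
        return json.dumps(asdict(self), sort_keys=True)


record = SettingRecord(
    workflow="invoice-extraction",
    task_type="deterministic",
    temperature=0.2,
    top_p=1.0,
    rationale="Structured extraction; variety adds no value here.",
)
print(record.to_json())
```

Stored next to the prompt in version control, a record like this answers the "was 0.4 deliberate?" question directly from the commit history.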
Play Seven: Handle Model Migrations as Events
The Trigger
Fire this whenever you adopt a new model or a major model version. Treat the migration as a discrete event with its own checklist rather than a silent swap in a config file.
The Sequence
Re-run your representative inputs at the existing settings on the new model and compare behavior side by side. A temperature that produced disciplined output on the previous model may read as bland or as erratic on the new one, because each model has its own probability landscape. Recalibrate where behavior shifts, re-lock the parameters, and record what changed. The workflow owner runs the migration, and the quality owner verifies the results. Skipping this play is one of the easiest ways for a previously stable feature to degrade after an upgrade.
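The side-by-side comparison can be sketched as a loop over representative inputs that scores how far the new model's output has moved from the old one. Treat this as an illustration under assumptions: `old_model` and `new_model` are stubs for real client calls, the 0.6 threshold is arbitrary, and character-level similarity from the standard library's `difflib.SequenceMatcher` is only a coarse signal of changed behavior.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

# Hypothetical client signature: (prompt, temperature) -> generated text.
Model = Callable[[str, float], str]


def compare_models(old: Model, new: Model, inputs: List[str],
                   temperature: float, threshold: float = 0.6) -> List[Dict]:
    """Run both models at the same setting and flag inputs whose output shifted."""
    report = []
    for prompt in inputs:
        before = old(prompt, temperature)
        after = new(prompt, temperature)
        similarity = SequenceMatcher(None, before, after).ratio()
        report.append({
            "prompt": prompt,
            "similarity": round(similarity, 2),
            "shifted": similarity < threshold,
        })
    return report


def old_model(prompt: str, temperature: float) -> str:
    """Stub for the current production model."""
    return "Total: 42"


def new_model(prompt: str, temperature: float) -> str:
    """Stub for the candidate model; behaves differently on one input."""
    if prompt == "invoice A":
        return "Total: 42"
    return "I could not find a total in this document."


for row in compare_models(old_model, new_model, ["invoice A", "invoice B"], 0.2):
    print(row)
```

At a nonzero temperature a single sample per input is noisy, so a real migration run would likely draw several samples per input and have a person read the flagged cases before recalibrating.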
Why It Earns Its Place
Model migrations are deceptively risky because nothing in the prompt changed, so teams assume nothing in the output will change either. The sampling behavior, however, is a property of the model as much as of the setting. Naming the migration as a play forces the comparison that catches the regression before customers do, and it connects directly to the regression discipline described in A Step-by-Step Approach to Temperature and Creativity Control.
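A small numeric illustration shows why the same temperature can behave differently across models. Temperature divides the model's logits before the softmax, so the resulting spread depends on the logits the model produced in the first place. The two logit vectors below are invented for the example and do not come from any real model.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive for sampling")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; higher means more varied sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Invented next-token logits for the same prompt on two hypothetical models.
confident_model = [5.0, 1.0, 0.0]
hedging_model = [1.2, 1.0, 0.8]

for name, logits in [("confident", confident_model), ("hedging", hedging_model)]:
    probs = softmax_with_temperature(logits, 0.7)
    print(name, [round(p, 3) for p in probs], "entropy:", round(entropy(probs), 3))
```

At an identical temperature of 0.7, the model with sharply separated logits stays close to its top choice while the model with flatter logits samples almost uniformly, which is the kind of shift a migration check is meant to surface.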
Frequently Asked Questions
Who should own temperature decisions on a team?
Ownership splits by stage. A standards owner, usually a lead engineer or prompt specialist, maintains the house defaults and the classification scheme. Individual authors own the per-task classification and initial setting. A quality owner runs periodic reviews and approves production changes. The mistake is leaving it ownerless, which is how every developer ends up with a private, undocumented preference.
How is a playbook different from just picking good defaults?
Defaults are a single decision; a playbook is a system of decisions with triggers and owners. Defaults tell you where to start. The playbook tells you when to deviate, who decides, and how the choice gets reviewed and maintained over time. Without the surrounding plays, defaults erode the moment someone hits an edge case.
Does a playbook slow teams down?
In the short term it adds a small amount of structure; over time it removes far more friction than it adds. The slow part of ad hoc tuning is the rework, the inconsistency, and the debugging of mysterious output variance. A playbook front-loads a few cheap decisions to avoid those expensive ones.
How often should the settings review fire?
Quarterly is a reasonable baseline for most teams, with an additional review triggered by any model upgrade. High-volume production systems may warrant more frequent checks. The signal to increase cadence is finding meaningful drift each time you review; if reviews keep surfacing problems, you are reviewing too rarely.
Key Takeaways
- Treat output variety as a team-level operating concern with documented plays, triggers, and owners, not a per-prompt afterthought.
- Establish a house default and a task-classification scheme so settings follow from the work rather than from habit.
- Use generate-then-select for creative deliverables, keeping high-variety generation and low-variety selection as distinct stages.
- Lock down production pipelines with pinned, reviewed parameters and pair them with regression evaluation.
- Review settings on a cadence and after model upgrades, and always document the rationale behind each chosen number.