Every time you call a language model, you are making decisions about how the output gets sampled, whether you realize it or not. Leave the parameters at their defaults and you have still chosen something. The trouble is that the defaults were tuned for a generic chat experience, and your task is rarely generic. A contract clause extractor and a brand tagline generator want opposite settings, yet teams routinely ship both with whatever the SDK left in place.
The decision space looks small from the outside. Temperature, top-p, frequency and presence penalties, and a max-token cap account for most of it. But these knobs interact, and the interactions are where most people get burned. Turning two of them up at once can produce output that looks creative in a demo and falls apart in production.
This article lays out the competing approaches to sampling control, the axes that actually matter, and a decision rule you can apply without running a week of experiments first. The goal is not to find one perfect configuration but to stop guessing.
The Axes That Actually Matter
Before comparing settings, it helps to name what you are trading between. Almost every sampling decision moves you along one of three axes.
Determinism Versus Variety
Determinism is how repeatable the output is for the same input. A classification prompt should return the same label every time. A brainstorming prompt should return something fresh on each run. Temperature is the primary lever here: low values concentrate probability on the most likely tokens, high values flatten the distribution so less likely tokens get a real chance.
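To make the effect of that lever concrete, here is a minimal sketch of temperature scaling on a toy distribution. The logit values and the function name `softmax_with_temperature` are assumptions for illustration, not any provider's actual implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing each by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() stays numerically stable
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, 0.2)  # mass concentrates on the top token
hot = softmax_with_temperature(logits, 2.0)   # flatter: lower-ranked tokens gain mass
```

With these toy numbers the top token holds nearly all the probability at 0.2 and only about half of it at 2.0, which is the repeatability-for-variety trade in miniature.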
Coherence Versus Surprise
Surprise is not the same as variety. You can get varied output that is still coherent, or you can push so hard that the model emits grammatically valid nonsense. Top-p, also called nucleus sampling, controls this more gracefully than temperature: it keeps only the smallest set of top-ranked tokens whose combined probability reaches p and samples from those. Truncating the long tail of improbable tokens before sampling keeps the model from wandering into genuinely bad choices even when you want range.
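The tail truncation can be sketched in a few lines. This is an illustrative version of top-p filtering over a made-up probability list; the function name `top_p_filter` and the numbers are assumptions, and real inference stacks may differ in tie-breaking and edge cases.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break  # everything after this point is the discarded tail
    total = sum(p for _, p in kept)
    filtered = [0.0] * len(probs)
    for index, p in kept:
        filtered[index] = p / total  # renormalize the surviving tokens
    return filtered

# Hypothetical next-token probabilities, already sorted for readability.
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
nucleus = top_p_filter(probs, 0.9)
```

Here the two least likely tokens are zeroed out, while the three survivors keep their relative ordering and still offer a choice.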
Repetition Versus Drift
Frequency and presence penalties address a narrow but real problem: models that loop on the same phrase or fixate on one idea. Both discourage tokens that have already appeared. In the common formulation, a frequency penalty grows with each repeat of a token, while a presence penalty applies a flat cost once a token has appeared at all. Used lightly they reduce repetition. Used heavily they force the model to drift onto unrelated topics just to avoid words it already used.
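A small sketch shows why light and heavy penalties behave so differently. It assumes one commonly documented formulation, in which the frequency penalty is multiplied by a token's count and the presence penalty is subtracted once for any token already seen; your provider may apply penalties differently. The tokens, logits, and penalty values are made up.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logit of every token that already appeared in the output."""
    counts = Counter(generated_tokens)
    adjusted = dict(logits)  # copy so the caller's logits are untouched
    for token, count in counts.items():
        if token in adjusted:
            adjusted[token] -= count * frequency_penalty + presence_penalty
    return adjusted

# Hypothetical logits and output history.
logits = {"great": 2.0, "good": 1.5, "solid": 1.0}
history = ["great", "great", "good"]

light = apply_penalties(logits, history, frequency_penalty=0.2, presence_penalty=0.1)
heavy = apply_penalties(logits, history, frequency_penalty=1.5, presence_penalty=1.0)
```

Under the light setting "great" is nudged down but stays the favorite. Under the heavy setting both used words fall below the unused one, which is the mechanism behind drift.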
The Competing Approaches
Default-And-Pray
The most common approach is to leave everything alone. This is fine for exploratory work and terrible for anything you measure. The defaults bias toward a middle-of-the-road chat feel, which means deterministic tasks come out slightly too loose and creative tasks come out slightly too safe. If you have ever wondered why your extraction pipeline occasionally hallucinates a field, default temperature is a frequent culprit.
Temperature-Only Tuning
A step up is to treat temperature as the single dial. Lower it for structured tasks, raise it for open ones. This works well enough that many teams never go further, and for a lot of use cases that is the right call. The limitation shows up at the extremes. Very high temperature without a top-p cap lets the worst tokens through, and very low temperature can make output feel robotic even when a little life would help.
Combined Top-p And Temperature
The more robust approach holds temperature moderate and uses top-p to manage the risk of bad tokens. A common pattern is moderate temperature with top-p around 0.9, which gives variety while truncating the tail. The cost is added complexity: now you have two interacting knobs, and changing one shifts the effect of the other. Our walkthrough in A Step-by-Step Approach to Temperature and Creativity Control shows how to isolate them during testing.
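As a rough illustration of how the two knobs combine, the sketch below applies temperature first and top-p second before drawing a token. That ordering, the default values, the function name `sample_token`, and the toy logits are all assumptions for the example; check how your own inference stack orders these steps.

```python
import math
import random

def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token: temperature scaling, then nucleus truncation."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, p in ranked:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the remaining low-probability tail is never sampled
    tokens, probs = zip(*nucleus)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical logits: two plausible tokens, one weak one, one clearly bad one.
logits = {"the": 3.0, "a": 2.5, "one": 1.0, "zebra": -4.0}
rng = random.Random(0)  # seeded so the experiment is repeatable
draws = {sample_token(logits, rng=rng) for _ in range(200)}
```

Because temperature reshapes the distribution before the cutoff is applied, raising it also changes which tokens fall inside the nucleus. That is the interaction the paragraph above warns about.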
Prompt-Led Control
A fourth school argues you should keep sampling conservative and push variety through the prompt itself: ask for five distinct options, specify tone, constrain format. This trades parameter risk for prompt-engineering effort and tends to be more auditable, since the request is explicit rather than statistical.
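A prompt-led request might be assembled like this. The helper name `build_variety_prompt`, its parameters, and the wording of the instructions are hypothetical; the point is only that the request for range lives in reviewable text rather than in a sampling parameter.

```python
def build_variety_prompt(task, n_options=5, tone="plainspoken", output_format="a numbered list"):
    """Ask for range explicitly so sampling settings can stay conservative."""
    return (
        f"{task}\n"
        f"Give {n_options} distinct options that differ in angle, not just wording.\n"
        f"Tone: {tone}.\n"
        f"Format: {output_format}, one option per line, no commentary."
    )

# Hypothetical task text.
prompt = build_variety_prompt("Write a tagline for a reusable water bottle.")
```

A reviewer can read this string and see exactly what variety was asked for, which is what makes the approach auditable.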
A Decision Rule You Can Apply
Most teams want a rule, not a research project. Here is one that holds up.
Start From The Failure You Fear Most
If the worst outcome is an inconsistent or wrong answer, bias toward determinism: low temperature, no penalties, tight format instructions. If the worst outcome is bland or repetitive output, bias toward variety: moderate temperature with a top-p cap, and a small presence penalty if you see looping.
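The rule can be written down as a small lookup. The failure labels, the function name `settings_for`, and every number below are illustrative assumptions to be replaced by values you validate on your own task.

```python
def settings_for(feared_failure, looping_observed=False):
    """Map the failure you fear most to a starting configuration."""
    if feared_failure == "wrong_or_inconsistent":
        # Bias toward determinism: low temperature, no penalties.
        return {"temperature": 0.0, "top_p": 1.0,
                "presence_penalty": 0.0, "frequency_penalty": 0.0}
    if feared_failure == "bland_or_repetitive":
        # Bias toward variety: moderate temperature with a top-p cap,
        # plus a small presence penalty only if looping was actually seen.
        return {"temperature": 0.8, "top_p": 0.9,
                "presence_penalty": 0.3 if looping_observed else 0.0,
                "frequency_penalty": 0.0}
    raise ValueError(f"unknown failure mode: {feared_failure!r}")

extractor = settings_for("wrong_or_inconsistent")
tagline_writer = settings_for("bland_or_repetitive", looping_observed=True)
```

Treat the returned values as a starting point for testing rather than a final answer.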
Change One Knob At A Time
Never adjust temperature and top-p in the same pass. You will not know which one moved the result. Hold one fixed, sweep the other, and write down what you see. This discipline is the single biggest difference between teams that tune effectively and teams that thrash, a point we expand on in 7 Common Mistakes with Temperature and Creativity Control (and How to Avoid Them).
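One way to enforce that discipline is to generate test configurations that can only differ from a baseline in a single parameter. This is a minimal sketch; the function name `one_knob_sweep` and the baseline values are assumptions.

```python
def one_knob_sweep(baseline, knob, values):
    """Return configs that differ from the baseline in exactly one parameter."""
    if knob not in baseline:
        raise KeyError(f"{knob!r} is not a parameter in the baseline")
    return [{**baseline, knob: value} for value in values]

# Hypothetical baseline; top_p is held fixed while temperature is swept.
baseline = {"temperature": 0.7, "top_p": 0.9}
temperature_pass = one_knob_sweep(baseline, "temperature", [0.2, 0.5, 0.8, 1.1])
```

Run each configuration against the same inputs and log the results next to the configuration, so any change in output can be attributed to the one value that moved.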
Match The Setting To The Task, Not The Model
The right configuration is a property of the task, not a global preference. A single application often runs several prompts, and they should not share one temperature. Tag each prompt with its intended behavior and set parameters per call.
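Per-call settings can be kept in a small registry keyed by prompt. The prompt names, behavior tags, and numbers here are hypothetical placeholders; what matters is that an untagged prompt fails loudly instead of silently inheriting defaults.

```python
# Illustrative profiles; the numbers are placeholders to be tuned per task.
PROFILES = {
    "deterministic": {"temperature": 0.0, "top_p": 1.0, "presence_penalty": 0.0},
    "on_brand_variety": {"temperature": 0.7, "top_p": 0.9, "presence_penalty": 0.0},
    "wide_brainstorm": {"temperature": 1.0, "top_p": 0.9, "presence_penalty": 0.4},
}

# Hypothetical prompt names, each tagged with its intended behavior.
PROMPT_BEHAVIOR = {
    "classify_support_ticket": "deterministic",
    "write_product_description": "on_brand_variety",
    "brainstorm_product_names": "wide_brainstorm",
}

def params_for(prompt_name):
    """Return a copy of the sampling parameters for one prompt, or fail loudly."""
    if prompt_name not in PROMPT_BEHAVIOR:
        raise KeyError(f"{prompt_name!r} has no behavior tag; tag it before calling the model")
    return dict(PROFILES[PROMPT_BEHAVIOR[prompt_name]])

classifier_params = params_for("classify_support_ticket")
brainstorm_params = params_for("brainstorm_product_names")
```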
Validate On Outcomes, Not Vibes
A setting that looks creative in three demo runs may be unstable across a thousand. Decide on real samples and a metric you trust. If you do not yet have metrics, the companion piece How to Measure Temperature and Creativity Control: Metrics That Matter gives you a starting instrument.
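A simple stability check illustrates the gap between a few runs and many. The sketch below uses a stand-in function instead of a real model call, so `fake_classifier`, its 5% off-label rate, and the labels are all invented for the example; only `agreement_rate` is the part you would reuse.

```python
import random
from collections import Counter

def agreement_rate(outputs):
    """Fraction of runs that match the most common output."""
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)

def fake_classifier(rng, flip_chance):
    """Stand-in for a model call; returns an off-label answer with probability flip_chance."""
    return "refund" if rng.random() >= flip_chance else "billing"

rng = random.Random(42)  # seeded so the comparison is repeatable
three_runs = [fake_classifier(rng, 0.05) for _ in range(3)]
thousand_runs = [fake_classifier(rng, 0.05) for _ in range(1000)]

small_sample_rate = agreement_rate(three_runs)
large_sample_rate = agreement_rate(thousand_runs)
```

Three runs will often agree perfectly even when the underlying behavior is unstable, so the decision should rest on the larger sample.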
Putting It Together In Practice
Consider three concrete tasks. A support-ticket classifier should run near-deterministic: low temperature, no penalties, a closed list of labels. A product-description writer wants moderate temperature with a top-p cap so it stays on-brand while varying phrasing. A naming brainstorm wants higher temperature, a top-p cap to avoid garbage, and a presence penalty to keep ideas from clustering.
Notice that none of these is the default, and none of them is the same. The discipline is recognizing that the dial position is a design decision tied to a specific job. Once you frame it that way, the trade-offs stop feeling mysterious and start feeling like ordinary engineering choices with known costs.
When The Same Prompt Needs Two Behaviors
A harder case is a single response that contains both a part that must be exact and a part that should be expressive: for example, an extracted figure followed by a written summary. No single temperature serves both, and a compromise underperforms at both ends. The better move is to split the work: run the exact part deterministically and the expressive part loosely, even if that means two calls. Recognizing when a task is secretly two tasks is one of the more valuable instincts you can develop, and it is the through-line of Advanced Temperature and Creativity Control.
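The split can be sketched as two calls with different settings. The `call_model` parameter is a hypothetical callable standing in for whatever client you use, and the stub, prompts, and numbers are assumptions made so the example runs on its own.

```python
def answer_in_two_passes(document, call_model):
    """Run the exact part deterministically, then the expressive part loosely."""
    figure = call_model(
        prompt=f"Extract the total amount from:\n{document}",
        temperature=0.0,
    )
    summary = call_model(
        prompt=f"Summarize this for a customer, quoting the total {figure} verbatim:\n{document}",
        temperature=0.8,
        top_p=0.9,
    )
    return {"figure": figure, "summary": summary}

# Stub model that records the settings of each call instead of hitting an API.
calls = []

def stub_model(prompt, temperature, top_p=1.0):
    calls.append({"temperature": temperature, "top_p": top_p})
    return "1,250 USD" if temperature == 0.0 else "A short, friendly summary."

result = answer_in_two_passes("Invoice total: 1,250 USD", stub_model)
```

Passing the extracted figure into the second prompt keeps the expressive pass from inventing its own number.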
Common Ways The Decision Goes Wrong
Copying A Setting Across Tasks
The most frequent mistake is finding a temperature that works for one prompt and reusing it everywhere. Because the right value is a property of the task, a setting that is perfect for a classifier is wrong for a generator. Resist the urge to standardize on a single global value; standardize on the process of choosing, not on the number itself.
Mistaking Randomness For Range
Teams chasing creativity often push temperature until the output looks unpredictable, then call that creativity. Unpredictable is not the same as good. Genuine range comes from a strong prompt asking for distinct options at a moderate temperature, not from a high temperature that occasionally emits nonsense. The line between range and noise is exactly what the coherence-versus-surprise axis tracks.
Leaving No Record Of Why
A setting chosen without a recorded reason becomes a mystery the moment someone revisits it. Six months later nobody knows whether a temperature was deliberate or accidental, and changing it feels risky. Writing one sentence of justification next to each setting is the cheapest insurance against this, and it is the seed of the team standard described in Rolling Out Temperature and Creativity Control Across a Team.
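One lightweight way to keep that sentence of justification attached to the number is to make it a required field. The class name `SamplingSetting`, its fields, and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingSetting:
    """A sampling configuration that cannot exist without a stated reason."""
    prompt_name: str
    temperature: float
    top_p: float
    reason: str

    def __post_init__(self):
        if not self.reason.strip():
            raise ValueError(f"{self.prompt_name}: a setting needs a recorded reason")

# Hypothetical entry; the values and wording are placeholders.
classifier = SamplingSetting(
    prompt_name="classify_support_ticket",
    temperature=0.0,
    top_p=1.0,
    reason="Labels must be repeatable; variety has no value here.",
)
```

Because the reason lives in the same record as the value, whoever revisits the setting later sees both together.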
Frequently Asked Questions
Should I change temperature or top-p first?
Start with temperature because its effect is easier to reason about: it directly controls how flat the probability distribution is. Once you have a temperature that feels close, add top-p to truncate the bad tail rather than to add variety. Changing both at once makes the effect of each impossible to attribute.
Is a higher temperature always more creative?
No. Beyond a point, higher temperature trades coherence for randomness, and randomness is not creativity. Genuinely creative output usually comes from moderate temperature paired with a strong prompt that asks for range. Past the coherence threshold you mostly get errors that happen to look novel.
Do penalties hurt quality?
Light frequency or presence penalties reduce repetition with little downside. Heavy penalties force the model to avoid words it has already used, which can push it off-topic or degrade fluency. Use the smallest penalty that fixes the looping you actually observe.
Can I just leave everything at defaults?
For casual or exploratory use, defaults are fine. For anything you measure, ship, or bill a client for, defaults are a liability because they are tuned for generic chat, not your specific task. At minimum, set temperature deliberately per prompt.
Key Takeaways
- Sampling decisions move along three axes: determinism versus variety, coherence versus surprise, and repetition versus drift.
- The four approaches are leaving defaults alone, tuning temperature only, combining temperature with top-p, and driving variety through the prompt itself.
- Choose settings from the failure you fear most, then change one knob at a time.
- Configuration is a property of the task, not the model, so set parameters per prompt rather than globally.
- Validate on real samples and a trusted metric, not on a handful of demo runs.