Temperature gets changed constantly and understood rarely. Most practitioners treat it like a volume knob — higher for "creative," lower for "accurate" — and leave it there. That mental model is too crude to produce reliable results, and it leads to real problems: outputs that hallucinate when they shouldn't, refuse to vary when variation is needed, or behave inconsistently across runs in ways that are hard to debug.
This article covers what temperature and sampling parameters actually do under the hood, how to set them with intention, and the specific configurations that hold up across common professional use cases. The reasoning matters as much as the recommendations, because every deployment has slightly different constraints and you need to be able to adapt, not just copy a table of settings.
If you're newer to how language models generate text at all, The Complete Guide to How Generative AI Works is a good primer to have open alongside this. If you already understand the basics and want to go further, the practices below represent what actually works in production — not in toy demos.
What Temperature Is Actually Doing
A language model doesn't "think" and then "write." It generates one token at a time, and at each step it produces a probability distribution over its entire vocabulary — tens of thousands of possible next tokens, each with an assigned likelihood. Temperature is applied to that distribution before the model samples from it.
Specifically, temperature divides the raw logits (the unnormalized scores) before they're passed through a softmax function. A temperature of 1.0 leaves the distribution unchanged. Lower temperatures sharpen the distribution — high-probability tokens become more dominant, low-probability ones get suppressed toward zero. Higher temperatures flatten the distribution — the gap between likely and unlikely tokens shrinks, so the model samples more adventurously.
At temperature 0 (or effectively 0), the model always picks the single highest-probability token. This is deterministic, but "deterministic" doesn't mean "correct" — it means the model commits fully to whatever path its training and context set it on, with no chance of recovering toward a different, equally valid phrasing.
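To make the scaling concrete, here is a minimal sketch of temperature applied to logits before softmax. The three logit values are made up for illustration, and treating temperature 0 as a pure argmax is an assumption about how an implementation might handle the division by zero, not a description of any specific provider.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities after dividing by temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy decoding,
        # so all probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share growing as temperature drops and shrinking as temperature rises, while the ranking of the tokens never changes.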
The Practical Implication
Low temperature reduces variance. It doesn't reduce error — it just makes errors more consistent. If your prompt is slightly off, a low-temperature model will confidently produce the same wrong answer every time. High temperature increases variance. It can surface better outputs, but it can also surface wildly worse ones. Neither extreme is safe by default.
Sampling Methods Beyond Temperature
Temperature is one control. Most modern APIs expose several others, and they interact. Using them well requires understanding what each does.
Top-P (Nucleus Sampling)
Top-P sampling selects the smallest set of tokens whose cumulative probability exceeds a threshold P, then samples only from that set. At top-p 0.9, the model considers only the tokens collectively accounting for 90% of the probability mass. This dynamically adjusts the candidate pool — when the model is confident, the pool is small; when it's uncertain, the pool expands.
Top-P is generally more reliable than temperature alone for controlling output quality because it adapts to context. A top-p of 0.85–0.95 combined with a moderate temperature (0.7–1.0) is a stable default for most generative tasks.
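The adaptive pool is easier to see in code. The sketch below is a simplified nucleus filter over a plain list of probabilities; the two example distributions are invented, and real inference stacks operate on logit tensors rather than Python lists.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize. Returns {token_index: probability}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

confident = [0.92, 0.04, 0.02, 0.01, 0.01]  # model is sure of the next token
uncertain = [0.28, 0.24, 0.20, 0.16, 0.12]  # model sees many plausible tokens

print(len(top_p_filter(confident)))  # size of the pool when confident
print(len(top_p_filter(uncertain)))  # size of the pool when uncertain
```

With the same threshold, the confident distribution collapses to a single candidate while the uncertain one keeps every token in play.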
Top-K Sampling
Top-K restricts sampling to the K most probable tokens, regardless of how probability is distributed. At top-k 40, only the 40 highest-probability tokens are considered. This is blunter than top-P: the pool size is fixed whether the model is confident or not. Top-K is still used in some local model deployments but has mostly been superseded by top-P in commercial APIs. If your stack exposes both, pick one, or be very deliberate about how they interact.
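For comparison, a minimal top-K filter looks like this. The distributions are invented for illustration, and the function returns the amount of original probability mass the pool covers so you can see what a fixed K ignores.

```python
def top_k_filter(probs, top_k=40):
    """Keep the top_k most probable tokens and renormalize.
    Returns ({token_index: probability}, original mass covered)."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:top_k]
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}, mass

confident = [0.92, 0.04, 0.02, 0.01, 0.01]
uncertain = [0.28, 0.24, 0.20, 0.16, 0.12]

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    pool, mass = top_k_filter(dist, top_k=3)
    print(name, len(pool), round(mass, 2))
```

The pool holds three tokens in both cases. For the confident distribution that keeps two tokens the model barely considers; for the uncertain one it discards more than a quarter of the probability mass.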
Repetition and Frequency Penalties
These are often overlooked but matter significantly in longer outputs. A repetition penalty discourages the model from reusing tokens that have appeared recently. Frequency penalty (as implemented in OpenAI's API, for example) applies a progressive discount to tokens proportional to how often they've appeared in the output so far. Presence penalty applies a flat discount once a token has appeared at all.
For long-form content, a mild frequency penalty (0.1–0.3) prevents the verbal tics and looping that high-temperature models tend toward. For structured or factual outputs, leave these near zero — penalties can cause the model to avoid legitimate repeated terms.
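As a rough sketch of the two penalties, the function below subtracts a count-scaled frequency term and a flat presence term from each logit, following the general form OpenAI documents for its API. Other providers may implement penalties differently (some use a multiplicative repetition penalty instead), so treat the function name, the toy vocabulary of four tokens, and the values as illustrative assumptions.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the output."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty  # grows with each reuse
        adjusted[token_id] -= presence_penalty           # flat, applied once
    return adjusted

logits = [1.0, 1.0, 1.0, 1.0]  # four tokens, equally likely before penalties
generated = [0, 0, 0, 2]       # token 0 used three times, token 2 once

adjusted = apply_penalties(logits, generated, frequency_penalty=0.2, presence_penalty=0.1)
print([round(x, 2) for x in adjusted])
```

Token 0 takes the largest hit because it has been used three times, token 2 takes a smaller one, and the unused tokens are untouched. The same mechanism explains the warning about structured output: a field name that legitimately repeats gets pushed down just like a verbal tic.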
Configuration by Use Case
There is no universal "best" temperature. There are well-reasoned defaults, and they differ by what you're asking the model to do. The framework is simple: the more constrained the correct answer space, the lower the temperature should be.
Factual Extraction and Classification
Temperature: 0–0.2. Top-P: 0.8–0.9. No penalties.
When you're pulling structured data from documents, classifying sentiment, or extracting named entities, you want the most probable token at each step. The correct answer has a narrow range; you don't want sampling variance. Note that even at temperature 0, long contexts can produce slightly different results due to floating-point non-determinism — if exact reproducibility matters, verify this in your specific deployment environment.
Summarization
Temperature: 0.3–0.5. Top-P: 0.85–0.95.
Summaries benefit from mild variation to avoid stilted phrasing, but too much temperature produces summaries that editorialize or hallucinate. This range keeps the model grounded in the source text while allowing natural sentence construction.
Copywriting and Marketing Content
Temperature: 0.7–1.0. Top-P: 0.9–0.95. Mild frequency penalty (0.1–0.2).
Here you want genuine variety. If you're generating five headline options or exploring different angles on a brief, higher temperature is the point — you're paying for variance. The frequency penalty keeps long-form copy from circling back on itself.
Code Generation
Temperature: 0.1–0.4. Top-P: 0.85–0.9. No penalties.
Code has correctness requirements that prose doesn't. Higher temperatures produce syntactically plausible but logically broken code more often than you'd like. Keep temperature low. If you want multiple candidate implementations, sample the same prompt several times at a low but nonzero temperature (0.2–0.4) rather than raising it further: at or near 0 the reruns will be nearly identical, and at high temperature more of the candidates will be broken.
Conversational AI and Chatbots
Temperature: 0.6–0.8. Top-P: 0.9. Mild presence penalty (0.1).
You want natural, varied responses that don't feel scripted, but you also don't want the model going off in unexpected directions. This range produces fluid conversation without significant hallucination risk. The presence penalty nudges the model away from tokens it has already used, which helps keep it from leaning on the same phrases over and over within a response.
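One way to keep these recommendations from drifting across a codebase is to store them in one place. The sketch below encodes the ranges from this section; the `SamplingPreset` class, the `PRESETS` name, the choice of 0.15 as a single value inside the 0.1–0.2 copywriting penalty range, and the use of each range's midpoint as a first value to calibrate from are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingPreset:
    temperature: tuple              # (low, high) recommended range
    top_p: tuple                    # (low, high) recommended range
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0

    def starting_point(self):
        """Midpoint of each range, as a first value to calibrate from."""
        return {
            "temperature": round(sum(self.temperature) / 2, 3),
            "top_p": round(sum(self.top_p) / 2, 3),
            "frequency_penalty": self.frequency_penalty,
            "presence_penalty": self.presence_penalty,
        }

PRESETS = {
    "extraction": SamplingPreset(temperature=(0.0, 0.2), top_p=(0.8, 0.9)),
    "summarization": SamplingPreset(temperature=(0.3, 0.5), top_p=(0.85, 0.95)),
    "copywriting": SamplingPreset(temperature=(0.7, 1.0), top_p=(0.9, 0.95),
                                  frequency_penalty=0.15),
    "code": SamplingPreset(temperature=(0.1, 0.4), top_p=(0.85, 0.9)),
    "chat": SamplingPreset(temperature=(0.6, 0.8), top_p=(0.9, 0.9),
                           presence_penalty=0.1),
}

for task, preset in PRESETS.items():
    print(task, preset.starting_point())
```

Parameter names vary by provider, so map the dictionary keys to whatever your API actually accepts before passing them through.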
The Failure Modes Worth Knowing
Most temperature-related problems in production fall into a handful of recognizable patterns.
Temperature too high for factual tasks. The model starts producing plausible-sounding but incorrect information more frequently. This is particularly insidious because high-temperature outputs often sound confident. If you're seeing inconsistent facts in outputs, lower temperature before any other intervention.
Temperature too low for creative tasks. The model converges on a single "most probable" phrasing for everything. Outputs become formulaic and repetitive across sessions. Users notice even if they can't name why — the content feels machine-made in the bad sense.
Conflicting parameter settings. Top-K and top-P constrain the token pool in different ways. Using both with aggressive values (low top-K and low top-P) can severely limit the model's options, producing choppy or repetitive text even at moderate temperatures.
Ignoring prompt structure. Temperature and sampling interact with prompt quality. A vague prompt at low temperature produces a confident, consistent wrong answer. Better to fix the prompt than to tune temperature to compensate. As discussed in Building a Repeatable Workflow for Large Language Models, prompt structure and model parameters should be tuned together, not in isolation.
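The conflicting-parameters failure mode is easy to reproduce on paper. This sketch counts how many candidate tokens survive top-K, top-P, or both, assuming top-K is applied first and the survivors are renormalized before top-P (a common ordering, but check your own stack). The distribution is invented.

```python
def pool_size(probs, top_k=None, top_p=None):
    """Count the candidate tokens left after optional top-k, then top-p."""
    ranked = sorted(probs, reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
        total = sum(ranked)
        ranked = [p / total for p in ranked]  # renormalize the survivors
    if top_p is not None:
        kept, cumulative = [], 0.0
        for p in ranked:
            kept.append(p)
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    return len(ranked)

# A fairly flat distribution: the model sees many reasonable continuations.
probs = [0.20, 0.15, 0.15, 0.10, 0.10, 0.10, 0.08, 0.06, 0.04, 0.02]

print("top_k=2 only:    ", pool_size(probs, top_k=2))
print("top_p=0.55 only: ", pool_size(probs, top_p=0.55))
print("both together:   ", pool_size(probs, top_k=2, top_p=0.55))
```

Each setting alone leaves the model some room. Together they reduce the pool to a single token, which is greedy decoding in all but name, regardless of the temperature you set.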
How to Run Calibration Tests
Don't assume your defaults work. Test them against the actual task distribution you'll encounter in production.
The minimum viable calibration process:
- Pick 10–20 representative inputs that cover the range of your use case
- Run each at three temperature levels (low / medium / high relative to your starting assumption) with top-P fixed
- Score outputs on two dimensions: accuracy or relevance to task, and naturalness or variety (whichever is relevant)
- Identify the temperature at which accuracy degrades and the temperature below which outputs become repetitively robotic
- Set your default in the middle of the acceptable range, not at its edge
Track failures separately from median performance. A configuration with a high median score but catastrophic outliers isn't production-ready. See also The Large Language Models Playbook for how to build evaluation into your workflow rather than treating it as a one-time setup task.
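The calibration steps above can be sketched as a small harness. Everything model-specific here is a placeholder: `generate` and `score` are hypothetical callables you would supply, and `fake_generate` and `fake_score` are stand-ins that exist only so the example runs without an API key.

```python
import random
import statistics

def calibrate(inputs, temperatures, generate, score, top_p=0.9, runs=3):
    """Return the median and worst-case score for each temperature."""
    report = {}
    for t in temperatures:
        scores = [
            score(text, generate(text, temperature=t, top_p=top_p))
            for text in inputs
            for _ in range(runs)
        ]
        report[t] = {"median": statistics.median(scores), "worst": min(scores)}
    return report

rng = random.Random(0)

def fake_generate(prompt, temperature, top_p):
    # Stand-in for a real model call: bad outputs get likelier as temperature rises.
    return "bad" if rng.random() < temperature * 0.3 else "good"

def fake_score(prompt, output):
    # Stand-in for a real rubric or automated check.
    return 1.0 if output == "good" else 0.0

inputs = [f"representative input {n}" for n in range(10)]
report = calibrate(inputs, [0.0, 0.5, 1.0], fake_generate, fake_score)
for t, stats in report.items():
    print(t, stats)
```

Reporting the worst score next to the median is the point of the design: a temperature whose median looks fine can still hide the outliers that make it unfit for production.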
Temperature Across Different Models
Temperature is not standardized across providers. A temperature of 0.7 in one API is not the same as 0.7 in another. Even the accepted range differs: at the time of writing, OpenAI's API accepts 0 to 2 while Anthropic's accepts 0 to 1. The underlying logit scaling may differ, the tokenizers differ, and each model's inherent entropy (how "spread out" its probability distributions tend to be) varies with training.
This matters practically: if you're migrating prompts from one model to another, or comparing outputs between models, recalibrate temperature rather than porting it. Some models trained with RLHF or instruction tuning tend to have more compressed distributions at baseline — meaning their effective "default" behavior is already lower-entropy than a base model at the same nominal temperature. Large Language Models: The Questions Everyone Asks, Answered covers model architecture differences that inform why this happens.
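A toy calculation shows why a nominal temperature does not transfer. The two logit vectors below are invented: one is relatively flat, as a base model's might be, and one is sharply peaked, as an instruction-tuned model's might be. Entropy in bits measures how spread out the resulting distribution is.

```python
import math

def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

base_like = [2.0, 1.5, 1.0, 0.5]   # made-up logits: relatively flat
tuned_like = [6.0, 2.0, 1.0, 0.5]  # made-up logits: sharply peaked

for t in (0.7, 1.0):
    print(
        t,
        round(entropy_bits(softmax(base_like, t)), 2),
        round(entropy_bits(softmax(tuned_like, t)), 2),
    )
```

In this example the peaked model at temperature 1.0 is still less spread out than the flat model at 0.7, so copying a number from one model to the other would not reproduce the same behavior.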
When Not to Tune Temperature
There are scenarios where temperature tuning is the wrong tool entirely.
When the problem is in the prompt. If outputs are consistently wrong or off-target, that's usually a prompt or context problem, not a parameter problem. Temperature controls how the model samples from its predictions; it doesn't change what the model knows or how it interprets your instructions.
When you need structured output. If you're using function calling, JSON mode, or other structured output features, the model's sampling is often constrained by the format enforcement mechanism anyway. Temperature still applies, but its practical effect is reduced. Test rather than assume.
When you're building tool-use or agentic workflows. Model decisions in these workflows have downstream consequences that compound. Very low temperatures (0–0.3) are safer because the cost of an unexpected sampling choice isn't a single odd sentence; it's an action that may be hard to reverse. Set a low value and leave it there rather than tuning for style. The considerations for agent reliability are explored further in The Future of Large Language Models, where the stakes of unpredictable sampling in autonomous systems become clearer.
Frequently Asked Questions
What temperature should I use for GPT-4 or Claude by default?
For general-purpose use, 0.7 with top-P 0.9 is a reasonable starting point for most providers. From there, adjust based on task: lower for factual or structured tasks, higher for creative work. Neither provider publishes a recommended universal setting. The value you get when you omit the parameter (1.0 in both APIs at the time of writing) and the moderate defaults in their playground interfaces are general-purpose starting values meant for exploration, not production recommendations.
Does temperature 0 guarantee reproducible outputs?
In practice, no. Floating-point arithmetic on different hardware or inference batching can produce minor variations even at temperature 0. If exact reproducibility is a compliance or testing requirement, document this limitation and validate empirically in your deployment environment. Seed parameters, where available, provide more reliable reproducibility than temperature 0 alone.
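Validating this empirically can be as simple as sending the same request repeatedly and counting distinct outputs. In the sketch below, `generate` is a hypothetical callable wrapping your own temperature-0 request, and `fake_generate` is a stand-in that drifts on every tenth call so the example runs offline.

```python
from collections import Counter
from itertools import count

def reproducibility_report(generate, prompt, runs=20):
    """Call generate repeatedly and summarize how often outputs agree."""
    outputs = Counter(generate(prompt) for _ in range(runs))
    _, top_count = outputs.most_common(1)[0]
    return {"distinct_outputs": len(outputs), "agreement": top_count / runs}

_calls = count(1)

def fake_generate(prompt):
    # Stand-in for a temperature-0 model call that drifts on every tenth request.
    return "answer B" if next(_calls) % 10 == 0 else "answer A"

report = reproducibility_report(fake_generate, "Classify this ticket.", runs=20)
print(report)
```

Run the check in the environment you actually deploy to, since hardware and batching behavior can differ between a development machine and production.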
Can I change temperature mid-conversation?
With most stateless APIs, each call is independent, so yes — you can set different temperature values for different turns or tasks within a session. This is actually useful: you might use low temperature for a tool-use turn and higher temperature for a summarization turn in the same workflow. Just track what settings you've used if you need to debug inconsistent behavior.
Does higher temperature cause more hallucination?
Yes, in general. Higher temperature increases the probability that low-confidence tokens get selected, which includes plausible-but-incorrect information. The effect is more pronounced for factual claims about specific entities, dates, or technical details than for stylistic or structural choices. This is the primary reason to use low temperatures for any output whose factual accuracy matters downstream, especially when nothing downstream verifies it.
What's the difference between temperature and top-P in practice?
Temperature reshapes the entire probability distribution before sampling. Top-P truncates the distribution after reshaping, keeping only the most probable tokens that sum to your threshold. In practice, top-P provides a useful ceiling on how far the model strays, even at higher temperatures. They're complementary controls, not alternatives — most production configurations use both.
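Here is a minimal sketch of the two controls working together, in the order described: temperature first, then top-P, then a random draw. The ordering and the argmax shortcut at temperature 0 are assumptions that match common open-source implementations; hosted APIs do not all document their internals. The logits are invented.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Pick a token index: reshape with temperature, truncate with top-p, sample."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Step 1: temperature reshapes the whole distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Step 2: top-p truncates the reshaped distribution.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Step 3: sample from what is left (random.choices normalizes the weights).
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [3.0, 2.5, 0.5, -1.0]  # made-up scores for four candidate tokens
rng = random.Random(42)
draws = [sample_next_token(logits, temperature=1.5, top_p=0.9, rng=rng) for _ in range(200)]
print(sorted(set(draws)))
```

Even at a temperature of 1.5, the least likely token never appears in the draws because top-P removes it from the pool. That is the ceiling effect in action.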
Key Takeaways
- Temperature reshapes a probability distribution; it controls variance, not accuracy. Low temperature produces consistent outputs, including consistently wrong ones.
- Top-P nucleus sampling adapts dynamically to model confidence and is generally more reliable than top-K for most production tasks.
- Configure temperature to the task: 0–0.2 for extraction and classification, 0.3–0.5 for summarization, 0.7–1.0 for copywriting, 0.1–0.4 for code, 0.6–0.8 for conversational use.
- Failure modes divide cleanly: too high produces confident hallucination; too low produces robotic, repetitive output.
- Temperature is not portable across models — recalibrate when switching providers or model versions.
- Run structured calibration tests before setting production defaults, and score for outliers, not just medians.
- When outputs are consistently wrong, fix the prompt before touching the parameters.