Same Mediocre Phrase, Three Runs Later: Sampling Beyond Temperature

If you've used language models long enough to know that temperature 0 makes outputs deterministic and temperature 1 makes them creative, you've cleared the first hurdle. But that knowledge doesn't explain why your "creative" prompt still produces the same mediocre phrase three runs in a row, why lowering temperature sometimes makes outputs worse, or how to squeeze genuinely useful behavior out of models without burning tokens on trial and error. Working with temperature and sampling at an advanced level means understanding the machinery well enough to diagnose failure modes and intervene deliberately.

This article is for practitioners who want the full picture: the probabilistic mechanics under the hood, the interaction effects between parameters, the cases where common advice is just wrong, and the decision logic professionals use when configuring models for real workflows. We'll cover top-p, top-k, min-p, repetition penalties, and the scenarios where combining parameters produces unexpected results. If you want to understand not just what these settings do but why they work the way they do, this is the right place.

What Temperature Actually Does to the Distribution

Temperature doesn't add randomness the way a dice roll does. It reshapes the probability distribution over the model's vocabulary before any token is sampled. Specifically, the raw output of a language model's final layer is a set of logits — unnormalized scores for every token in the vocabulary, which can number 32,000 to 100,000+ tokens depending on the model.

Softmax converts those logits into a probability distribution. Temperature divides the logits before softmax is applied. A temperature of 0.5 divides all logits by 0.5 (doubling them), which sharpens the distribution: high-probability tokens become much more probable, low-probability tokens become negligible. A temperature of 2.0 divides by 2 (halving them), flattening the distribution and making unlikely tokens far more competitive.
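The scaling step described above can be sketched in a few lines. This is a minimal illustration, not any particular runtime's implementation; the three logits are made-up values and the function name is ours.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        # T = 0 cannot be divided by; runtimes treat it as greedy decoding instead.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The top token's share grows as temperature drops and shrinks as it rises, while the ranking of tokens never changes.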

Why "Temperature 0 Is Deterministic" Needs a Caveat

In practice, temperature 0 is implemented as greedy decoding — the model always picks the single highest-probability token. This is deterministic per model version and hardware configuration, but different infrastructure, quantization levels, or model updates can change which token wins. If you're building a production workflow that depends on exact reproducibility, temperature 0 is necessary but not always sufficient. You also need to pin the model version and be aware of floating-point nondeterminism on different hardware.
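To see why floating-point behavior matters for greedy decoding, consider a toy case where two candidate tokens are nearly tied. The sketch below is an illustration under simplified assumptions: real logits come from large matrix multiplications, and the order in which hardware accumulates those sums can differ in the same way the two groupings differ here.

```python
def greedy_pick(logits):
    """Return the index of the highest logit (first index wins a tie)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# The same three numbers summed in two different orders.
score_one_order = (0.1 + 0.2) + 0.3
score_other_order = 0.1 + (0.2 + 0.3)
print(score_one_order == score_other_order)  # False

rival = 0.6  # a competing token's logit
print(greedy_pick([rival, score_one_order]))    # 1: the summed score edges ahead
print(greedy_pick([rival, score_other_order]))  # 0: now an exact tie, first index wins
```

A difference in the last bit is enough to flip the winner, and every later token is then conditioned on a different prefix.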

The Low-Temperature Trap

Reducing temperature below 0.3–0.4 can cause repetitive, stilted outputs — not because the model is being more accurate, but because you're collapsing the distribution onto a small set of tokens that co-occur frequently in training data. You get the statistical average of all documents similar to your prompt, which is often bland and formulaic. This is particularly pronounced in open-ended writing tasks. For factual extraction or structured data generation, low temperature is appropriate. For anything requiring judgment or voice, it often isn't.

Top-p (Nucleus) Sampling: What the Truncation Actually Means

Top-p sampling, also called nucleus sampling, restricts sampling to the smallest set of tokens whose cumulative probability exceeds a threshold p. If p = 0.9, the model identifies the minimum set of tokens that together account for 90% of the probability mass, then samples only from that set.

The reason this exists is that vocabulary-level probability distributions are highly skewed. In most contexts, a few tokens hold the overwhelming majority of the probability — but in ambiguous or creative contexts, probability spreads across many more candidates. Top-p adapts dynamically to this variance. A fixed top-k (see below) doesn't.
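A minimal sketch of the truncation, using two made-up distributions to show how the nucleus adapts. The function name is ours, and we keep tokens until the running total reaches p; implementations differ on how they treat the exact boundary.

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalized_prob} for the smallest set reaching mass p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

peaked = [0.92, 0.05, 0.02, 0.01]        # high-certainty context
spread = [0.28, 0.24, 0.20, 0.16, 0.12]  # ambiguous context

print(len(top_p_filter(peaked, 0.9)))  # 1
print(len(top_p_filter(spread, 0.9)))  # 5
```

With the same p, the peaked distribution keeps a single token while the spread one keeps all five.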

The Interaction Between Temperature and Top-p

This is where practitioners make consistent mistakes. Temperature reshapes the distribution before top-p truncates it. Set temperature to 1.5 and top-p to 0.9, and you've first flattened the distribution (temperature), then cut off the long tail at 90% (top-p). The resulting nucleus is much larger than it would be at temperature 1.0 with top-p 0.9, because flattening the distribution spreads mass across more tokens before truncation.

The practical upshot: these parameters are not independent levers. When you raise temperature, you should generally lower top-p to compensate, and vice versa. Defaults like temperature 1.0 and top-p 1.0 represent a specific design choice by the API provider, not a neutral baseline.
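The interaction is easy to measure on a toy distribution. The sketch below assumes a hypothetical vocabulary of 40 tokens whose logits fall off linearly, applies temperature first, then counts how many tokens a top-p of 0.9 lets through.

```python
import math

def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """Count the tokens needed for cumulative probability to reach p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

logits = [5.0 - 0.5 * i for i in range(40)]  # made-up, steadily decaying scores
sizes = {t: nucleus_size(softmax(logits, t), 0.9) for t in (0.7, 1.0, 1.5)}
print(sizes)  # {0.7: 4, 1.0: 5, 1.5: 7}
```

Top-p never changed, yet the number of eligible tokens nearly doubled between temperature 0.7 and 1.5.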

Top-k Sampling: Where It Fits and Where It Doesn't

Top-k restricts sampling to the k most probable tokens at each step, regardless of their probabilities. K = 50 means the model samples from only the 50 highest-scoring tokens every time.

The weakness is that k is context-blind. In a high-certainty context (e.g., completing "The capital of France is"), 50 viable candidates is absurd — there's one right answer and 49 noise tokens. In a low-certainty context (e.g., starting a creative metaphor), 50 may not be enough. Top-p adapts; top-k doesn't.

Where top-k still earns its place: local inference and fine-tuned specialty models where you've characterized the distribution in advance, or as a hard ceiling above top-p (e.g., "never sample from more than 200 tokens regardless of p"). Some practitioners use top-k as a safety backstop even when top-p is their primary control, setting k at something like 100–200 to prevent pathological token selections in edge cases.
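The backstop pattern can be sketched as top-k applied first, then top-p inside whatever survives. The ordering and the function name are assumptions for illustration; runtimes chain their samplers in different orders.

```python
def top_k_then_top_p(probs, k, p):
    """Keep at most k tokens, then stop early once cumulative mass reaches p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:k]
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append(index)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

# Pathologically flat distribution: 1,000 tokens with equal probability.
flat = [0.001] * 1000
# Confident distribution: one dominant token and a long, thin tail.
peaked = [0.97] + [0.03 / 999] * 999

print(len(top_k_then_top_p(flat, k=1000, p=0.95)))  # top-p alone keeps hundreds
print(len(top_k_then_top_p(flat, k=200, p=0.95)))   # the ceiling holds it to 200
print(len(top_k_then_top_p(peaked, k=200, p=0.95))) # top-p stops at 1
```

In the common case top-p does the work and the ceiling is never reached; it only bites when the distribution is unusually flat.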

Min-p: The Underused Parameter Worth Understanding

Min-p is a newer sampling method adopted in several open-source model runtimes (notably llama.cpp and its derivatives). Instead of truncating by cumulative probability (top-p) or absolute count (top-k), min-p sets a floor based on the most probable token's probability. Only tokens with probability greater than (p × max_probability) are eligible.

If the most likely token has a 60% probability and min-p is 0.1, then any token with probability above 6% is eligible. If the top token has only 20% probability, the floor drops to 2%, so more candidates qualify. The eligible set shrinks when the model is confident and expands when it is not.
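The two cases above can be reproduced with a short sketch. The distributions are invented to match the 60% and 20% peaks, and the function name is ours; whether a token sitting exactly on the threshold is kept is an implementation detail.

```python
def min_p_filter(probs, min_p):
    """Return indices of tokens whose probability is at least min_p times the peak."""
    threshold = min_p * max(probs)
    return [i for i, prob in enumerate(probs) if prob >= threshold]

confident = [0.60, 0.20, 0.08, 0.05, 0.04, 0.03]  # peak 60%, floor 6%
uncertain = [0.20, 0.18, 0.15, 0.12, 0.10, 0.08,
             0.07, 0.05, 0.03, 0.015, 0.005]      # peak 20%, floor 2%

print(len(min_p_filter(confident, 0.1)))  # 3
print(len(min_p_filter(uncertain, 0.1)))  # 9
```

The same min-p value keeps three candidates when the model is sure of itself and nine when it is not.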

This makes min-p particularly effective for high-temperature creative generation. At temperature 1.5, top-p struggles to keep genuinely incoherent tokens out, because flattening the distribution spreads probability mass so widely that the cumulative cutoff admits much of the tail. Min-p, anchored to the current distribution's peak, stays proportionally tight. If you're working with open-source models and your runtime supports it, min-p at values around 0.05–0.1 with elevated temperature is worth testing against your top-p defaults.

Repetition Penalties and Frequency Penalties: The Overlooked Dimension

Temperature and sampling control breadth of token selection. Repetition and frequency penalties control history dependence — how much the model discounts tokens it has already generated.

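A repetition penalty (common in open-source runtimes) makes any token that has already appeared in the context less likely by rescaling its logit with a penalty factor > 1.0. Common implementations divide positive logits by the factor and multiply negative ones, since dividing a negative logit would raise it. A value of 1.2 meaningfully suppresses repeated tokens; 1.5 suppresses them aggressively and can cause the model to avoid necessary words like "the" or "and."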
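A minimal sketch of the multiplicative penalty, assuming the sign-aware variant used by several open-source runtimes: positive logits are divided by the factor and negative logits are multiplied by it, so a seen token always becomes less likely. The function name and example values are ours.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of logits with every previously seen token pushed down."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty  # shrink a positive score
        else:
            adjusted[token_id] *= penalty  # make a negative score more negative
    return adjusted

logits = [3.0, 1.5, -1.0, 0.5]  # hypothetical scores for token ids 0..3
seen = [0, 2, 0]                # token 0 appeared twice, token 2 once
print(apply_repetition_penalty(logits, seen, 1.2))
```

Note that token 0 is penalized once even though it appeared twice: this style of penalty is flat per token, not scaled by count.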

OpenAI's API exposes frequency penalty (scales with how many times a token appeared) and presence penalty (flat penalty for any token that appeared at all). These are not the same thing, and conflating them leads to misconfiguration. Use presence penalty to encourage topic novelty; use frequency penalty to reduce phrase-level repetition.
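The difference between the two is clearest side by side. The sketch below follows the additive form OpenAI's documentation describes, as best we understand it: subtract the count times the frequency penalty, plus a flat presence penalty for any token seen at least once. Treat the function name and the details as assumptions rather than the provider's actual code.

```python
from collections import Counter

def apply_additive_penalties(logits, generated_token_ids,
                             frequency_penalty, presence_penalty):
    """Subtract count-scaled and flat penalties from previously generated tokens."""
    counts = Counter(generated_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted

logits = [2.0, 2.0, 2.0]   # three equally likely tokens
generated = [0, 0, 0, 1]   # token 0 used three times, token 1 once

print(apply_additive_penalties(logits, generated, 0.5, 0.0))  # [0.5, 1.5, 2.0]
print(apply_additive_penalties(logits, generated, 0.0, 0.5))  # [1.5, 1.5, 2.0]
```

Frequency penalty hits the thrice-used token three times as hard; presence penalty treats both used tokens identically.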

The interaction with temperature is significant: at high temperature, repetition penalties can compound unpredictably, pushing the model toward rare tokens that produce incoherence. At low temperature, a moderate repetition penalty can help break the formulaic loops that temperature suppression tends to create. Understanding how generative AI works at the architectural level makes these interaction effects easier to reason about, because you can see that penalties modify logits in the same pass as temperature scaling.

Sampling Strategy by Task Type

The right configuration depends heavily on what you're optimizing for. Here is a practical map:

Factual extraction, structured output, classification: Temperature 0–0.2, top-p 0.9 or lower, no repetition penalty. You want the high-probability answer, not distribution exploration.
Summarization and editing: Temperature 0.3–0.5, top-p 0.85–0.95. Some flexibility in phrasing, but factual grounding matters.
Copywriting and marketing: Temperature 0.7–0.9, top-p 0.9–0.95, light presence penalty (0.1–0.3). You want variety across outputs and some novelty of expression.
Brainstorming, ideation, creative fiction: Temperature 0.9–1.3, top-p 0.95, or use min-p if available. Presence penalty can help avoid returning to the same conceptual territory.
Code generation: Temperature 0.1–0.4. Code is a high-certainty domain; most tokens have a right answer given context. High temperature generates valid-looking syntax that doesn't work.

The broader trade-offs and decision framework for generative AI apply here too: there's no universally optimal configuration, only configurations that make explicit trade-offs among accuracy, diversity, and coherence.
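One way to keep these baselines explicit in a codebase is a small preset table. The sketch below stores the ranges from the map above and, as an assumption of ours, uses the midpoint of each range as a starting value; the task keys and function name are hypothetical, and the numbers are starting points to calibrate, not recommendations for any specific model.

```python
# (low, high) ranges taken from the task map; "0.9 or lower" is stored as its upper bound.
PRESETS = {
    "extraction":    {"temperature": (0.0, 0.2), "top_p": (0.9, 0.9)},
    "summarization": {"temperature": (0.3, 0.5), "top_p": (0.85, 0.95)},
    "copywriting":   {"temperature": (0.7, 0.9), "top_p": (0.9, 0.95),
                      "presence_penalty": (0.1, 0.3)},
    "brainstorming": {"temperature": (0.9, 1.3), "top_p": (0.95, 0.95)},
    "code":          {"temperature": (0.1, 0.4)},
}

def starting_config(task):
    """Return the midpoint of each range as a first value to test."""
    ranges = PRESETS[task]
    return {name: round((low + high) / 2, 3) for name, (low, high) in ranges.items()}

print(starting_config("summarization"))  # {'temperature': 0.4, 'top_p': 0.9}
```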

Diagnosing Sampling Failures in Production

When outputs degrade, most practitioners adjust temperature reflexively. Often the actual problem is elsewhere.

Outputs are repetitive but not looping: Usually a low-temperature artifact or an insufficient repetition penalty. Try raising temperature modestly before adding penalty.

Outputs are incoherent or introduce hallucinated entities: Temperature too high, or top-p too permissive given that temperature. Lower temperature before lowering top-p; they're not equivalent interventions.

Outputs are consistently mediocre and "safe": Classic mode collapse from very low temperature. The model is anchoring on statistically average continuations. Raise temperature to 0.6–0.8 even for "factual" tasks and see if output quality improves.

Outputs vary wildly between runs with the same prompt: Check whether your system is pinning the model version. Unexpected variation sometimes reflects infrastructure changes, not configuration drift. The tools you're using to interface with models matter here — different API clients handle parameter defaults differently.

Structured outputs (JSON, YAML) occasionally malform: Consider using constrained decoding if your runtime supports it (grammar-based sampling), rather than trying to solve this with temperature alone. Temperature affects all tokens equally; structured formats need token-level constraints that sampling parameters can't provide. The metrics that matter for generative AI outputs include format validity rates, which help quantify this failure mode.
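Before reaching for constrained decoding, it helps to know how often the format actually breaks. A minimal sketch of a format validity rate for JSON, using only the standard library; the sample strings are invented stand-ins for model outputs.

```python
import json

def json_validity_rate(outputs):
    """Return the fraction of outputs that parse as JSON (0.0 for an empty batch)."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass  # malformed output counts against the rate
    return valid / len(outputs)

samples = [
    '{"name": "Ada"}',   # valid
    '{"name": "Ada",}',  # trailing comma
    '[1, 2, 3]',         # valid
    '{"name": Ada}',     # unquoted string
]
print(json_validity_rate(samples))  # 0.5
```

Tracking this number across sampling configurations shows whether a temperature change is helping or whether the failures need token-level constraints instead.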

Frequently Asked Questions

Does temperature affect how "smart" the model is?

No. Temperature doesn't change the model's underlying knowledge or reasoning capacity — it only changes how the model samples from its learned distribution. A high-temperature output may appear more creative but is actually drawing from a broader, less confident set of continuations. Intelligence is fixed by training; temperature controls output diversity.

Can I set temperature above 2.0 and what happens?

Many hosted APIs cap temperature at 2.0, and some cap it lower, for good reason. Above roughly 1.5–1.8, distributions become so flat that the model is effectively sampling near-randomly from the vocabulary, producing outputs that are syntactically plausible (because the model still has positional and grammatical priors) but semantically chaotic. There are essentially no productive use cases for temperature above 2.0.

Why do different providers produce different outputs at the same temperature?

Providers vary in model architecture, tokenizer, quantization, system prompt defaults, and sometimes in how they implement temperature scaling. Temperature 0.7 on one provider's model is not the same as temperature 0.7 on another's. Always calibrate configurations empirically against the specific model and provider you're using, rather than transferring numbers directly. Reviewing how generative AI works at a foundational level helps clarify why these differences exist.

Should I always use both top-p and temperature together?

Not necessarily. Many practitioners use temperature as the primary dial and leave top-p at 1.0 (no truncation), relying on temperature alone to control diversity. Others use top-p at 0.9–0.95 and leave temperature at 1.0. The combination of both can produce refined control, but it also increases the interaction complexity. Start with one parameter, characterize its effect on your specific task, then layer the second if needed.

What's the relationship between sampling and prompt engineering?

They're complementary, not substitutes. A well-crafted prompt narrows the likely-completion space by establishing clear context; sampling parameters then control how the model explores whatever space remains. You can partially compensate for poor prompts with sampling configuration, but it's inefficient — better to invest in prompt clarity first, then tune sampling for the remaining variance.

Is there a way to get both consistency and creativity in the same output?

Yes, with careful prompting and moderate temperature (0.6–0.8). A useful pattern is to generate multiple completions at moderate temperature and select or blend the best, rather than trying to find a single temperature that does both. Some agentic workflows use a low-temperature "evaluator" pass to select among outputs generated at higher temperature, giving you the benefits of diversity in generation and consistency in selection.
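The generate-then-select pattern can be sketched independently of any provider. Here `generate` and `score` are placeholders you would replace with a real model call and a real evaluator (for example, a low-temperature grading pass); the stand-ins below exist only so the sketch runs.

```python
import random

def best_of_n(generate, score, prompt, n, temperature):
    """Sample n candidates at the given temperature and return the highest scoring one."""
    candidates = [generate(prompt, temperature) for _ in range(n)]
    return max(candidates, key=score)

# Stand-ins for a model call and an evaluator, purely for illustration.
rng = random.Random(0)
PHRASES = ["fast", "blazingly quick", "quick off the mark", "rapid"]

def fake_generate(prompt, temperature):
    return rng.choice(PHRASES)

def fake_score(text):
    return len(text.split())  # toy heuristic: prefer longer phrases

print(best_of_n(fake_generate, fake_score, "Describe the product", n=8, temperature=0.8))
```

Diversity comes from the sampling step and consistency from the selection step, so neither has to be solved by a single temperature value.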

Key Takeaways

Temperature reshapes logits before softmax; it doesn't add noise, it alters the probability distribution itself.
Temperature 0 is deterministic per model version, not universally reproducible — pin model versions in production.
Top-p and temperature interact: raising one effectively changes what the other does. Treat them as a system, not independent controls.
Top-k is context-blind; use it as a safety ceiling or in well-characterized distributions, not as a primary control.
Min-p, available in open-source runtimes, scales the nucleus relative to the peak probability and can hold up better than top-p at high temperatures.
Repetition penalty and frequency penalty operate on logits similarly to temperature but along the history dimension — they're not interchangeable, and both interact with temperature in production.
Task type should drive your configuration baseline; no single setting works across factual, structured, and creative use cases.
When diagnosing degraded outputs, resist adjusting temperature reflexively — identify whether the failure is a diversity, coherence, or repetition problem first.


Agency Script Editorial
