Set Temperature Wrong and Your Bot Invents Refund Policies

Temperature is one of those controls that looks deceptively simple — a slider from 0 to 2, a number in a config file — and gets misused constantly. Set it wrong and a customer-service bot hallucinates refund policies, or a creative writing tool produces the same metaphor seventeen times. Most guides tell you what temperature is. This one shows you what it does in practice, with specific scenarios, the settings that worked, and the settings that broke things.

The core mechanic: when a language model generates text, it converts its internal scores for possible next tokens into a probability distribution. Temperature reshapes that distribution before sampling happens. A low temperature sharpens the distribution — the highest-probability tokens dominate. A high temperature flattens it — lower-probability tokens get more of a chance. Sampling methods like top-p (nucleus sampling) and top-k layer on top of this, controlling which tokens are even eligible before one is chosen. Understanding them together is what separates operators who get consistent results from operators who wonder why the model seems "random."

The following scenarios are drawn from common deployment patterns across agency, enterprise, and product contexts. Each one walks through the decision, the setting, the outcome, and what adjusting the wrong parameter actually looked like in production.

Why Temperature Alone Isn't the Full Picture

Before the scenarios, one concept that's often glossed over: temperature and sampling aren't synonyms, and combining them thoughtlessly causes problems.

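Temperature divides the logits (raw scores) by a constant before a softmax function converts them to probabilities. At temperature 0, most implementations switch to greedy decoding and always pick the highest-scoring token, so output is deterministic in principle (hosted APIs can still show small run-to-run differences). At temperature 2, the distribution is flat enough that unlikely tokens are sampled often and output tends to drift into incoherence.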
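As a minimal sketch of the scaling step, assuming a toy vocabulary of four tokens with made-up logits (the function name and the numbers are illustrative, not taken from any particular library):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Division by zero is undefined, so treat 0 as greedy decoding:
        # all probability goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens.
logits = [4.0, 2.0, 1.0, 0.5]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The top token's share shrinks as the temperature rises. How flat the result gets at any given setting depends on how far apart the logits were to begin with.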

Top-k sampling limits the model to choosing from only the k highest-probability tokens at each step. Set k to 50, and it ignores everything ranked 51st and below.
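A rough illustration of that cutoff, assuming the probabilities have already been computed (the helper name and the example distribution are hypothetical):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices of the k largest probabilities.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

# Hypothetical next-token probabilities for a five-token vocabulary.
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print(top_k_filter(probs, 2))
```

The pool size is fixed at k no matter how confident or uncertain the model is at that step.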

Top-p (nucleus) sampling is adaptive: it builds a cumulative probability pool and stops adding tokens once the pool hits the threshold (say, 0.9). On a confident step where one token has 95% probability, the nucleus might contain just that token. On an uncertain step, it might contain 200 tokens.
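The adaptive pool can be sketched the same way. The two example distributions below are invented to show a confident step and an uncertain one, and the helper names are assumptions:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in kept else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]

def nucleus_size(probs, p):
    """Count how many tokens survive the top-p cutoff."""
    return sum(1 for q in top_p_filter(probs, p) if q > 0)

confident = [0.95, 0.02, 0.01, 0.01, 0.01]      # one dominant token
uncertain = [0.3, 0.25, 0.2, 0.1, 0.1, 0.05]    # mass spread across many tokens
print(nucleus_size(confident, 0.9))
print(nucleus_size(uncertain, 0.9))
```

With the same threshold, the confident step keeps a single token while the uncertain step keeps most of the vocabulary.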

Temperature + top-p together is the most common production pairing. Temperature reshapes the distribution; top-p then decides how much of that distribution to sample from. Getting both right is what the following scenarios are actually about.
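Put together, one sampling step might look like the sketch below. It assumes temperature is applied before the top-p cutoff, which is a common ordering but not guaranteed for every provider; the function name, default values, and logits are illustrative.

```python
import math
import random

def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Pick one token index: temperature first, then top-p, then a random draw."""
    if temperature <= 0:
        # Greedy decoding: no randomness, top-p is irrelevant.
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1. Temperature reshapes the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Top-p decides how much of that distribution is eligible.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # 3. Draw from the surviving tokens, weighted by probability.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the demo is repeatable
logits = [3.0, 2.5, 1.0, 0.2, -1.0]  # hypothetical scores for five tokens
draws = [sample_token(logits, 1.0, 0.9, rng) for _ in range(1000)]
print(sorted(set(draws)))
```

The two lowest-scoring tokens never appear in the draws here, because the top-p cutoff removes them before sampling.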

For a deeper grounding in how these mechanisms sit inside the full generation pipeline, see The Complete Guide to How Generative AI Works.

Scenario 1: Legal Document Summarization Gone Wrong

A mid-size law firm deployed a summarization tool for contract review. Their first configuration: temperature 1.0, top-p 0.95 — roughly the API defaults for a general-purpose assistant.

The outputs were fluent but unreliable. Dates were paraphrased into vague ranges. Specific liability caps were occasionally omitted or softened. In one case, an indemnification clause was summarized with a synonym that changed the legal implication entirely.

What the fix looked like

The team dropped temperature to 0.1 and top-p to 0.7. At those settings, the model anchored tightly to high-probability continuations — which, for a summarization task with clear source material, meant it stayed closer to the original phrasing. Factual accuracy on a 50-document test set improved measurably. The summaries became more repetitive in structure, but for this use case that was a feature, not a bug.

The lesson: For extraction and summarization tasks where fidelity to source material matters, low temperature (0.1–0.3) is almost always right. The model isn't being asked to create — it's being asked to compress and reproduce. Creativity is a liability.

Scenario 2: Brand Tagline Generation That Produced Nothing Usable

A marketing agency ran a tagline brainstorming session using temperature 0.3 and top-k 20. The client was a sustainable packaging company. The model produced variations on "packaging for a better tomorrow" across 40 outputs. Nearly identical sentence structures, different adjectives, no memorable line in the batch.

What was actually happening

The low temperature collapsed the distribution so aggressively that the model was essentially choosing from a tiny set of dominant tokens at every step. Top-k at 20 made it worse — 20 tokens is very restrictive on a vocabulary of 50,000+. The model had no access to the weirder, less-probable language that makes a tagline land.

The adjustment

Temperature raised to 1.2, top-k removed entirely, top-p set to 0.95. The next batch included lines that were genuinely novel — some unusable, several strong. The team learned to treat it as a divergent-thinking tool: run 60 outputs at high temperature, then manually filter down to 5 candidates.

The lesson: For ideation and creative generation, temperature 1.0–1.4 with a permissive top-p (0.9–0.95) creates real range. Expect noise. Build a filtering step into the workflow rather than trying to get the model to self-filter at low temperature — that's the wrong tool for the job.
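The generate-then-filter workflow could be wired up roughly as follows. The `generate` callable stands in for whatever client call your provider exposes; its signature is an assumption, and the stub and its canned taglines are placeholders so the sketch runs on its own.

```python
import itertools

def brainstorm(generate, prompt, n=60, temperature=1.2, top_p=0.95):
    """Collect n high-temperature candidates and drop exact repeats before human review."""
    seen, candidates = set(), []
    for _ in range(n):
        text = generate(prompt, temperature=temperature, top_p=top_p).strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            candidates.append(text)
    return candidates

# Placeholder generator: replace with a real model call.
_canned = itertools.cycle([
    "Wrap less, waste less.",
    "wrap less, waste less.",
    "Boxes that go back to earth.",
    "",
])

def fake_generate(prompt, temperature, top_p):
    return next(_canned)

shortlist = brainstorm(fake_generate, "Taglines for a sustainable packaging company", n=8)
print(shortlist)
```

Deduplication only removes exact repeats. Judging which lines are strong remains a manual step.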

Scenario 3: Customer Support Chatbot Giving Inconsistent Policy Answers

A SaaS company deployed a support bot trained on their help center documentation. Temperature was set at 0.7 — a "moderate creativity" setting that felt safe. Users asking the same question at different times got meaningfully different answers about refund windows and cancellation terms.

The failure mode

At temperature 0.7, there's still enough distribution flattening to invite variation, especially when the model has multiple plausible phrasings available. "30-day refund window" and "refund within 30 days" might both be high-probability, but so might "returns accepted up to a month" — a paraphrase that isn't technically wrong but doesn't match official policy language and erodes trust.

The fix

Temperature dropped to 0.0–0.2. The model became almost deterministic. Same question in, same phrasing out. The team also implemented a retrieval layer that injected the exact policy text before each query, which further anchored responses. Building a Repeatable Workflow for Large Language Models covers how retrieval augmentation and parameter choices interact in exactly this kind of deployment.

The lesson: Any application where consistency is a trust signal — support bots, policy explainers, compliance tools — should run at near-zero temperature. If the answer should be the same every time, the settings should make that inevitable.
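One way to verify that consistency before launch is to ask the same question many times and count the distinct answers. This is a sketch: `generate` is a hypothetical wrapper around your model call, and the stub below returns a fixed string only so the example runs.

```python
from collections import Counter

def consistency_report(generate, prompt, runs=20, temperature=0.0):
    """Ask the same question repeatedly and summarize how much the answers vary."""
    answers = Counter(
        generate(prompt, temperature=temperature).strip() for _ in range(runs)
    )
    top_answer, top_count = answers.most_common(1)[0]
    return {
        "distinct": len(answers),          # 1 means every run matched
        "agreement": top_count / runs,     # share of runs giving the top answer
        "top_answer": top_answer,
    }

# Placeholder generator: replace with a real model call.
def fake_generate(prompt, temperature):
    return "Refunds are available within 30 days of purchase."

report = consistency_report(fake_generate, "What is the refund window?")
print(report["distinct"], report["agreement"])
```

Running the same check at the old and new temperature gives a concrete before-and-after number instead of an impression.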

Scenario 4: Code Generation at Scale

A software consultancy used an LLM to generate boilerplate code — unit test scaffolding, API endpoint stubs, configuration file templates. Initial setting: temperature 0.8.

The outputs were correct roughly 70% of the time. The failures weren't catastrophic errors; they were subtle: variable names that didn't follow the established codebase convention, a test assertion written in a less common pattern, an import statement pulling from a deprecated library path.

Dialing in the right range

For code generation, the right temperature sits between 0.1 and 0.7 depending on task type:

Boilerplate and scaffolding: 0.1–0.2. The correct answer is well-defined. Variation is noise.
Algorithm implementation: 0.2–0.4. There may be more than one valid approach, but the model shouldn't wander.
Exploratory refactoring suggestions: 0.5–0.7. Here, some creative range is useful — you want to see alternative patterns.

The consultancy settled on 0.2 for production generation and 0.6 for a separate "suggest improvements" mode. Error rate on the primary task dropped substantially.

The lesson: Code is not one task. Different subtasks within coding have different optimal temperature ranges. Treat them as separate configurations.
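Separate configurations can be as simple as a named lookup. In this sketch the temperatures follow the ranges above, while the profile names and the top-p values are assumptions added for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingConfig:
    temperature: float
    top_p: float

# One profile per coding subtask. top_p values are illustrative assumptions.
CODE_PROFILES = {
    "boilerplate": SamplingConfig(temperature=0.2, top_p=0.9),
    "algorithm": SamplingConfig(temperature=0.3, top_p=0.9),
    "suggest_improvements": SamplingConfig(temperature=0.6, top_p=0.95),
}

def config_for(task):
    """Return the sampling profile for a subtask, failing loudly if none is defined."""
    try:
        return CODE_PROFILES[task]
    except KeyError:
        raise ValueError(f"no sampling profile defined for task {task!r}") from None

print(config_for("boilerplate"))
print(config_for("suggest_improvements"))
```

Failing on an unknown task name, rather than falling back to a default, keeps a new subtask from silently inheriting settings chosen for a different one.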

Scenario 5: Structured Data Extraction from Unstructured Text

A research team was pulling structured fields — company name, funding round, investor names, deal date — from unstructured press releases and news articles. Temperature 0.5. The extraction was inconsistent: some runs included extra fields, some missed the investor list, some returned dates in different formats.

At temperature 0.5, the model had enough flexibility to decide between output formats. "Series A" vs. "Series A round" vs. "A round" all appeared. Dates came back as "March 2024," "03/2024," and "Q1 2024" from the same time period.

The configuration that stabilized it

Temperature to 0.0. Top-p to 0.7 (largely redundant at temperature 0, where the top token is always chosen, but harmless). A strict output schema in the system prompt with explicit format instructions. The combination made the model's extraction behavior nearly deterministic and format-consistent across hundreds of documents.

The lesson: Structured extraction is arguably the highest-stakes use case for low temperature. Any downstream system consuming the output — a database, a spreadsheet, an API — will break on format variance. Treat temperature 0 as the default for structured output tasks.
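Low temperature reduces format variance but does not guarantee it, so a validation step in front of the downstream system is a sensible companion. The sketch below assumes the model is asked to return JSON with four fields and a year-month date; the field names and the date format are illustrative choices, not the research team's actual schema.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("company_name", "funding_round", "investors", "deal_date")

def validate_extraction(raw):
    """Parse model output and reject anything that deviates from the expected schema."""
    record = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    extra = [f for f in record if f not in REQUIRED_FIELDS]
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra}")
    if not isinstance(record["investors"], list):
        raise ValueError("investors must be a list")
    if not isinstance(record["deal_date"], str):
        raise ValueError("deal_date must be a string")
    # Enforce a single date format (YYYY-MM); "March 2024" or "Q1 2024" is rejected.
    datetime.strptime(record["deal_date"], "%Y-%m")
    return record

good = ('{"company_name": "Acme", "funding_round": "Series A", '
        '"investors": ["Example Capital"], "deal_date": "2024-03"}')
print(validate_extraction(good)["deal_date"])
```

Records that fail validation can be retried or routed to review instead of reaching the database in an inconsistent format.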

When to Push Temperature Higher

The scenarios above skew low, because most production failures come from temperature being too high for the task type. But there are legitimate high-temperature use cases:

Story and narrative generation: 1.0–1.4. You want the model to surprise you. Predictable story beats are a failure mode.
Humor and playful copy: 1.0–1.3. Comedy often lives in unexpected word choice.
Persona-driven roleplay or interactive fiction: 1.0–1.5. Character voice requires range.
Brainstorming raw ideas (pre-filtered): 1.2–1.5. Treat the output as raw material, not finished work.

At these levels, build explicit human review into the process. High temperature is a divergence tool, not a quality tool. See The Large Language Models Playbook for how to structure review loops around high-variance generation tasks.

The top-p and top-k Decision Tree

A practical heuristic for choosing between sampling methods:

Use top-p (0.9–0.95) for most tasks. It adapts to the model's confidence at each step, which is almost always what you want.
Use top-k when you want a hard ceiling on vocabulary breadth — for example, domain-specific outputs where you know the model should be choosing from a limited set.
Avoid combining very low top-k with high temperature: the high temperature evens out the odds among the few surviving tokens, so the model picks almost arbitrarily from a narrow pool and the output tends to lose coherence.
At temperature 0, sampling method is moot — the distribution collapses to the argmax, and top-p or top-k have no practical effect.
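The last point can be checked numerically. This sketch counts how many tokens survive a top-p cutoff as the temperature falls toward 0, using made-up logits. It requires a temperature above 0, since dividing by exactly 0 is undefined.

```python
import math

def nucleus_size_at(logits, temperature, top_p):
    """Number of tokens left after temperature scaling and a top-p cutoff."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    cumulative = 0.0
    for count, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)

logits = [3.0, 2.5, 1.0, 0.2, -1.0]  # hypothetical scores for five tokens
for t in (1.5, 1.0, 0.5, 0.05):
    print(t, nucleus_size_at(logits, t, top_p=0.9))
```

As the temperature approaches 0, the top token absorbs nearly all the probability mass and the nucleus shrinks to a single token, at which point the top-p value no longer changes the outcome.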

The direction the field is moving — toward more reasoning-capable models with internal sampling during chain-of-thought — adds complexity to these defaults. The Future of Large Language Models covers how sampling strategies are evolving alongside model architecture.

Frequently Asked Questions

What is the best temperature setting for most use cases?

There isn't a universal best — task type determines the right range. For factual retrieval, summarization, extraction, and anything policy-related, 0.0–0.3 is the right zone. For creative generation and brainstorming, 1.0–1.4 produces meaningful range. If you're genuinely unsure, start at 0.3 and increase only if outputs are too repetitive.

What's the difference between top-p and top-k sampling?

Top-k restricts the model to the k highest-probability tokens at each generation step, regardless of how concentrated or spread out the distribution is. Top-p (nucleus sampling) builds a pool of tokens that cumulatively account for a fraction p of the probability mass (90% when p is 0.9), so the pool size adapts to the model's confidence. Top-p tends to behave more consistently across different prompt types.

Does temperature affect how factual the model is?

Temperature doesn't directly change what the model "knows," but it affects which tokens it's willing to consider at each step. Higher temperatures give lower-probability continuations more of a chance — and in many cases, false or hallucinated content is lower-probability than accurate content. So high temperature can indirectly increase hallucination risk on factual tasks.

Can I use temperature to make the model more or less verbose?

Indirectly, yes. Lower temperatures tend to produce more predictable, often more concise outputs because the model chooses the most probable continuation at each step. But length is better controlled through explicit instructions in the system prompt or through max token limits. Using temperature as a proxy for length control is imprecise and can create other problems.

Should I change temperature between development and production?

Often, yes. During development you may want higher temperature to explore the range of possible outputs and understand failure modes. In production, most deployed systems benefit from lower temperature to ensure consistent, predictable behavior. Document your production settings and treat them as a configuration decision, not a one-time setup.

What happens if I set temperature above 2?

Many APIs cap temperature at 2.0, and some cap it lower, so a higher value is typically rejected or clamped. At 2.0 the distribution is far flatter than at 1.0: low-probability tokens are sampled often, and outputs become largely incoherent. There's no practical use case for temperature above 1.5 in most professional applications. Think of 1.5 as the practical ceiling for high-creativity tasks, with everything above that being noise.

Key Takeaways

Temperature reshapes the token probability distribution; top-p and top-k determine which tokens are eligible for sampling. They work together, not in isolation.
Low temperature (0.0–0.3) is right for extraction, summarization, policy responses, code generation, and any task where consistency is a quality signal.
High temperature (1.0–1.4) is right for ideation, creative writing, and brainstorming — but requires a human filtering step downstream.
Top-p (0.9–0.95) is the more adaptive choice for most tasks. Top-k makes sense when you want a hard vocabulary ceiling.
The most common production failure is running creative-default settings (temperature 0.7–1.0) on precision tasks. Default to lower and raise deliberately.
Different subtasks within the same project — say, code generation vs. refactoring suggestions — should have different configurations. Don't use one setting for everything.
Build your workflow so temperature is a documented, revisable decision — not an afterthought buried in a config file.

Agency Script Editorial

