Case Study: Model Temperature and Sampling in Practice

A content strategist at a mid-size digital agency gets a new client: a regional hospital network that needs two very different AI writing tools. One tool generates patient-facing FAQ answers — clear, accurate, legally defensible. The other generates campaign concepts for a fundraising gala — surprising, emotionally resonant, genuinely creative. The underlying model is the same GPT-4-class API. The budget is the same. The deadline is the same. The only lever that changes everything between those two outputs is temperature and sampling configuration.

This is not an abstract concept. Temperature and sampling parameters are the difference between a model that sounds like a liability and one that sounds like a creative director. Most practitioners either leave them at defaults — usually 1.0 — or adjust them by feel until something looks right. Neither approach scales. What follows is a case study structured around a real decision process: the situation, the reasoning, the execution, and the measurable outcomes. If you want to understand how these parameters work before diving into the applied layer, The Complete Guide to How Generative AI Works is the right starting point.

The Situation: Two Use Cases, One Model

The agency — call it Meridian Creative — had licensed API access to a frontier model and was building both tools in parallel. The patient FAQ tool needed to pull from a retrieval-augmented knowledge base of hospital policy documents and return consistent, predictable answers. The campaign ideation tool needed to surprise the creative team with angles they hadn't already considered. Running both at default settings produced mediocre results in both directions: the FAQ tool occasionally hallucinated medication dosage ranges, and the ideation tool kept returning the same three concepts in slightly different orders.

The account lead escalated to their AI implementation specialist, who recognized this as a temperature and sampling problem before it became a prompt engineering rabbit hole.

What Temperature and Sampling Actually Control

Temperature is a scalar applied to the model's output logits — the raw numerical scores assigned to every possible next token — before the probability distribution is calculated. At temperature 0, the model always picks the highest-probability token, producing deterministic, often repetitive output. At temperature 2.0, the distribution flattens dramatically, giving low-probability tokens a much larger share of selection weight. The practical range most teams work in is 0.0 to 1.5; beyond 1.5, outputs frequently become incoherent.

Top-p sampling (nucleus sampling) works differently. Rather than adjusting the whole distribution, it defines a probability mass threshold — say, 0.9 — and samples only from the smallest set of tokens whose cumulative probability reaches that threshold. At any given moment, that might be 10 tokens or 200, depending on how confident the model is. Top-k sampling is a blunter version: it simply caps the candidate pool at k tokens regardless of probability mass.

Why These Parameters Interact

Temperature and top-p are usually applied together, which creates confusion. A common error is maxing out both simultaneously. The two parameters compound: high temperature flattens the distribution, and a high top-p value then samples from that already-flat pool, producing near-random outputs. The standard guidance from most API providers is to tune one parameter at a time and leave the other near its default. For most practitioners, the recommended pairing is:

Precision tasks: temperature 0.0–0.3, top-p 0.9–1.0
Balanced tasks: temperature 0.7, top-p 0.9
Creative tasks: temperature 0.9–1.2, top-p 0.95

Frequency penalty and presence penalty are secondary levers. Frequency penalty reduces the probability of repeating tokens that have already appeared, proportional to how often they've appeared. Presence penalty applies a fixed discount to any token that has appeared at all. For the FAQ use case, low frequency penalty keeps answers terse and avoids circular phrasing. For ideation, a moderate presence penalty pushes the model toward less-repeated concepts across a generation batch.

The Decision Process: Matching Parameters to Task Anatomy

Meridian's implementation specialist built a decision framework before touching any sliders. The core questions:

What is the cost of an unexpected output? For patient health information, unexpected = dangerous. For fundraising concepts, unexpected = valuable.
Does this task have a demonstrably correct answer? FAQ answers do; campaign slogans do not.
Will outputs be human-reviewed before use? Both tools had review steps, but the FAQ tool had a clinician review and the campaign tool had a creative director review — different error tolerances.
Is variety across a batch of outputs desirable or problematic? The FAQ tool needed consistency across identical queries. The ideation tool needed differentiation across a single generation run.

This kind of structured pre-configuration thinking is central to Building a Repeatable Workflow for Large Language Models, and Meridian had already adopted a version of that process. The temperature and sampling decision became a documented parameter block, not an ad hoc setting.

Execution: The FAQ Tool

Configuration

Temperature: 0.1
Top-p: 1.0 (effectively disabled as a constraint given the near-zero temperature)
Frequency penalty: 0.2
Presence penalty: 0.0
Max tokens: 280 (to force concision)

At temperature 0.1, the model reliably selected high-probability completions. The slight non-zero value (rather than exactly 0.0) was intentional: fully deterministic generation occasionally produces stilted, robotic phrasing. A value of 0.1 preserves natural sentence rhythm while keeping the model firmly on-distribution.

The Failure Mode It Solved

Before this configuration, the team had been running at temperature 0.8. The model was drawing on general training knowledge rather than staying anchored to the retrieved hospital documents. For questions like "What is the typical recovery time after a knee replacement?" the model was producing plausible-sounding ranges — 4 to 6 weeks for most activities — that contradicted the hospital's specific post-operative protocols. At temperature 0.1, combined with an explicit retrieval prompt structure, the model stayed within the retrieved context more reliably.

The clinician reviewer reported that flagged factual deviations dropped from roughly 1 in 8 outputs to 1 in 40 over a two-week review period — a significant operational improvement, though not a guarantee of zero error.

Remaining Limitation

Low temperature does not prevent hallucination. It reduces the probability that the model drifts toward lower-probability completions, but if the high-probability completion is itself wrong or the retrieval step surfaces bad context, temperature cannot fix that. The team accepted this as a known constraint and kept the clinician review gate in place.

Execution: The Campaign Ideation Tool

Configuration

Temperature: 1.1
Top-p: 0.95
Frequency penalty: 0.0
Presence penalty: 0.4
Max tokens: 600
n (parallel completions): 5

The presence penalty of 0.4 was the key lever here, not the temperature alone. The team wanted five distinct concept directions per run, not five variations on the same theme. Presence penalty discouraged the model from reusing the same thematic anchors — "community," "healing," "together" — across the batch. Temperature 1.1 kept the distribution broad enough to surface lower-probability creative angles without tipping into incoherence.

What Changed in Practice

Before optimization, the creative director's feedback on AI-generated concepts was consistent: "These all feel like the same idea." The team ran an informal evaluation — presenting 20 concept sets (each set of 5 generated in one API call) to two senior copywriters without telling them which were pre- or post-optimization. The post-optimization sets were rated as "meaningfully distinct across the set" in roughly 14 of 20 cases versus 4 of 20 pre-optimization. This is not a controlled study; it is a practitioner's signal-to-noise check. It was enough to confirm direction.

The tool also started surfacing unusual structural angles — a campaign concept framed as a letter from a future patient, another structured around a countdown — that the creative team actually used in client presentations.

The Cross-Cutting Lesson: Parameters Serve Task Anatomy, Not Vice Versa

The biggest mistake Meridian had made initially was treating temperature as a quality dial: higher for "better," lower for "safer." Temperature is not a quality control. It is a probability distribution shaping tool. The right configuration depends entirely on the shape of the task.

Understanding this sits at the foundation of how generative models work. If you want the underlying mechanics explained from first principles, How Generative AI Works: A Beginner's Guide covers the token prediction loop that makes temperature meaningful, and A Step-by-Step Approach to How Generative AI Works walks through the generation pipeline in more operational detail.

A few durable principles the case surfaces:

Document your parameter choices. If you can't articulate why you set temperature to 0.7 rather than 0.5, you can't reproduce it, debug it, or train a team member to maintain it.
Test at the extremes before settling in the middle. Run the same prompt at 0.0, 0.5, 1.0, and 1.5. The distribution of outputs across those four settings tells you more about the task's sensitivity than any rule of thumb.
Presence penalty is underused. Most practitioners adjust temperature and forget presence/frequency penalty entirely. For batch generation tasks — ideation, content variation, brainstorming — presence penalty is often the more important lever.
Low temperature is not a trust substitute. Deterministic outputs still require human review for high-stakes applications. Temperature reduces a class of errors; it does not eliminate the need for judgment.

The trajectory of where model APIs are heading — more fine-grained control, structured outputs, stronger grounding — is worth tracking. The Future of Large Language Models covers some of that direction and helps contextualize why parameter literacy will remain relevant even as tooling evolves.

Frequently Asked Questions

What is the best temperature setting for most business writing tasks?

For most professional writing — summaries, emails, reports, structured documents — a temperature between 0.5 and 0.7 with top-p at 0.9 performs well. This range produces fluent, varied prose without the distributional chaos of higher settings. Start at 0.7 and adjust based on whether outputs feel repetitive (raise it) or unpredictable (lower it).

Can I use temperature 0 for factual accuracy?

Temperature 0 produces deterministic output, but deterministic does not mean accurate. The model will consistently return whatever its highest-probability completion is, which may still be factually wrong. For high-stakes factual applications, combine low temperature with retrieval-augmented generation and a human review step.

How is top-p different from top-k, and which should I use?

Top-p samples from a dynamic pool defined by cumulative probability mass; top-k samples from a fixed number of candidates. Top-p adapts to the model's confidence at each token, making it more context-sensitive. Most practitioners default to top-p over top-k for that reason, and most major APIs default to top-p sampling.

Does temperature affect API cost?

Temperature itself has no effect on token consumption or cost. Running multiple completions in parallel (the n parameter) does multiply your token cost proportionally. Five parallel completions at temperature 1.1 costs five times as much as one completion — the trade-off for variety in batch generation workflows.

What does presence penalty actually do, and when should I use it?

Presence penalty applies a flat discount to any token that has already appeared in the output, discouraging the model from reusing words or concepts it has already introduced. Use it when you want a single generation run to produce thematically diverse outputs — ideation, brainstorming, multi-angle drafts. Keep it at 0 when consistency and repetition of key terms is desirable, as in structured reports or FAQ answers.

Should I tune temperature in the system prompt or the API parameters?

Both methods work, but API parameters are more reliable and auditable. Some prompt-level instructions ("be creative," "vary your responses") influence sampling behavior implicitly, but they compete with other prompt content and are hard to version-control. Setting temperature in the API call is explicit, reproducible, and easier to log for performance tracking.

Key Takeaways

Temperature shapes the probability distribution over next tokens; it is not a quality dial or a creativity switch.
Top-p and temperature interact — tune one at a time, and avoid maxing both simultaneously.
Precision tasks (legal, medical, structured data) typically perform best at temperature 0.1–0.3; creative and ideation tasks at 0.9–1.2.
Presence penalty is the underused lever for batch diversity — critical for ideation tools, harmful for consistency-requiring tools.
Low temperature reduces probabilistic drift but does not eliminate hallucination or replace human review.
Document every parameter choice with a rationale; undocumented configurations cannot be debugged, scaled, or handed off.
Test at the extremes (0.0, 0.5, 1.0, 1.5) before settling on a final configuration — the output spread reveals task sensitivity better than intuition.

The Situation: Two Use Cases, One Model

The account lead escalated to their AI implementation specialist, who recognized this as a temperature and sampling problem before it became a prompt engineering rabbit hole.

What Temperature and Sampling Actually Control

Why These Parameters Interact

Precision tasks: temperature 0.0–0.3, top-p 0.9–1.0
Balanced tasks: temperature 0.7, top-p 0.9
Creative tasks: temperature 0.9–1.2, top-p 0.95

The Decision Process: Matching Parameters to Task Anatomy

Meridian's implementation specialist built a decision framework before touching any sliders. The core questions:

What is the cost of an unexpected output? For patient health information, unexpected = dangerous. For fundraising concepts, unexpected = valuable.
Does this task have a demonstrably correct answer? FAQ answers do; campaign slogans do not.
Will outputs be human-reviewed before use? Both tools had review steps, but the FAQ tool had a clinician review and the campaign tool had a creative director review — different error tolerances.
Is variety across a batch of outputs desirable or problematic? The FAQ tool needed consistency across identical queries. The ideation tool needed differentiation across a single generation run.

Execution: The FAQ Tool

Configuration

Temperature: 0.1
Top-p: 1.0 (effectively disabled as a constraint given the near-zero temperature)
Frequency penalty: 0.2
Presence penalty: 0.0
Max tokens: 280 (to force concision)

The Failure Mode It Solved

Remaining Limitation

Execution: The Campaign Ideation Tool

Configuration

Temperature: 1.1
Top-p: 0.95
Frequency penalty: 0.0
Presence penalty: 0.4
Max tokens: 600
n (parallel completions): 5

What Changed in Practice

The Cross-Cutting Lesson: Parameters Serve Task Anatomy, Not Vice Versa

A few durable principles the case surfaces:

Document your parameter choices. If you can't articulate why you set temperature to 0.7 rather than 0.5, you can't reproduce it, debug it, or train a team member to maintain it.
Test at the extremes before settling in the middle. Run the same prompt at 0.0, 0.5, 1.0, and 1.5. The distribution of outputs across those four settings tells you more about the task's sensitivity than any rule of thumb.
Presence penalty is underused. Most practitioners adjust temperature and forget presence/frequency penalty entirely. For batch generation tasks — ideation, content variation, brainstorming — presence penalty is often the more important lever.
Low temperature is not a trust substitute. Deterministic outputs still require human review for high-stakes applications. Temperature reduces a class of errors; it does not eliminate the need for judgment.

Frequently Asked Questions

What is the best temperature setting for most business writing tasks?

Can I use temperature 0 for factual accuracy?

How is top-p different from top-k, and which should I use?

Does temperature affect API cost?

What does presence penalty actually do, and when should I use it?

Should I tune temperature in the system prompt or the API parameters?

Key Takeaways

Temperature shapes the probability distribution over next tokens; it is not a quality dial or a creativity switch.
Top-p and temperature interact — tune one at a time, and avoid maxing both simultaneously.
Precision tasks (legal, medical, structured data) typically perform best at temperature 0.1–0.3; creative and ideation tasks at 0.9–1.2.
Presence penalty is the underused lever for batch diversity — critical for ideation tools, harmful for consistency-requiring tools.
Low temperature reduces probabilistic drift but does not eliminate hallucination or replace human review.
Document every parameter choice with a rationale; undocumented configurations cannot be debugged, scaled, or handed off.
Test at the extremes (0.0, 0.5, 1.0, 1.5) before settling on a final configuration — the output spread reveals task sensitivity better than intuition.

Case Study: Model Temperature and Sampling in Practice

The Situation: Two Use Cases, One Model

What Temperature and Sampling Actually Control

Why These Parameters Interact

The Decision Process: Matching Parameters to Task Anatomy

Execution: The FAQ Tool

Configuration

The Failure Mode It Solved

Remaining Limitation

Execution: The Campaign Ideation Tool

Configuration

What Changed in Practice

The Cross-Cutting Lesson: Parameters Serve Task Anatomy, Not Vice Versa

Frequently Asked Questions

What is the best temperature setting for most business writing tasks?

Can I use temperature 0 for factual accuracy?

How is top-p different from top-k, and which should I use?

Does temperature affect API cost?

What does presence penalty actually do, and when should I use it?

Should I tune temperature in the system prompt or the API parameters?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

Case Study: Model Temperature and Sampling in Practice

The Situation: Two Use Cases, One Model

What Temperature and Sampling Actually Control

Why These Parameters Interact

The Decision Process: Matching Parameters to Task Anatomy

Execution: The FAQ Tool

Configuration

The Failure Mode It Solved

Remaining Limitation

Execution: The Campaign Ideation Tool

Configuration

What Changed in Practice

The Cross-Cutting Lesson: Parameters Serve Task Anatomy, Not Vice Versa

Frequently Asked Questions

What is the best temperature setting for most business writing tasks?

Can I use temperature 0 for factual accuracy?

How is top-p different from top-k, and which should I use?

Does temperature affect API cost?

What does presence penalty actually do, and when should I use it?

Should I tune temperature in the system prompt or the API parameters?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?