A 200K Window Behaves Nothing Like It Does at Token 5,000

Most people who claim to "understand foundation models" understand the marketing. They can explain that a model is pretrained on a large corpus and then adapted to tasks. That is the tutorial. The practitioner's reality is messier: the same model that aces your hand-picked test cases falls apart on the long tail, context windows that advertise 200K tokens behave very differently at token 5,000 versus token 150,000, and the difference between a system that holds up in production and one that quietly degrades comes down to choices nobody covers in the intro material.

This article is for the people past the basics. We will skip the definitions and go straight to the things that actually bite: attention behavior at scale, the gap between offline evals and lived performance, sampling parameters most teams never touch, and the architectural decisions that determine whether your application is robust or brittle. The goal is depth, not breadth.

Why "Lost in the Middle" Is Still Your Problem

The single most expensive misconception among intermediate users is that a large context window means the model reads everything equally. It does not. Long-context models show a well-documented positional bias: information near the start and end of the context tends to be recalled more reliably than information buried in the middle, and the size of the effect varies by model and task. Push a critical instruction into the center of a 100K-token prompt and you should expect it to be missed on some fraction of calls.

This changes how you structure prompts. Practical consequences:

Put the instruction you care about most at the very top, and restate the non-negotiable constraints at the very bottom.
Treat the middle of a long context as a place for reference material the model can consult, not for commands it must obey.
When you retrieve documents, order matters. Ranking the most relevant chunk into the first or last slot beats dumping ten chunks in arbitrary order.

If you are building retrieval pipelines, this is the difference between a system that answers correctly and one that hallucinates because the right passage was technically present but functionally ignored. The fundamentals are covered in The Complete Guide to Foundation Models; what we are adding here is that "it's in the context" and "the model used it" are two different claims.
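The placement rules above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed layout: the function names `order_for_edges` and `build_prompt`, the `[doc N]` labels, and the wording of the reminder are all assumptions made for the example. It assumes the retrieved chunks arrive sorted most relevant first.

```python
from typing import List


def order_for_edges(chunks_by_relevance: List[str]) -> List[str]:
    """Put the most relevant chunks at the edges and the least relevant in the middle.

    Input is sorted most relevant first. Even-ranked chunks fill the front in
    order; odd-ranked chunks fill the back in reverse, so rank 0 lands in the
    first slot and rank 1 in the last slot.
    """
    front = chunks_by_relevance[0::2]
    back = chunks_by_relevance[1::2]
    return front + back[::-1]


def build_prompt(instruction: str, constraints: List[str],
                 chunks_by_relevance: List[str]) -> str:
    """Instruction first, reference material in the middle, constraints restated last."""
    ordered = order_for_edges(chunks_by_relevance)
    reference = "\n\n".join(
        f"[doc {i + 1}]\n{chunk}" for i, chunk in enumerate(ordered)
    )
    reminder = "\n".join(f"- {c}" for c in constraints)
    return (
        f"{instruction}\n\n"
        f"Reference material:\n{reference}\n\n"
        f"Before answering, re-read these constraints:\n{reminder}"
    )


if __name__ == "__main__":
    print(build_prompt(
        "Answer the question using only the reference material.",
        ["Cite the doc number for every claim.", "Say 'not found' if unsure."],
        ["most relevant", "second", "third", "fourth", "least relevant"],
    ))
```

Whether edge placement helps on a given model is an empirical question, so it is worth comparing this ordering against plain relevance order on your own eval set.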

Sampling: The Knobs Nobody Turns

Temperature gets all the attention, and it is the least interesting knob. The parameters that change behavior in ways people do not expect are top-p (nucleus sampling), top-k, and repetition or frequency penalties.

When low temperature is the wrong call

Teams default temperature to near zero for "accuracy" and then complain the model is repetitive or brittle. For extraction and classification, low temperature is correct. For anything requiring the model to consider alternatives — debugging, brainstorming, generating diverse candidates you will rank later — collapsing to greedy decoding throws away the model's actual strength. A common pattern is to generate several candidates at moderate temperature and select among them, rather than demanding the one true answer at temperature zero.
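One way to picture the generate-then-select pattern is the sketch below. The `generate` callable is a stand-in for whatever model client you use (its two-argument signature is an assumption, not a real API), and majority voting over normalized answers is only one possible selection rule; a scoring model or a test harness could take its place.

```python
from collections import Counter
from typing import Callable, List


def sample_candidates(generate: Callable[[str, float], str], prompt: str,
                      n: int = 5, temperature: float = 0.7) -> List[str]:
    """Draw n independent candidates at a moderate temperature.

    `generate(prompt, temperature)` is a hypothetical wrapper around your model call.
    """
    return [generate(prompt, temperature) for _ in range(n)]


def select_by_majority(candidates: List[str]) -> str:
    """Return the answer that appears most often after light normalization."""
    normalized = [c.strip().lower() for c in candidates]
    winner, _count = Counter(normalized).most_common(1)[0]
    # Return the first original candidate whose normalized form won the vote.
    return next(c for c, norm in zip(candidates, normalized) if norm == winner)
```

Majority voting suits tasks with short, comparable answers. For open-ended output, replace `select_by_majority` with a ranking step of your own.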

Top-p versus top-k

Top-k caps the candidate pool at a fixed number of tokens; top-p caps it by cumulative probability mass. Top-p adapts to the model's confidence: when the model is certain, the pool shrinks; when it is unsure, the pool widens. For most production work, top-p in the 0.9 to 0.95 range with a modest temperature gives more stable output than aggressively tuning temperature alone. The failure mode of ignoring these is subtle — you get output that is fine on average and occasionally veers into nonsense on the inputs where the model was uncertain.
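The difference is easiest to see on toy distributions. The sketch below implements both filters over a plain dictionary of token probabilities; the two example distributions are invented for illustration, and real decoders apply the same idea to logits over the full vocabulary.

```python
from typing import Dict


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p, then renormalize."""
    kept = []
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Invented next-token distributions: one confident, one uncertain.
confident = {"Paris": 0.96, "Lyon": 0.02, "Nice": 0.01, "Rome": 0.01}
unsure = {"red": 0.30, "blue": 0.28, "green": 0.22, "gray": 0.12, "pink": 0.08}

if __name__ == "__main__":
    for name, dist in (("confident", confident), ("unsure", unsure)):
        print(name, "top-k=3 pool:", len(top_k_filter(dist, 3)),
              "| top-p=0.9 pool:", len(top_p_filter(dist, 0.9)))
```

With k fixed at 3 the pool has three tokens in both cases, while top-p at 0.9 keeps a single token for the confident distribution and four for the uncertain one.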

Your Evals Are Lying to You

Here is the uncomfortable truth about intermediate AI work: most teams measure the wrong thing. They assemble 20 to 50 test cases, run them, see green, and ship. Then production hands them inputs that look nothing like the test set.

The leakage and overfitting trap

If you tune prompts against the same examples you evaluate on, you are overfitting to those examples exactly the way an over-trained model overfits to training data. Keep a held-out set you never inspect during prompt iteration. When the held-out numbers diverge from your development numbers, you have learned something real about generalization.
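A held-out set only stays held out if the assignment is stable across runs. One possible approach, sketched here, is to hash each case's identifier so that a case never migrates between sets when the suite grows. The function names and the 30% holdout fraction are assumptions for illustration.

```python
import hashlib
from typing import List, Tuple


def assign_split(case_id: str, holdout_fraction: float = 0.3) -> str:
    """Deterministically map a case id to 'dev' or 'holdout'."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if bucket < holdout_fraction else "dev"


def split_cases(case_ids: List[str],
                holdout_fraction: float = 0.3) -> Tuple[List[str], List[str]]:
    """Return (dev_ids, holdout_ids); iterate on prompts against dev only."""
    dev = [c for c in case_ids if assign_split(c, holdout_fraction) == "dev"]
    holdout = [c for c in case_ids if assign_split(c, holdout_fraction) == "holdout"]
    return dev, holdout
```

Run the holdout cases only at milestones, and report the gap between dev and holdout scores alongside the scores themselves.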

Measure distributions, not points

A single accuracy number hides the tail. A model that is 95% accurate on average can still fail catastrophically on a specific input category — say, anything involving negation, or non-English names, or dates before the year 2000. Slice your evals by input type and look at the worst-performing slice, because that slice is what generates support tickets. For a structured approach to avoiding these traps, 7 Common Mistakes with Foundation Models (and How to Avoid Them) catalogs the failure patterns that show up repeatedly.
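As a rough sketch of slicing, the code below groups eval results by a category label and reports the weakest group. The result format, a list of `(slice_name, is_correct)` pairs, and the example numbers are assumptions chosen to mirror the 95% scenario described above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Result = Tuple[str, bool]  # (slice name, whether the model was correct)


def overall_accuracy(results: List[Result]) -> float:
    return sum(ok for _, ok in results) / len(results)


def accuracy_by_slice(results: List[Result]) -> Dict[str, float]:
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for slice_name, ok in results:
        totals[slice_name] += 1
        correct[slice_name] += int(ok)
    return {name: correct[name] / totals[name] for name in totals}


def worst_slice(results: List[Result]) -> Tuple[str, float]:
    """Return the slice with the lowest accuracy and its score."""
    per_slice = accuracy_by_slice(results)
    name = min(per_slice, key=per_slice.get)
    return name, per_slice[name]


# Invented results: 90 plain inputs all correct, 10 negation inputs half correct.
results = (
    [("plain", True)] * 90
    + [("negation", True)] * 5
    + [("negation", False)] * 5
)

if __name__ == "__main__":
    print("overall:", overall_accuracy(results))
    print("worst slice:", worst_slice(results))
```

Here the overall figure is 95% while the negation slice sits at 50%, which is the kind of gap a single headline number conceals.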

Fine-Tuning, RAG, and Prompting Are Not Interchangeable

A frequent mistake, even among experienced teams, is reaching for fine-tuning to solve a knowledge problem. Fine-tuning teaches a model a behavior or a style; it is poor at injecting facts. If the issue is "the model does not know our internal product catalog," fine-tuning tends to make it confidently wrong, because it learns the shape of answers without reliably learning the contents.

The decision tree that actually holds up:

The model lacks current or proprietary facts → retrieval (RAG). Keep the facts external and fetch them at query time.
The model does not follow your format or tone → prompting first, then few-shot examples, then fine-tuning only if those plateau.
You need lower latency or cost at a fixed behavior → fine-tune a smaller model to match a larger model's outputs (distillation), which is one of the few cases where fine-tuning earns its keep.

Mixing these up is the most common reason advanced projects stall. A Framework for Foundation Models walks through the selection logic in more detail.
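For teams that want the decision tree written down where it can be reviewed, it reduces to a small function. The enum members and return strings below are labels invented for this sketch; the branching itself follows the three cases listed above.

```python
from enum import Enum


class Problem(Enum):
    MISSING_FACTS = "missing_facts"            # current or proprietary knowledge
    WRONG_FORMAT_OR_TONE = "wrong_format_or_tone"
    LATENCY_OR_COST = "latency_or_cost"        # behavior is already fixed


def choose_approach(problem: Problem, prompting_plateaued: bool = False) -> str:
    """Map a diagnosed problem to the first tool worth trying."""
    if problem is Problem.MISSING_FACTS:
        return "retrieval"
    if problem is Problem.WRONG_FORMAT_OR_TONE:
        # Fine-tune only after prompting and few-shot examples stop improving.
        return "fine-tuning" if prompting_plateaued else "prompting, then few-shot"
    if problem is Problem.LATENCY_OR_COST:
        return "distillation into a smaller model"
    raise ValueError(f"Unknown problem: {problem!r}")
```

The useful part is the diagnosis step that feeds it: deciding which of the three problems you actually have before picking a tool.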

Structured Output and the Reliability Tax

Getting a model to return valid JSON every time is an underrated engineering challenge. Free-form generation will occasionally emit a trailing comma, a markdown code fence, or a hallucinated field. At scale, "occasionally" means daily.

Robust approaches, in rough order of reliability:

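Use the provider's native structured-output or function-calling mode when available; structured-output modes typically constrain decoding to a schema you supply rather than asking the model to behave. Check the provider's documentation for how strictly each mode enforces it.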
If you must parse free text, validate against a schema and retry with the validation error fed back into the prompt. In practice, a single retry usually resolves most malformations.
Never regex-extract from prose when a structured mode exists. The regex will work in testing and break on the first input that phrases things differently.
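The validate-and-retry approach can be sketched with the standard library alone. Everything specific here is an assumption for illustration: `call_model` stands in for your model client, `REQUIRED_FIELDS` is a made-up two-field schema, and the hand-written checks would normally be replaced by a proper schema validator.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS: Dict[str, type] = {"name": str, "quantity": int}


def validate(raw: str) -> Dict[str, Any]:
    """Parse and check the reply; raise ValueError with a message the model can act on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid JSON: {exc.msg} at position {exc.pos}") from exc
    if not isinstance(data, dict):
        raise ValueError("Top-level value must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"Missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"Field {field} must be of type {expected.__name__}")
    extra = sorted(set(data) - set(REQUIRED_FIELDS))
    if extra:
        raise ValueError(f"Unexpected fields: {extra}")
    return data


def generate_json(call_model: Callable[[str], str], prompt: str,
                  max_retries: int = 1) -> Dict[str, Any]:
    """Call the model, validate, and retry with the validation error fed back in."""
    attempt_prompt = prompt
    last_error = None
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        try:
            return validate(raw)
        except ValueError as err:
            last_error = err
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {err}\n"
                "Reply with only the corrected JSON object."
            )
    raise ValueError(f"No valid JSON after {max_retries + 1} attempts: {last_error}")
```

Keeping the retry count small and surfacing the final failure makes malformed output visible in monitoring instead of hiding it behind an endless loop.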

Latency, Cost, and the Architecture That Survives Contact

Advanced systems are judged on tail latency and unit economics, not best-case demos. A few decisions that compound:

Cache aggressively. Prompt caching on stable system prompts and few-shot blocks can cut both cost and latency substantially when the static portion of your prompt is large.
Route by difficulty. Send easy inputs to a small fast model and escalate only the hard ones to a large model. A classifier or a confidence check up front pays for itself.
Stream and degrade gracefully. Stream tokens to keep perceived latency low, and define a fallback path for when the primary model is slow or unavailable rather than letting requests hang.

These are the patterns that separate a system you can put real traffic through from a prototype. Foundation Models: Best Practices That Actually Work collects more of them.
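Two of these patterns, difficulty routing and a fallback path, fit in a short sketch. The model callables are placeholders for real clients, and the confidence threshold and timeout values are arbitrary assumptions; the small model is assumed to return an answer together with a confidence score, which not every setup provides.

```python
import concurrent.futures
from typing import Callable, Tuple


def route(text: str,
          small_model: Callable[[str], Tuple[str, float]],
          large_model: Callable[[str], str],
          confidence_threshold: float = 0.8) -> str:
    """Answer with the small model when it is confident; otherwise escalate."""
    answer, confidence = small_model(text)
    if confidence >= confidence_threshold:
        return answer
    return large_model(text)


def with_fallback(primary: Callable[[str], str],
                  fallback: Callable[[str], str],
                  text: str,
                  timeout_s: float = 2.0) -> str:
    """Use the fallback if the primary call times out or raises."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, text)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout and an error raised by the primary call.
        return fallback(text)
    finally:
        # Do not block on a slow primary call; its thread finishes in the background.
        pool.shutdown(wait=False)
```

The timed-out call keeps running in its worker thread, so a production version would also want request cancellation on the client side and a cap on concurrent calls.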

Frequently Asked Questions

Does a bigger context window remove the need for retrieval?

No. Larger windows reduce how often you need to chunk, but the lost-in-the-middle effect and the cost of processing huge contexts on every call still favor retrieving only what is relevant. Treat a big context as headroom, not as a replacement for good retrieval.

When is fine-tuning actually worth it?

When you have a stable, well-defined behavior, hundreds to thousands of clean examples, and a cost or latency target that prompting cannot meet. Distilling a large model's behavior into a smaller one is the strongest case. Fine-tuning to teach facts is almost always the wrong tool.

Why does my model pass evals but fail in production?

Almost always because your eval set does not reflect the production input distribution, or because you overfit prompts to the eval cases. Keep a held-out set, slice metrics by input type, and watch the worst slice rather than the average.

How do I get reliable JSON out of a model?

Use the provider's native structured-output mode, validate the result against a schema, and retry with the error message on failure. Avoid parsing free-form prose when a constrained mode exists.

What temperature should I use for production?

It depends on the task. Use a value near zero for extraction and classification, and a moderate value (0.5 to 0.8) with top-p around 0.9 for tasks that benefit from considering alternatives. The instinct to set everything to zero for "accuracy" often hurts quality on uncertain inputs.

Key Takeaways

Large context windows have positional bias; place critical instructions at the start and end, not the middle.
Top-p and candidate generation matter more than obsessively tuning temperature alone.
Your evals are probably lying because of leakage and averaging; use held-out sets and slice by input type.
Match the tool to the problem: retrieval for facts, prompting and few-shot for behavior, fine-tuning mainly for distillation.
Use native structured output and schema validation instead of regex parsing.
Production-grade systems win on caching, difficulty routing, and graceful degradation, not on demo-day accuracy.

Agency Script Editorial
