Shrink a Prompt in Six Measured Steps You Can Run Today

Knowing the theory of prompt compression and actually shrinking a prompt are different skills. This guide covers the second. What follows is a sequential process you can run today on a prompt you already have in production, with each step defined concretely enough to follow without interpretation. No abstractions about token efficiency: just do this, then that, and measure as you go.

The process is deliberately conservative. It compresses one thing at a time and checks quality after every change, because the fastest way to ruin a working prompt is to cut several things at once and lose track of which cut did the damage. Follow the steps in order. Each builds on the artifact the previous step produced.

You will need a prompt to compress and a small set of representative inputs to test it against. Gather those before step one. The representative inputs matter more than people expect: they are the only thing standing between a safe cut and a silent regression, so spend a few minutes choosing inputs that cover both your common cases and the awkward edge cases where hidden constraints live. A test set that only contains easy inputs will bless cuts that quietly break the hard ones.

Step 1: Build the Baseline

You cannot tell whether compression hurt quality without knowing what good looked like beforehand.

Do this

Pick five to ten inputs that represent the real range of what the prompt handles.
Run the current prompt on each and save the outputs.
Note the token count of the prompt and a rough quality rating for each output (for example: good, borderline, or bad).

This baseline is the reference every later step compares against. Skipping it means flying blind, a common root of compression failures.
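A minimal sketch of what building that baseline could look like in Python. The `fake_model` function is a stand-in for whatever model call you actually use, and `estimate_tokens` is a crude four-characters-per-token guess rather than a real tokenizer; both names are assumptions made for illustration.

```python
import json
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def build_baseline(prompt: str, inputs: list[str],
                   call_model: Callable[[str, str], str]) -> dict:
    """Run the current prompt on every test input and collect the outputs."""
    return {
        "prompt_tokens": estimate_tokens(prompt),
        "cases": [
            # "quality" is left empty so a person can fill in a rough rating.
            {"input": text, "output": call_model(prompt, text), "quality": None}
            for text in inputs
        ],
    }


def fake_model(prompt: str, user_input: str) -> str:
    """Hypothetical stand-in for a real model call so the sketch runs offline."""
    return f"reply to: {user_input}"


prompt = "You are a helpful support agent. " * 10
baseline = build_baseline(
    prompt, ["refund request", "angry customer", "empty message"], fake_model
)
baseline_json = json.dumps(baseline, indent=2)  # write this string to a file in practice
```

Swapping `fake_model` for a real call and saving `baseline_json` next to the prompt gives every later step something fixed to compare against.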

Step 2: Find the Fat

Before cutting, locate where the tokens actually are.

Do this

Read the prompt and mark each section by purpose: instructions, examples, context, background.
Estimate the token weight of each section.
Flag anything repeated, anything that reads as filler, and any context that may be irrelevant.

The largest sections that carry the least task-specific information are your first targets. Often a long system preamble or a padded context block dominates the token count, and those are where the easy wins hide.
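One way to put rough numbers on this is to split the prompt into labeled sections and rank them by estimated size. The sketch below assumes you have already separated the sections by hand; the section names, their contents, and the four-characters-per-token estimate are all illustrative.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def rank_sections(sections: dict[str, str]) -> list[tuple[str, int, float]]:
    """Return (name, estimated tokens, share of total), heaviest section first."""
    weights = {name: estimate_tokens(body) for name, body in sections.items()}
    total = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [(name, tokens, round(tokens / total, 2)) for name, tokens in ranked]


# Hypothetical prompt, already split into sections by purpose.
sections = {
    "preamble": "Thank you so much for helping today. " * 8,
    "tone": "Be warm, be calm, be brief. " * 12,
    "rules": "Never promise refunds. Escalate legal threats. ",
    "examples": "Q: Where is my order? A: Let me check that for you. " * 6,
}

ranked = rank_sections(sections)
for name, tokens, share in ranked:
    print(f"{name:10s} {tokens:5d} tokens  {share:.0%}")
```

The ranking only says where the tokens are. Whether a heavy section is filler or load-bearing is still a judgment you make by reading it.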

Step 3: Make One Cut

Now compress, but only one thing.

Do this

Choose a single flagged section.
Apply one technique: remove filler, tighten instructions into bullets, or drop an irrelevant passage.
Leave everything else untouched.

Resisting the urge to fix five things at once is the discipline that makes this process work. One change means you can attribute any quality shift to exactly that change. This is the same single-variable rule explained in Saying More to a Model With Fewer Tokens.
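If the prompt is kept as named sections, the one-change rule can be enforced mechanically. This is a sketch under that assumption; `apply_one_cut` and `changed_sections` are hypothetical helpers, not part of any library.

```python
def apply_one_cut(sections: dict[str, str], name: str, new_body: str) -> dict[str, str]:
    """Return a copy of the sections with exactly one section replaced."""
    if name not in sections:
        raise KeyError(f"unknown section: {name}")
    candidate = dict(sections)  # copy, so the original stays available for a revert
    candidate[name] = new_body
    return candidate


def changed_sections(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List the sections whose text differs between two versions."""
    return [name for name in before if before[name] != after.get(name)]


sections = {
    "preamble": "Thank you so much for helping today.",
    "tone": "Be warm, calm, and brief.",
    "rules": "Never promise refunds.",
}

candidate = apply_one_cut(sections, "preamble", "")

# Guard: refuse to measure a candidate that differs in more than one place.
assert changed_sections(sections, candidate) == ["preamble"]

# Reassemble the prompt text, skipping sections that are now empty.
candidate_prompt = "\n\n".join(body for body in candidate.values() if body)
```

Because `apply_one_cut` returns a copy, reverting a bad cut is just a matter of discarding `candidate` and continuing from `sections`.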

Step 4: Re-Measure Against the Baseline

This is the step that turns guessing into knowing.

Do this

Run the compressed prompt on the same five-to-ten inputs from step one.
Compare each output to its baseline counterpart.
Record the new token count and whether quality held, improved, or dropped.

If quality held or improved, keep the cut: the prompt is shorter and nothing was lost. If it dropped, the section you cut carried signal, not filler. Revert it and try a different section. This compare-and-keep loop is what separates compression from accidental deletion, the failure cataloged in 7 Common Mistakes with Prompt Compression Techniques (and How to Avoid Them).
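The keep-or-revert decision can be sketched as a small function. How you judge "as good as the baseline" is up to you (a human read, a rubric, a scripted check); the `keeps_escalation` check below is a deliberately simple, hypothetical example of such a judge.

```python
from typing import Callable


def decide(baseline_outputs: list[str], candidate_outputs: list[str],
           is_as_good: Callable[[str, str], bool]) -> tuple[str, list[int]]:
    """Return ("keep" or "revert", indexes of the cases that got worse)."""
    if len(baseline_outputs) != len(candidate_outputs):
        raise ValueError("both runs must cover the same inputs")
    regressed = [
        i for i, (old, new) in enumerate(zip(baseline_outputs, candidate_outputs))
        if not is_as_good(old, new)
    ]
    return ("revert" if regressed else "keep"), regressed


def keeps_escalation(old: str, new: str) -> bool:
    """Hypothetical judge: if the baseline reply escalated, the new one must too."""
    return "escalat" not in old.lower() or "escalat" in new.lower()


baseline_outputs = ["Your order ships Monday.", "I am escalating this to our legal team."]
candidate_outputs = ["Your order ships Monday.", "Sorry to hear that."]

decision, regressed = decide(baseline_outputs, candidate_outputs, keeps_escalation)
print(decision, regressed)  # prints: revert [1]
```

Returning the regressed indexes, not just the verdict, tells you which input exposed the problem, which is often the fastest clue to what the cut section was doing.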

Step 5: Repeat Until the Returns Shrink

One cut is rarely the whole opportunity.

Do this

Return to step three and compress the next flagged section.
Re-measure each time.
Stop when remaining cuts either threaten quality or save too few tokens to matter.

Compression has diminishing returns. The first two or three cuts usually reclaim most of the available tokens; chasing the last few percent often risks more quality than it saves. Knowing when to stop is part of the skill.
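Put together, the repeat-and-stop loop might look like the sketch below. The `min_saving` threshold, the section contents, and the `quality_holds` stub (which stands in for re-running your test set) are assumptions chosen for illustration.

```python
from typing import Callable

Sections = dict[str, str]


def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return len(text) // 4


def compress(sections: Sections, cuts: list[tuple[str, str]],
             quality_holds: Callable[[Sections], bool],
             min_saving: int = 10) -> tuple[Sections, list[tuple[str, str]]]:
    """Try one cut at a time; keep it only if quality holds and it saves enough."""
    current = dict(sections)
    log = []
    for name, new_body in cuts:
        saving = estimate_tokens(current[name]) - estimate_tokens(new_body)
        if saving < min_saving:
            log.append((name, "skipped: saves too little"))
            continue
        candidate = {**current, name: new_body}  # exactly one section changes
        if quality_holds(candidate):
            current = candidate
            log.append((name, f"kept: saved ~{saving} tokens"))
        else:
            log.append((name, "reverted: quality dropped"))
    return current, log


def quality_holds(candidate: Sections) -> bool:
    """Stub for re-running the test set; here the tone block must still mention escalation."""
    return "escalate" in candidate["tone"]


example = "Q: Where is my order? A: Let me check that for you.\n"
sections = {
    "preamble": "Thank you so much for helping today. " * 8,
    "tone": "Please always try to be warm, calm, and brief in every reply. "
            "If the user mentions a lawyer, escalate to a human.",
    "rules": "Never promise refunds.",
    "examples": example * 3,
}
cuts = [
    ("preamble", ""),
    ("examples", example),
    ("tone", "Be warm, calm, brief."),
    ("rules", "No refund promises."),
]

final, log = compress(sections, cuts, quality_holds)
for name, status in log:
    print(f"{name:10s} {status}")
```

In this made-up run the preamble and example cuts are kept, the tone cut is reverted because it removed the escalation instruction, and the rules cut is skipped because it saves almost nothing.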

Step 6: Lock and Document the Result

A compressed prompt that nobody records will drift back to bloat.

Do this

Save the final prompt in version control, not in a chat history.
Note the total tokens saved and confirm the quality baseline still holds.
Write down which sections you compressed and which you deliberately left alone.

Documenting what you left alone matters as much as what you cut, because it tells the next person which sections are load-bearing. A one-line note saying "this looks verbose but is required to trigger escalation" can save a future editor from re-introducing a regression you already found and reverted. For a sense of how much real prompts compress, walk through Case Study: Prompt Compression Techniques in Practice.
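A record like this can live in a small structured file committed next to the prompt. The sketch below shows one possible shape; the `CompressionRecord` class and its field names are hypothetical, and the token counts are made-up figures.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CompressionRecord:
    tokens_before: int
    tokens_after: int
    compressed: dict[str, str] = field(default_factory=dict)  # section -> what was done
    left_alone: dict[str, str] = field(default_factory=dict)  # section -> why it stays

    def to_json(self) -> str:
        """Serialize the record, including the derived savings figure."""
        data = asdict(self)
        data["tokens_saved"] = self.tokens_before - self.tokens_after
        return json.dumps(data, indent=2)


# Illustrative values only.
record = CompressionRecord(
    tokens_before=600,
    tokens_after=410,
    compressed={"preamble": "deleted", "examples": "reduced from three to one"},
    left_alone={"tone": "looks verbose but is required to trigger escalation"},
)

notes = record.to_json()
print(notes)  # commit this alongside the prompt file
```

Keeping `left_alone` as a first-class field makes it harder to forget the half of the record that protects load-bearing sections.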

One last habit worth building: schedule a re-check after any model upgrade. A compression validated against today's model is not guaranteed safe against tomorrow's, because an update can change which tersely phrased instructions the model still follows. Re-running your baseline test set after an upgrade is cheap insurance against a prompt that was lean and correct silently becoming lean and wrong.

A Worked Pass Through the Steps

To make the process concrete, here is what a single pass looks like on a realistic prompt, so you can picture each step before running your own.

The starting point

Imagine a system prompt with four parts: a polite preamble, a long block of tone guidance, a list of task rules, and a set of three worked examples. It runs on every request, so every token in it is paid for on every call.
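Because the prompt is sent with every request, per-prompt savings multiply by request volume. A rough back-of-the-envelope sketch, using the 600-token starting size this walkthrough assumes; the compressed size and the daily request count are made-up figures.

```python
def daily_tokens_saved(tokens_before: int, tokens_after: int, requests_per_day: int) -> int:
    """Prompt tokens no longer sent each day once the shorter prompt is deployed."""
    return (tokens_before - tokens_after) * requests_per_day


# 450 tokens after compression and 20,000 requests per day are hypothetical.
saved = daily_tokens_saved(600, 450, 20_000)
print(f"{saved:,} prompt tokens saved per day")  # prints: 3,000,000 prompt tokens saved per day
```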

Walking the steps

Baseline (Step 1): You run the prompt on eight representative inputs and save the outputs. The prompt is, say, 600 tokens.
Find the fat (Step 2): You mark the preamble as pure filler, the tone block as verbose but possibly load-bearing, the rules as essential, and the three examples as more than the task needs.
First cut (Step 3): You delete the preamble only, leaving everything else alone.
Re-measure (Step 4): The eight outputs are unchanged in quality, and the prompt is now smaller. You keep the cut.
Repeat (Step 5): Next pass, you reduce three examples to one. Quality still holds. The pass after, you tighten the tone block into bullets, and one output gets slightly worse. You revert the tightening, re-apply it one piece at a time with a re-measure after each, and keep only the pieces that leave quality intact.
Lock (Step 6): You stop when remaining cuts threaten quality, save the result in version control, and note that the rules and the surviving example are load-bearing.

The point of walking through it is to show that the process is unglamorous on purpose. Each step is small, each result is measured, and the safety comes from never changing more than one thing between measurements. That is the entire trick: no clever shortcut beats measuring.

Frequently Asked Questions

How many test inputs do I really need?

Five to ten that genuinely represent the range of real usage. The goal is not statistical rigor but enough coverage to notice if a cut breaks a common case. Too few and you miss regressions; far more and the loop gets slow without adding much confidence.

Why only one cut at a time?

Because if you change several things and quality drops, you cannot tell which change caused it. One cut per measurement keeps every result attributable, so you keep the good cuts and revert only the harmful one instead of throwing away the whole batch.

When should I stop compressing?

When the next cut either threatens quality or saves too few tokens to justify the risk. The first few cuts usually capture most of the savings, and chasing the last percent tends to cost more in quality than it returns in tokens.

What if every section seems load-bearing?

Then the prompt may already be efficient, which is a fine outcome. More often, the load-bearing sections can still be tightened in wording (turned into bullets, stripped of filler) without removing any actual requirement. Tighten the phrasing before concluding there is nothing to cut.

Key Takeaways

Build a quality baseline on representative inputs before changing anything; it is the reference for every later step.
Locate the fat by mapping each section's purpose and token weight before cutting.
Make one cut at a time so any quality change is attributable to that single change.
Re-measure after each cut and keep it only if quality holds; revert if it drops.
Stop when returns shrink, then lock and document the result in version control, including what you left alone.

Agency Script Editorial
