Count Your Tokens, Find the Waste, Save Today

Most guides to context length open with retrieval architectures and embedding databases. That is the right destination for some systems, but it is the wrong starting point for almost everyone. The fastest credible first result is much simpler: count what you are actually sending the model, find the waste, and cut it. You can do that with no new infrastructure and see savings the same day.

This guide assumes you have an AI feature that sends text to a model and you have never deliberately managed how much. That is the normal starting state. We will go from there to a first measurable improvement, list the prerequisites honestly, and point at where to go once you have squeezed the easy wins.

What You Need Before You Start

The prerequisites are modest, which is the point.

A way to log prompts. You need to capture the actual text sent to the model, or at least its token count, per call. If you cannot see what you send, you cannot manage it.
A token counter. Every major model provider offers a way to count tokens for a given string. You need this to turn "a big prompt" into a number.
A handful of real examples. Ten to twenty representative prompts from actual usage. Synthetic examples lie; real ones reveal the bloat.

That is the whole list. You do not need a vector database, a RAG framework, or an eval harness to get your first result. Those come later if the audit shows you need them.

Step One: Audit What You Send

Before optimizing, measure. Take your real prompt examples and break each one into its parts.

Tag every prompt by source

Split each prompt into system instructions, retrieved or pasted context, conversation history, and the user's actual input. Count tokens for each part. Almost every team that does this for the first time finds one part is shockingly large, usually a verbose system prompt or unbounded history.

Find the waste

Look for three things specifically:

Instructions repeated in every call that could be stated once.
Context that is pasted in full when only a fraction is relevant.
Conversation history that grows without limit because nothing prunes it.

This audit alone often reveals 30 to 50 percent of your tokens are doing no work. That is your first result, and you have not changed any code yet.

Step Two: Cut the Obvious Waste

Now act on the audit. Start with the changes that carry no accuracy risk.

Trim the system prompt. Remove redundancy and examples that do not change behavior. Test against your real examples to confirm output quality holds.
Cap conversation history. Keep the most recent and most relevant turns rather than the entire thread. A simple recency window is a fine first version.
Stop pasting full documents when the user only asks about a section. Even crude truncation to the relevant part beats sending everything.

Make one change at a time and re-run your examples. If output quality holds, keep the change. This is a tight, low-risk loop that produces real savings fast. The common mistakes article catalogs the traps people hit at this stage, and reading it first will save you a few.

Step Three: Decide If You Need More

After the audit and the easy cuts, you have a much leaner system and a clear picture. Now decide whether to stop or keep going.

If your relevant content is small and stable, you may be done. Do not build retrieval for a problem you have just solved by trimming.
If you are still sending large context because the relevant content lives in a big corpus, that is the signal to move to retrieval. The trade-offs article lays out the decision rule for that step.
If accuracy matters enough to need confidence, build a small eval set next so future changes are measured rather than guessed.

The discipline here is not jumping to retrieval prematurely. The audit-and-trim pass solves the problem outright for a surprising number of teams, and it costs an afternoon instead of a sprint.

A Worked Example to Anchor the Process

Abstract steps are easy to nod along to and hard to act on, so here is a concrete shape of what the first pass looks like in practice.

Imagine a support assistant that answers product questions. Its prompt, when you finally log and count it, breaks down like this: a 3,000-token system prompt full of examples accumulated over months, the entire 40-turn conversation history, a full 12,000-token help article pasted in regardless of the question, and a 40-token user message. The total is large, the per-call cost is high, and nobody chose any of it deliberately. It grew.

The audit makes the waste obvious. The system prompt has six examples where two would do. The conversation history includes turns from twenty minutes ago that no longer matter. The help article is pasted whole even when the user asked about one feature described in two paragraphs. None of this is exotic; it is the default state of a feature that was shipped to work, not to be efficient.

The trim is mechanical once you see it. Cut the system prompt to the two examples that actually change behavior. Cap history to the last several relevant turns. Send the relevant section of the article instead of the whole thing. Each change gets verified against your real examples, and if output holds, it stays. The prompt that was dominated by unused tokens becomes a lean prompt that does the same job, and the savings are immediate and recurring. This is the entire value proposition of getting started: most of the win is removing what should never have been there.

What Not to Do First

Beginners often sabotage the easy win by reaching for sophistication too early.

Do not build embeddings infrastructure before auditing. You may discover the problem is a bloated system prompt, which embeddings do nothing to fix.
Do not optimize a low-traffic feature as your first project. Pick something with real volume so the savings are worth the effort and visible to others.
Do not change five things at once. You will not know which change helped or hurt. One change, one check, every time.

Restraint is the skill here. The fastest path to a real result is the boring one, and the boring one works.

Frequently Asked Questions

Do I need a vector database to get started?

No. The first result comes from auditing and trimming what you already send, which needs no new infrastructure. Build retrieval only after the audit shows your relevant content genuinely lives in a corpus too large to send directly.

How do I count tokens?

Every major model provider offers a tokenizer or token-counting utility for a given string. Use the one matching your model, since token counts differ slightly between models. This turns vague prompt size into a number you can manage.

What is the fastest improvement I can make today?

Audit your prompt by source, then trim the largest non-essential part, usually a verbose system prompt or unbounded conversation history. Many teams find a third or more of their tokens are doing no work, and removing them is risk-free if output quality holds on your test examples.

How do I know if trimming hurt accuracy?

Run your handful of real examples before and after each change and compare outputs. For a first pass this informal check is enough; if accuracy is critical, build a small frozen eval set so the comparison is rigorous rather than impressionistic.

When should I move from trimming to retrieval?

When you are still forced to send large context because the relevant information lives in a corpus larger than any prompt can hold. If trimming already got your prompts small and stable, you do not need retrieval yet.

Key Takeaways

Start with a token audit, not a retrieval pipeline. It is faster and often sufficient.
Prerequisites are minimal: prompt logging, a token counter, and a handful of real examples.
Tag prompts by source to find waste; verbose system prompts and unbounded history are the usual culprits.
Cut the obvious waste one change at a time, verifying output quality against real examples.
Move to retrieval only when trimming cannot shrink your context because the content lives in a large corpus.

What You Need Before You Start

The prerequisites are modest, which is the point.

A way to log prompts. You need to capture the actual text sent to the model, or at least its token count, per call. If you cannot see what you send, you cannot manage it.
A token counter. Every major model provider offers a way to count tokens for a given string. You need this to turn "a big prompt" into a number.
A handful of real examples. Ten to twenty representative prompts from actual usage. Synthetic examples lie; real ones reveal the bloat.

That is the whole list. You do not need a vector database, a RAG framework, or an eval harness to get your first result. Those come later if the audit shows you need them.

Step One: Audit What You Send

Before optimizing, measure. Take your real prompt examples and break each one into its parts.

Tag every prompt by source

Find the waste

Look for three things specifically:

Instructions repeated in every call that could be stated once.
Context that is pasted in full when only a fraction is relevant.
Conversation history that grows without limit because nothing prunes it.

This audit alone often reveals 30 to 50 percent of your tokens are doing no work. That is your first result, and you have not changed any code yet.

Step Two: Cut the Obvious Waste

Now act on the audit. Start with the changes that carry no accuracy risk.

Trim the system prompt. Remove redundancy and examples that do not change behavior. Test against your real examples to confirm output quality holds.
Cap conversation history. Keep the most recent and most relevant turns rather than the entire thread. A simple recency window is a fine first version.
Stop pasting full documents when the user only asks about a section. Even crude truncation to the relevant part beats sending everything.

Step Three: Decide If You Need More

After the audit and the easy cuts, you have a much leaner system and a clear picture. Now decide whether to stop or keep going.

If your relevant content is small and stable, you may be done. Do not build retrieval for a problem you have just solved by trimming.
If you are still sending large context because the relevant content lives in a big corpus, that is the signal to move to retrieval. The trade-offs article lays out the decision rule for that step.
If accuracy matters enough to need confidence, build a small eval set next so future changes are measured rather than guessed.

The discipline here is not jumping to retrieval prematurely. The audit-and-trim pass solves the problem outright for a surprising number of teams, and it costs an afternoon instead of a sprint.

A Worked Example to Anchor the Process

Abstract steps are easy to nod along to and hard to act on, so here is a concrete shape of what the first pass looks like in practice.

What Not to Do First

Beginners often sabotage the easy win by reaching for sophistication too early.

Do not build embeddings infrastructure before auditing. You may discover the problem is a bloated system prompt, which embeddings do nothing to fix.
Do not optimize a low-traffic feature as your first project. Pick something with real volume so the savings are worth the effort and visible to others.
Do not change five things at once. You will not know which change helped or hurt. One change, one check, every time.

Restraint is the skill here. The fastest path to a real result is the boring one, and the boring one works.

Frequently Asked Questions

Do I need a vector database to get started?

How do I count tokens?

What is the fastest improvement I can make today?

How do I know if trimming hurt accuracy?

When should I move from trimming to retrieval?

Key Takeaways

Start with a token audit, not a retrieval pipeline. It is faster and often sufficient.
Prerequisites are minimal: prompt logging, a token counter, and a handful of real examples.
Tag prompts by source to find waste; verbose system prompts and unbounded history are the usual culprits.
Cut the obvious waste one change at a time, verifying output quality against real examples.
Move to retrieval only when trimming cannot shrink your context because the content lives in a large corpus.

Count Your Tokens, Find the Waste, Save Today

What You Need Before You Start

Step One: Audit What You Send

Tag every prompt by source

Find the waste

Step Two: Cut the Obvious Waste

Step Three: Decide If You Need More

A Worked Example to Anchor the Process

What Not to Do First

Frequently Asked Questions

Do I need a vector database to get started?

How do I count tokens?

What is the fastest improvement I can make today?

How do I know if trimming hurt accuracy?

When should I move from trimming to retrieval?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

Count Your Tokens, Find the Waste, Save Today

What You Need Before You Start

Step One: Audit What You Send

Tag every prompt by source

Find the waste

Step Two: Cut the Obvious Waste

Step Three: Decide If You Need More

A Worked Example to Anchor the Process

What Not to Do First

Frequently Asked Questions

Do I need a vector database to get started?

How do I count tokens?

What is the fastest improvement I can make today?

How do I know if trimming hurt accuracy?

When should I move from trimming to retrieval?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?