The RAACE Model: A Repeatable Way to Budget Tokens

Agency Script Editorial

Editorial Team

August 15, 2022 · 8 min read

Ad hoc token optimization works once, then has to be reinvented for the next feature. A framework gives you a repeatable structure that applies to any prompt, so the thinking transfers instead of starting from scratch each time. This article introduces a named model — RAACE — that organizes token budgeting into five stages: Reserve, Allocate, Apportion, Compress, and Enforce. Each stage has a clear job, and together they take a prompt from an unbounded blob to a deliberate budget.

The value of a named model is not the name. It is that the stages run in a sensible order, each one sets up the next, and skipping a stage produces a predictable kind of failure. Reserve before you allocate or output overflows. Apportion before you compress or you cut the wrong things. Enforce or the whole thing decays. The model encodes that order so you do not have to rediscover it.

RAACE is not tied to any provider or feature type. It applies equally to a chatbot, a retrieval system, a classifier, or a summarizer, because every one of them divides a fixed window among components and pays per token. What changes between them is which stage carries the most weight, which we will cover as we go.

Stage One: Reserve

The first stage sets aside room for the answer before anything else competes for space.

What It Does

Reserve carves out the output budget from the context window first. If a feature needs up to 800 tokens of answer, those tokens are removed from contention before any input is placed. Everything else must fit in the remainder.
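
As a minimal sketch of the Reserve step, the arithmetic can be written as a single function. The 8,000-token window below is an assumed figure for illustration, and `reserve_output` is a hypothetical helper name; only the 800-token answer comes from the example above.

```python
def reserve_output(context_window: int, max_output_tokens: int) -> int:
    """Return the tokens left for input once the output budget is set aside."""
    if max_output_tokens <= 0:
        raise ValueError("output reservation must be positive")
    if max_output_tokens >= context_window:
        raise ValueError("output reservation leaves no room for input")
    return context_window - max_output_tokens


# Assumed 8,000-token window; 800 tokens reserved for the answer.
remaining = reserve_output(context_window=8_000, max_output_tokens=800)
print(remaining)  # 7200
```

Everything placed into the prompt afterward has to fit inside `remaining`.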

Why It Comes First

Output is usually the pricier side and the part most likely to overflow if neglected. Reserving it first prevents the common failure where input fills the window and leaves no room to respond. The deeper rationale is in Spending Tokens Like Money: A Working Manual for LLM Budgets.

When It Matters Most

Reserve carries the most weight in generative features — code assistants, summarizers, long-form writers — where answers are long and output cost dominates.

Stage Two: Allocate

With output reserved, Allocate turns what remains of the window into a single total input budget.

What It Does

Allocate sets the ceiling for all input combined. It is the boundary the next stages must respect: the sum of system prompt, retrieval, history, and user message cannot exceed it.
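
One way to picture the ceiling is as a number plus a check that the components sum to no more than it. The sketch below is illustrative only: the function names, the optional `safety_margin` (an assumed allowance for formatting overhead), and the token counts are not from the article.

```python
def input_ceiling(remaining_after_reserve: int, safety_margin: int = 0) -> int:
    """Total tokens all input components may use together."""
    ceiling = remaining_after_reserve - safety_margin
    if ceiling <= 0:
        raise ValueError("no input budget left after reserve and margin")
    return ceiling


def fits_ceiling(component_tokens: dict[str, int], ceiling: int) -> bool:
    """True when the combined input stays within the ceiling."""
    return sum(component_tokens.values()) <= ceiling


ceiling = input_ceiling(remaining_after_reserve=7_200, safety_margin=200)
components = {
    "system_prompt": 600,
    "retrieval": 4_000,
    "history": 2_000,
    "user_message": 300,
}
print(ceiling, fits_ceiling(components, ceiling))  # 7000 True
```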

Why It Matters

A total input ceiling turns vague intentions into a hard constraint. Without it, individual components negotiate for space implicitly, and the loser is usually the room left for the answer or the coherence of the prompt. An explicit ceiling makes the trade-offs visible.

When It Matters Most

Allocate is critical wherever input is large and variable — retrieval-heavy systems especially — because that is where the ceiling does the most policing.

Stage Three: Apportion

Apportion splits the input ceiling among the individual components by value.

What It Does

It assigns each input component its own sub-budget: so many tokens for the system prompt, so many for retrieval, so many for history, the rest for the user message. The split reflects how much each component improves answers.

Why It Matters

Apportioning by value ensures the budget flows to the components that earn it. The user message and core instructions are non-negotiable; history and retrieval get the remainder, ranked by contribution. This ranking is the same one emphasized in Token Budget Management and Optimization: Best Practices That Actually Work.
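
A rough sketch of apportioning by value: the non-negotiable components take fixed amounts first, and the remainder is split among the others in proportion to a weight. The weights, component names, and the `apportion` helper are assumptions for illustration; in practice the weights would come from observed answer quality.

```python
def apportion(ceiling: int, fixed: dict[str, int], weights: dict[str, int]) -> dict[str, int]:
    """Give fixed components their tokens, then split the rest by weight."""
    if not weights or sum(weights.values()) <= 0:
        raise ValueError("need at least one positively weighted component")
    flexible = ceiling - sum(fixed.values())
    if flexible < 0:
        raise ValueError("fixed components alone exceed the input ceiling")

    total_weight = sum(weights.values())
    budgets = dict(fixed)
    for name, weight in weights.items():
        budgets[name] = flexible * weight // total_weight

    # Integer division can leave a few tokens over; give them to the
    # highest-weighted component so the budgets sum exactly to the ceiling.
    leftover = ceiling - sum(budgets.values())
    budgets[max(weights, key=weights.get)] += leftover
    return budgets


budgets = apportion(
    ceiling=7_000,
    fixed={"system_prompt": 600, "user_message": 400},
    weights={"retrieval": 2, "history": 1},  # assumed: retrieval valued twice as much
)
print(budgets)
# {'system_prompt': 600, 'user_message': 400, 'retrieval': 4000, 'history': 2000}
```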

When It Matters Most

Apportion matters most when components compete — multi-turn chat with retrieval, where history and context both want a large share of a fixed input budget.

Stage Four: Compress

Compress fits each component into its sub-budget without losing what the model needs.

What It Does

For each component over its budget, Compress applies the right technique: summarize older history, rerank and trim retrieved passages, prune the system prompt, and prefer structured output. The goal is to hit the sub-budget while preserving meaning.
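
To make one of these techniques concrete, here is a sketch of reranking and trimming retrieved passages to a sub-budget. It assumes each passage already carries a relevance score, and `count_tokens` is a whitespace stand-in rather than a real tokenizer; swap in the tokenizer for your model before relying on the numbers.

```python
def count_tokens(text: str) -> int:
    # Stand-in only: counts whitespace-separated words, not real tokens.
    return len(text.split())


def trim_passages(passages: list[tuple[float, str]], budget: int) -> list[str]:
    """Keep the highest-scoring passages that fit within the budget."""
    kept: list[str] = []
    used = 0
    for _score, text in sorted(passages, key=lambda p: p[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget:
            continue  # too large for the space left; a smaller passage may still fit
        kept.append(text)
        used += cost
    return kept


passages = [
    (0.9, "alpha beta gamma"),
    (0.2, "delta epsilon"),
    (0.7, "zeta eta theta iota"),
]
print(trim_passages(passages, budget=5))
# ['alpha beta gamma', 'delta epsilon']
```

The second-ranked passage is skipped because it would overflow the budget, which is a design choice: a stricter variant could stop at the first passage that does not fit.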

Why It Matters

Compression is where cost savings and quality risk meet. Done well, it removes redundancy and noise. Done carelessly, it removes context the model needed. The discipline is cutting what does not matter, not merely cutting. The concrete techniques are cataloged in Cut Your Token Costs This Afternoon: An Ordered Routine.

When It Matters Most

Compress carries the most weight in long-running sessions and document-heavy features, where components naturally grow past their budgets.

Stage Five: Enforce

Enforce makes the budget real by putting it in code, where it cannot drift.

What It Does

Enforce moves every limit into configuration and applies it at prompt assembly: truncate, summarize, or reject when a component would exceed its budget, and degrade gracefully when the whole window would overflow.
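
A small sketch of what enforcement at prompt assembly might look like, assuming limits live in one configuration table with a per-component overflow policy. The `Limit` type, the `LIMITS` table, the policy names, and the word-based `count_tokens` stand-in are all hypothetical; a summarize policy is omitted because it would need a model call.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a component overflows and its policy is to reject."""


@dataclass(frozen=True)
class Limit:
    max_tokens: int
    on_overflow: str  # "truncate" or "reject"


# Central configuration: every limit lives here, not in prompt-building code.
LIMITS = {
    "system_prompt": Limit(max_tokens=600, on_overflow="reject"),
    "history": Limit(max_tokens=2_000, on_overflow="truncate"),
}


def count_tokens(text: str) -> int:
    # Stand-in only: counts whitespace-separated words, not real tokens.
    return len(text.split())


def enforce(name: str, text: str, limits: dict[str, Limit] = LIMITS) -> str:
    """Return text that respects the component's limit, or raise."""
    limit = limits[name]
    if count_tokens(text) <= limit.max_tokens:
        return text
    if limit.on_overflow == "truncate":
        words = text.split()
        # Keep the most recent words, which suits chat history.
        return " ".join(words[len(words) - limit.max_tokens:])
    # Any other policy is treated as reject.
    raise BudgetExceeded(f"{name} exceeds {limit.max_tokens} tokens")


demo_limits = {"history": Limit(3, "truncate"), "system_prompt": Limit(2, "reject")}
print(enforce("history", "a b c d e", demo_limits))  # c d e
```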

Why It Matters

A budget that exists only in a design document erodes as features change and developers move on. Enforcement is what keeps the gains from the other four stages from quietly reversing. Its place in a recurring routine is shown in The Token Budget Management and Optimization Checklist for 2026.

When It Matters Most

Enforce matters most in systems with many contributors and long lifespans, because that is where an unenforced budget decays fastest.

Running RAACE in Practice

Knowing the five stages is one thing; running them on a real feature without losing the thread is another. A few practical notes make the model easier to apply.

Run It Once on Paper First

Before writing any enforcement code, run RAACE on paper for the feature. Reserve a number for output, write down the input ceiling, sketch the apportionment across components, and note which components will need compression to fit. This paper pass takes minutes and surfaces conflicts — a component that cannot possibly fit its share — before you have invested in implementation. The design emerges from the budget rather than the budget being retrofitted to the design.
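
The paper pass can also be jotted down as a few lines of arithmetic. The sketch below compares assumed size estimates against sub-budgets and reports which components would need compression and by how much; every number and the `paper_pass` name are illustrative.

```python
def paper_pass(budgets: dict[str, int], estimates: dict[str, int]) -> dict[str, int]:
    """Return the overage, in tokens, for each component that exceeds its budget."""
    return {
        name: estimates[name] - budget
        for name, budget in budgets.items()
        if estimates[name] > budget
    }


budgets = {"system_prompt": 600, "retrieval": 4_000, "history": 2_000, "user_message": 400}
estimates = {"system_prompt": 550, "retrieval": 6_500, "history": 1_800, "user_message": 300}

print(paper_pass(budgets, estimates))  # {'retrieval': 2500}
```

Here only retrieval overflows its share, so it is the component to compress or to re-apportion before any implementation work begins.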

Revisit Apportionment as Behavior Teaches You

The first apportionment is a hypothesis about which components add the most value. Production teaches you whether it was right. If answers suffer when retrieval is squeezed, retrieval was undervalued and deserves a larger share at history's expense. Treat the split as a starting point you tune against observed quality, not a fixed verdict.

Let Stages Map to Code Boundaries

The five stages map cleanly onto distinct places in code: where you set the output cap, where you compute the input ceiling, where you assign per-component budgets, where you compress each component, and where you enforce the limits. Keeping those concerns separated, rather than tangled into one prompt-building function, makes each stage independently inspectable and tunable. The enforcement boundary in particular benefits from the central configuration discussed in Token Budget Management and Optimization: Best Practices That Actually Work.
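
As a rough illustration of that separation, the sketch below gives each stage its own small function and wires them together in one assembly step. All names, the share-based apportionment, and the word-based token count are assumptions made to keep the example short, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetConfig:
    context_window: int
    max_output_tokens: int
    shares: dict  # component name -> fraction of the input ceiling


def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer


def reserve(cfg: BudgetConfig) -> int:
    return cfg.max_output_tokens


def allocate(cfg: BudgetConfig, reserved: int) -> int:
    return cfg.context_window - reserved


def apportion(cfg: BudgetConfig, ceiling: int) -> dict:
    return {name: int(ceiling * share) for name, share in cfg.shares.items()}


def compress(text: str, budget: int) -> str:
    return " ".join(text.split()[:budget])  # naive head truncation


def enforce(parts: dict, ceiling: int) -> str:
    if sum(count_tokens(p) for p in parts.values()) > ceiling:
        raise ValueError("assembled input exceeds the input ceiling")
    return "\n\n".join(parts.values())


def build_prompt(cfg: BudgetConfig, components: dict) -> str:
    reserved = reserve(cfg)
    ceiling = allocate(cfg, reserved)
    budgets = apportion(cfg, ceiling)
    parts = {name: compress(text, budgets[name]) for name, text in components.items()}
    return enforce(parts, ceiling)


cfg = BudgetConfig(context_window=20, max_output_tokens=8,
                   shares={"system": 0.25, "user": 0.75})
prompt = build_prompt(cfg, {
    "system": "You are a terse assistant today",
    "user": "Summarize the report",
})
print(prompt)
```

Because each stage is its own function, any one of them can be replaced (for example, swapping the naive `compress` for summarization) without touching the others.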

Frequently Asked Questions

Why reserve output before allocating input?

Because output is usually the pricier side and the most likely to overflow the window. Reserving it first guarantees room to answer and forces input to fit the remainder, preventing the common no-room-to-respond failure.

How is Allocate different from Apportion?

Allocate sets the total ceiling for all input combined. Apportion divides that ceiling among individual components. One draws the boundary; the other distributes the space inside it.

What if a component cannot fit its sub-budget even after compression?

Either raise its sub-budget by taking tokens from a lower-value component, or accept a clear degradation such as dropping the least relevant context. Apportioning by value tells you which component should yield.
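
One possible way to express that choice in code is a rebalancing step that moves tokens from lower-value components to the one that cannot fit, and reports any shortfall that remains. The value scores, per-component floors, and the `rebalance` helper are hypothetical.

```python
def rebalance(budgets, values, needy, shortfall, floors):
    """Shift tokens to `needy` from lower-value donors without going below floors.

    Returns the new budgets and the shortfall still unmet (0 if fully covered).
    """
    new = dict(budgets)
    donors = sorted((n for n in new if n != needy), key=lambda n: values[n])
    remaining = shortfall
    for donor in donors:
        if remaining == 0 or values[donor] >= values[needy]:
            break  # covered, or only equal/higher-value components are left
        give = min(remaining, new[donor] - floors.get(donor, 0))
        if give > 0:
            new[donor] -= give
            new[needy] += give
            remaining -= give
    return new, remaining


budgets = {"retrieval": 4_000, "history": 2_000, "user_message": 400}
values = {"retrieval": 2, "history": 1, "user_message": 3}  # assumed ranking
new_budgets, unmet = rebalance(budgets, values, "retrieval", 500, {"history": 1_000})
print(new_budgets, unmet)
# {'retrieval': 4500, 'history': 1500, 'user_message': 400} 0
```

A nonzero `unmet` value is the signal to fall back to the degradation path, such as dropping the least relevant context.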

Does RAACE apply to non-chat features?

Yes. Any feature that divides a fixed window among components and pays per token benefits. What changes is which stage carries the most weight — Reserve for generative features, Apportion for competing-component features, and so on.

Can I skip Enforce if I am careful?

Not for long. Careful intentions drift as features evolve and people change. Enforcement in code and configuration is what keeps the budget from quietly decaying back to its old state.

Key Takeaways

  • RAACE structures token budgeting into Reserve, Allocate, Apportion, Compress, and Enforce, run in that order.
  • Reserve output first because output is usually the pricier side and the most prone to overflowing the window.
  • Allocate sets a total input ceiling; Apportion divides it among components by the value they add.
  • Compress fits each component into its sub-budget while preserving meaning, balancing cost against quality.
  • Enforce moves limits into code and configuration so the gains from earlier stages do not decay.
