This is a composite case study drawn from a pattern we've seen repeat across many product teams. The names and exact figures are illustrative, but the arc — the slow ramp, the shocking invoice, the diagnosis, and the recovery — is real and common enough to be worth telling in full. If your AI bill is climbing faster than your usage feels like it should, you will recognize this story.
The subject is a mid-sized SaaS company that added an AI-powered assistant to their product. We'll follow them from the comfortable early days through the crisis and out the other side, paying attention to what each decision cost and what each fix saved. The point is not the specific numbers but the reasoning that turned a runaway bill back into a managed line item.
The Situation: A Feature That Worked Too Well
The team shipped an in-app assistant that could answer questions about the user's data, draft content, and walk people through workflows. It launched on the flagship model because that's what the prototype used and the results were excellent. Early on, with a few hundred users trying it out, the monthly bill sat around $1,200. Nobody worried about cost; the feature was a hit and the spend was a rounding error.
Then adoption took off. Over a quarter, usage roughly tripled as the assistant became a core part of the product. And the bill did not triple — it grew far faster, climbing to about $38,000 a month. The finance team flagged it, and the engineering lead was asked a question they couldn't answer: where is all this money going?
The Decision: Diagnose Before Cutting
The instinct under pressure is to slash — turn off the feature, throttle users, downgrade everything blindly. The engineering lead resisted that and instead spent two days instrumenting the system, because you cannot fix what you cannot see. They tagged every model call by feature and logged input tokens, output tokens, and the model used. This mirrors the approach in our Step-by-Step Approach.
The data told a clear story:
- Input tokens dominated. The assistant prepended a large system prompt plus retrieved user data to every single message. The repeated prefix alone was over 4,000 tokens per call.
- Everything ran on the flagship. Simple requests — "summarize this," "what's the status" — used the same expensive model as genuinely hard reasoning.
- Context kept growing. Long conversations carried full history verbatim, so later messages in a session were enormous.
- Output was unbounded. The assistant often wrote long, chatty responses nobody read in full.
Now the spend had a map. The optimization could be targeted instead of panicked.
The Execution: Four Changes in Sequence
The team rolled out fixes one at a time, measuring after each so they could attribute the savings precisely.
Change 1: Prompt caching on the stable prefix
The 4,000-token system prompt and instruction block were identical across calls. Enabling caching billed that repeated content at a steep discount. This was the single biggest win, cutting input cost on the assistant by well over half on its own. It took less than a day of work. The principle is detailed in our Best Practices article.
Change 2: Model routing by task difficulty
They built a lightweight router that sent simple requests — summaries, status lookups, short extractions — to a small model, reserving the flagship for complex reasoning. Roughly 65 percent of traffic moved to the cheaper model with no measurable quality drop, since those tasks never needed the flagship in the first place.
Change 3: Context trimming
Conversation history was summarized after a few turns rather than carried verbatim, and retrieved user data was cut to the most relevant fields. This shrank the average input on long sessions dramatically, attacking the cost that had been compounding most quietly.
Change 4: Output limits
They capped maximum response length and instructed the assistant to be concise. Because output is billed at several times the input rate, trimming verbose responses delivered savings out of proportion to the token count removed.
The Outcome: A Managed Line Item Again
After all four changes, the monthly bill fell from roughly $38,000 to about $9,500 — a reduction of around 75 percent — with no reduction in usage and no user complaints about quality. In fact, the concise outputs and faster small-model responses made the assistant feel snappier.
Just as importantly, the spend was now observable. Per-feature tagging meant the next time a number moved, the team would know why within minutes instead of weeks. The crisis became a permanent capability. For an ongoing version of this discipline, the team adopted our Checklist as a quarterly review.
The Lessons
The takeaways here generalize well beyond this one company. The bill grew faster than usage because the cost drivers — repeated context, flagship-everything, growing history, unbounded output — all scaled with engagement, not just user count. None of the fixes required sacrificing the feature; they required understanding where the money went and matching each lever to its driver. And the whole recovery was possible only because someone instrumented spend before touching anything.
There's a second-order lesson worth naming. The team's real failure wasn't any single technical choice — caching, routing, and trimming are all standard. The failure was the absence of a cost feedback loop. For a full quarter, spend grew with no one watching the trajectory, because the early bill was small enough to ignore and nothing forced anyone to look again. By the time finance raised the alarm, the problem was a crisis instead of a routine adjustment. Had the team logged per-feature spend and set a budget alert at launch, the same four fixes would have been applied gradually, as minor tuning, and the $38,000 invoice would never have existed. The expensive part of this story wasn't the wrong model; it was the blind quarter.
Frequently Asked Questions
Why did the bill grow faster than usage?
Because several cost drivers scaled with engagement on top of user count. More engaged users had longer conversations, which carried more history, which inflated input tokens per call. Combined with flagship-everything and unbounded output, the per-user cost rose as adoption deepened, compounding the headcount growth.
Which single change saved the most?
Prompt caching on the 4,000-token stable prefix. Because that content was identical across every call and billed at full input rate thousands of times a day, caching it at a steep discount produced the largest single drop, and it took less than a day to implement.
Was quality affected by the optimizations?
No, and arguably it improved. The simple requests moved to a small model were tasks the flagship was overqualified for, so quality held. The concise output limits and faster small-model responses made the assistant feel quicker, which users preferred over the previous long, chatty replies.
Why instrument before optimizing instead of just cutting costs immediately?
Because blind cuts risk degrading the feature while missing the real waste. Instrumentation revealed that input tokens and flagship overuse, not user count, drove the bill — a target that panicked throttling would have missed entirely. Two days of measurement made the rest of the work precise and safe.
Could this have been prevented?
Yes. A pre-build cost estimate and per-feature instrumentation from day one would have surfaced the trajectory long before the $38,000 invoice. The team rebuilt around exactly that practice afterward, turning a one-time crisis into a standing safeguard.
Key Takeaways
- AI bills can grow faster than usage when engagement-linked drivers compound.
- Instrument and diagnose before cutting; you cannot fix invisible spend.
- Prompt caching on a large stable prefix was the single biggest saving.
- Routing simple tasks to a small model moved most traffic off the flagship with no quality loss.
- Context trimming and output limits attacked the quietest, fastest-growing costs.
- The four changes together cut the bill roughly 75 percent with no drop in usage.