A mid-sized agency inherited a shared inbox from a client they had just onboarded for managed operations. The inbox held roughly forty thousand unprocessed emails spanning eighteen months: invoices, partnership pitches, customer complaints, internal forwards, and an enormous volume of automated noise. The client wanted it triaged into a handful of actionable categories within two weeks, and they were not willing to fund a labeling project to get there.
This is the story of how the team used zero-shot classification prompting to clear that backlog, what decisions shaped the build, where it nearly went wrong, and what the numbers said at the end. The point is not that the technique is flawless. It is that with no labeled data and a hard deadline, a carefully built zero-shot pipeline was the only credible path, and it held.
The narrative below follows the actual arc: the situation that forced the decision, the choices made under constraint, the execution, the measured result, and the lessons the team carried into later projects.
The Situation and the Constraint
Why labeling was off the table
A traditional supervised classifier would have needed thousands of hand-labeled emails. With one analyst available and a two-week window, labeling enough data to train and validate a model was impossible. The team estimated that even a minimal labeling effort would consume the entire timeline before a single email was sorted.
The categories the client actually cared about
The client did not want a fine-grained taxonomy. They wanted five buckets: needs a reply, billing or finance, vendor or partnership, spam or automated, and archive. Crucially, these were operationally distinct, which made the task a strong candidate for zero-shot classification. The team recognized this matched the conditions described in Classifying Support Tickets Without a Single Labeled Example, where clear category boundaries predict success.
The Decision
The lead made an explicit bet: build a zero-shot pipeline, validate it against a small hand-checked sample, and only escalate to few-shot or fine-tuning if accuracy fell below an agreed threshold. The threshold was set before any code was written, which mattered later because it removed the temptation to declare victory based on vibes.
- Target: ninety percent agreement with human judgment on a 300-email audit sample
- Fallback: add few-shot examples for any category scoring below eighty-five percent
- Hard stop: human review for any email the model flagged as low confidence
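The pre-committed bar can be written down as a small check over the audit sample. The sketch below is a minimal illustration, not the team's actual harness: the function names, the (human_label, model_label) pair format, and the choice to group per-category agreement by the human label are all assumptions.

```python
from collections import defaultdict

TARGET_OVERALL = 0.90  # agreement target on the audit sample
FALLBACK_LINE = 0.85   # per-category line that triggers few-shot examples


def audit_agreement(pairs):
    """Return (overall, per_category) agreement for (human, model) label pairs.

    Per-category agreement is grouped by the human label (an assumption).
    """
    if not pairs:
        raise ValueError("audit sample is empty")
    totals = defaultdict(int)
    hits = defaultdict(int)
    for human, model in pairs:
        totals[human] += 1
        if human == model:
            hits[human] += 1
    overall = sum(hits.values()) / len(pairs)
    per_category = {cat: hits[cat] / totals[cat] for cat in totals}
    return overall, per_category


def check_thresholds(pairs):
    """Compare audit results against the thresholds fixed before the build."""
    overall, per_category = audit_agreement(pairs)
    needs_few_shot = sorted(
        cat for cat, score in per_category.items() if score < FALLBACK_LINE
    )
    return {
        "overall": overall,
        "meets_target": overall >= TARGET_OVERALL,
        "needs_few_shot": needs_few_shot,
    }
```

Keeping the two constants at the top of the file, committed before any prompt exists, is one way to make the bar hard to move quietly later.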
The Execution
Building the prompt
The prompt described each of the five categories in two sentences, instructed the model to return exactly one label plus a confidence rating, and required a one-line justification. The justification was not stored, but requiring the model to produce it improved the labels, an effect also documented in this cluster's prompt-design work.
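A prompt of this shape could look roughly like the sketch below. Everything specific here is an assumption for illustration: the category wording is placeholder text, not the team's descriptions; the high/medium/low confidence scale and the three-line reply format are invented; and no particular model API is implied, so the code only builds the prompt string and parses a reply.

```python
# Placeholder two-sentence descriptions (hypothetical wording).
CATEGORIES = {
    "needs_reply": "A person is asking a question or waiting on an answer. "
                   "Someone on the team has to respond.",
    "billing": "The email is an invoice, receipt, or payment query. "
               "It concerns money already owed or paid.",
    "vendor": "The sender is pitching a product, service, or partnership. "
              "No payment is currently due.",
    "spam": "The email is unsolicited bulk mail or an automated notification. "
            "No human expects a response.",
    "archive": "The email is informational and needs no action. "
               "It is kept only for the record.",
}
# The source notes the prompt allowed an "uncertain" output for mixed mail.
ALLOWED_LABELS = set(CATEGORIES) | {"uncertain"}
CONFIDENCE_LEVELS = ("high", "medium", "low")  # assumed scale


def build_prompt(email_text):
    """Assemble the zero-shot classification prompt for one email."""
    lines = ["Classify the email into exactly one of these categories:"]
    for name, description in CATEGORIES.items():
        lines.append(f"- {name}: {description}")
    lines += [
        "If the email fits none of them cleanly, use the label uncertain.",
        "",
        "Reply in exactly this format:",
        "label: <category name>",
        "confidence: <high|medium|low>",
        "reason: <one line>",
        "",
        "Email:",
        email_text,
    ]
    return "\n".join(lines)


def parse_response(text):
    """Return (label, confidence), or None if the reply is malformed.

    The reason line is requested but deliberately not kept.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields.setdefault(key.strip().lower(), value.strip())
    label = fields.get("label", "").lower()
    confidence = fields.get("confidence", "").lower()
    if label not in ALLOWED_LABELS or confidence not in CONFIDENCE_LEVELS:
        return None
    return label, confidence
```

Returning None for a malformed reply, rather than guessing a label, lets the caller send that email to human review.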
Handling scale and cost
Forty thousand emails through a frontier model would have blown the budget. The team used a tiered approach: a small, cheap model handled the obvious spam and automated mail, which was the majority of volume, and only the remainder went to a stronger model. This decision drew directly on the cost reasoning in Defending the Spreadsheet When You Skip the Labeling Budget.
The near-failure
The first audit sample showed the billing category at seventy-eight percent, below the fallback line. Investigation revealed the model was confusing vendor invoices with partnership pitches because both mentioned money and contracts. The fix was not few-shot examples but a sharper category description that explicitly contrasted the two. Accuracy on billing jumped to ninety-one percent on the next audit.
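Spotting which two categories are being mixed up is easier with a count of disagreements than with a single accuracy number. The following is a minimal sketch under the assumption that the audit is stored as (human_label, model_label) pairs; the function name is hypothetical.

```python
from collections import Counter


def top_confusions(pairs, limit=3):
    """Return the most frequent (human, model) disagreements with counts."""
    confusions = Counter(
        (human, model) for human, model in pairs if human != model
    )
    return confusions.most_common(limit)
```

If one pair such as ("billing", "vendor") dominates the list, that points at a boundary problem between two descriptions rather than a general weakness, which is the kind of problem a contrastive description can address.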
The Outcome
What the numbers showed
Across the full 300-email audit, the pipeline agreed with human reviewers ninety-three percent of the time. The spam and archive categories were near-perfect. The needs-a-reply category, the one the client cared about most, hit ninety-five percent recall, meaning very few real action items were buried.
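Per-category figures like the one for needs-a-reply depend on which metric is computed. Precision is the share of emails the model labeled with a category that truly belong to it; recall is the share of emails that truly belong to the category that the model caught, and it is recall that reflects buried action items. A minimal sketch, again assuming (human_label, model_label) pairs and a hypothetical function name:

```python
def precision_recall(pairs, category):
    """Return (precision, recall) for one category; None where undefined."""
    tp = sum(1 for h, m in pairs if h == category and m == category)
    fp = sum(1 for h, m in pairs if h != category and m == category)
    fn = sum(1 for h, m in pairs if h == category and m != category)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall
```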
What it cost and saved
The entire classification run, including the tiered model strategy, cost a small fraction of what a two-analyst labeling-and-review effort would have. The backlog cleared in nine days, ahead of the deadline, and the client adopted the pipeline as a standing intake filter.
The Lessons
Set the success threshold before building
The pre-committed accuracy bar was the single most valuable decision. It turned a subjective judgment into a measurable one and forced the billing fix that a casual review would have missed.
Sharper descriptions beat more examples
The instinct when accuracy lags is to add examples. In this case, the cheaper and faster fix was a better category definition. Examples are a tool, not a reflex, a point reinforced in Deciding Among No Labels, Few Labels, and Fine-Tuning.
Measure the category that matters most
The team did not treat all five categories as equal. The needs-a-reply bucket carried the business risk, because a missed action item meant an ignored customer. They watched its recall specifically and would have held the launch on that number alone even if the overall average looked fine. Weighting your audit toward the category that carries the consequences is a habit worth keeping.
How the Pipeline Was Built in Practice
The tiered routing logic
The first pass ran every email through a small, inexpensive model with a single instruction: is this spam or automated mail, yes or no. Roughly two-thirds of the volume was cleared at this stage for almost nothing. Only the remaining third, the mail that might need human attention, went to the stronger model for the full five-way classification. This two-stage structure is what kept the project inside budget.
Why the cheap first pass was safe
A false positive at the cheap stage, meaning real mail wrongly tagged as spam or automated, would have been the dangerous error. The team guarded against it by making the spam test conservative: only mail the small model was highly confident was automated got filtered, and a sample of the filtered set was audited separately. That audit confirmed the cheap pass was not burying real messages.
- Cheap binary pass clears the bulk of obvious noise
- Conservative threshold prevents real mail being lost
- Separate audit of the filtered set verifies safety
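The three points above can be sketched as a routing function plus a sampler for the separate audit. This is an illustration under stated assumptions, not the team's code: the two classifiers are passed in as plain callables, the cheap check is assumed to return an (is_spam, confidence) pair with a numeric confidence, and the 0.95 floor is a made-up value.

```python
import random

SPAM_CONFIDENCE_FLOOR = 0.95  # hypothetical "highly confident" cutoff


def route(email_text, cheap_spam_check, strong_classifier):
    """Return (label, tier) for one email.

    cheap_spam_check(text) -> (is_spam: bool, confidence: float)
    strong_classifier(text) -> label string
    Mail is filtered at the cheap tier only when the small model says
    spam AND is highly confident; everything else goes to the strong tier.
    """
    is_spam, confidence = cheap_spam_check(email_text)
    if is_spam and confidence >= SPAM_CONFIDENCE_FLOOR:
        return "spam_or_automated", "cheap"
    return strong_classifier(email_text), "strong"


def sample_for_audit(filtered_ids, k, seed=0):
    """Draw a reproducible random sample of filtered emails to hand-check."""
    ids = list(filtered_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))
```

Raising the floor sends more mail to the stronger model and costs more; lowering it saves money but increases the chance of filtering a real message, which is why the filtered set gets its own audit.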
Handling the long tail
A small residue of emails fit none of the five categories cleanly, forwarded chains with mixed content being the worst offenders. Rather than force a label, the prompt allowed an uncertain output that routed these to the analyst. Allowing the model to say it did not know, instead of guessing, was a deliberate design choice that protected the quality of the confident labels.
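One way to wire this up is to treat an explicit uncertain label, a low confidence rating, or an unparseable reply as the same signal: send it to a person. The sketch below assumes results arrive as (email_id, label, confidence) tuples with a high/medium/low confidence scale; both the tuple layout and the scale are illustrative assumptions.

```python
def needs_human_review(label, confidence):
    """True when the model declined to commit or its reply was unusable."""
    return label is None or label == "uncertain" or confidence == "low"


def split_for_review(results):
    """Split (email_id, label, confidence) tuples into auto and review queues."""
    auto, review = [], []
    for email_id, label, confidence in results:
        if needs_human_review(label, confidence):
            review.append(email_id)
        else:
            auto.append(email_id)
    return auto, review
```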
What the Team Carried Forward
A reusable template
The project produced a reusable pattern the agency applied to later engagements: pre-commit a threshold, build zero-shot first, tier models by difficulty, sharpen descriptions before reaching for examples, and audit the highest-stakes category hardest. None of these steps is exotic, but applying them in order is what separated a controlled build from a hopeful one.
Where they would push further next time
In hindsight, the team would have built the audit harness before the prompt rather than after, so that every prompt change could be measured immediately rather than at a checkpoint. Treating measurement as infrastructure from the first hour, as argued in Reading the Signal When Your Classifier Never Saw Training Data, would have shortened the iteration loop on the billing fix.
Frequently Asked Questions
Could the same result have been achieved with a trained classifier?
Eventually, yes, and possibly with a higher accuracy ceiling. But not within the deadline and not without a labeling budget the client refused to fund. Zero-shot was the right tool for the constraint, not necessarily the best tool in the abstract.
How did the team validate without labeled data?
They hand-labeled a 300-email audit sample after the fact, not for training but purely for measurement. This is the minimum viable validation step and it is non-negotiable. A classifier you cannot measure is a liability.
What happened to the low-confidence emails?
They were routed to the single analyst for human review. This was a small fraction of total volume and kept the human in the loop exactly where the model was unsure, which is the correct division of labor.
Did accuracy hold over time?
As new mail kept arriving, accuracy stayed stable because the categories were durable. The team scheduled a quarterly re-audit to catch any drift in incoming email patterns, which is the maintenance discipline any production classifier needs.
Key Takeaways
- A hard deadline and no labeling budget made zero-shot classification the only credible option, and it cleared a 40,000-email backlog in nine days.
- Pre-committing to a measurable accuracy threshold removed subjectivity and forced a critical fix the team would otherwise have missed.
- A tiered model strategy, with a cheap model clearing obvious spam and automated mail and a stronger model handling the rest, kept costs to a small fraction of a manual effort.
- Sharper category descriptions, not more examples, resolved the worst confusion between billing and partnership mail.
- Post-hoc human audit and routing of low-confidence cases to a person are the safeguards that make an untrained classifier trustworthy.