There is a difference between understanding zero-shot classification and actually building a zero-shot classifier that works. This article is about the second. It is a step-by-step procedure you can follow from start to finish to produce a classifier that sorts your text into your categories with measurable accuracy. No theory dumps, no detours: just do this, then this, then this.
The procedure assumes you have a classification task in mind: a pile of text and a set of categories you want each item sorted into. Support tickets into types, feedback into themes, documents into topics — the steps are the same regardless. Each step has a concrete action and a way to tell whether you did it right before moving on.
Work through them in sequence. Skipping ahead is the most common reason classifiers come out unreliable, because each step removes a source of error that the next step depends on.
Step 1: Write Down Your Categories
Before touching a model, list the categories on paper. This forces clarity you will otherwise skip.
The Action
Write each category name followed by a one-sentence definition of what belongs in it. Add an "other" category for text that fits nowhere. If you cannot define a category in one clear sentence, it is too fuzzy and needs splitting or merging.
- List every category name
- Write a one-line definition for each
- Add an explicit "other" or "none" category
The Check
Read your definitions and ask whether any two overlap. If "complaint" and "negative feedback" could both apply to the same text, redefine them until they are distinct. Distinct categories are the precondition for everything that follows, as explained in the from-scratch introduction to zero-shot classification.
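As a minimal sketch, the category list can live in a small data structure that is checked mechanically. The category names and definitions below are hypothetical examples for a support-ticket task, and `check_categories` is an illustrative helper, not part of any library. It covers only the mechanical parts of the check (an "other" bucket exists, each definition is roughly one sentence). Whether two definitions overlap still takes a human read.

```python
# Hypothetical categories for a support-ticket task; replace with your own.
CATEGORIES = {
    "billing": "The customer asks about charges, invoices, refunds, or payment methods.",
    "bug_report": "The customer describes the product behaving incorrectly or crashing.",
    "feature_request": "The customer asks for functionality the product does not have.",
    "other": "The text fits none of the categories above.",
}


def check_categories(categories: dict[str, str]) -> list[str]:
    """Return a list of problems found; an empty list means the basics hold."""
    problems = []
    if "other" not in categories and "none" not in categories:
        problems.append('missing an explicit "other" or "none" category')
    for name, definition in categories.items():
        text = definition.strip()
        if not text:
            problems.append(f"{name}: no definition")
        # Crude one-sentence test: more than one sentence-ending mark is suspicious.
        elif sum(text.count(mark) for mark in ".!?") > 1:
            problems.append(f"{name}: definition looks longer than one sentence")
    return problems


print(check_categories(CATEGORIES))
```

The punctuation count is a rough heuristic and will misfire on definitions that contain abbreviations, so treat a flagged definition as a prompt to reread it rather than as an error.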
Step 2: Draft the Prompt
Now turn the categories into an instruction the model can follow.
The Action
Write a prompt with four parts in order: the task ("Classify the following text into exactly one category"), the labeled list with definitions, a placeholder for the input, and a strict output instruction ("Respond with only the category name").
- State the task plainly
- Include the labels with their definitions
- End with a tight output format rule
The Check
Read the prompt as if you were the model. Is it obvious what to do, what the options are, and how to answer? If anything is ambiguous, tighten it now. Ambiguity here becomes errors later.
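One way to assemble the four parts is a small template function. This is a sketch under assumptions: the category names, the definitions, and the `build_prompt` helper are all hypothetical, and the wording of the task and output lines is taken from the examples in this step.

```python
# Hypothetical categories; replace with the list you wrote in Step 1.
CATEGORIES = {
    "billing": "The customer asks about charges, invoices, refunds, or payment methods.",
    "bug_report": "The customer describes the product behaving incorrectly or crashing.",
    "other": "The text fits none of the categories above.",
}


def build_prompt(text: str, categories: dict[str, str]) -> str:
    """Assemble task, labeled list, input, and output rule, in that order."""
    label_lines = "\n".join(
        f"- {name}: {definition}" for name, definition in categories.items()
    )
    return (
        "Classify the following text into exactly one category.\n\n"
        f"Categories:\n{label_lines}\n\n"
        f"Text:\n{text}\n\n"
        "Respond with only the category name."
    )


print(build_prompt("I was charged twice this month.", CATEGORIES))
```

Keeping the prompt in a function rather than pasting it by hand means every input is classified with exactly the same instruction, which matters once you start measuring.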
Step 3: Test on a Handful of Inputs
Do not classify everything yet. Run a small batch first.
The Action
Pick five to ten varied inputs, including at least one you expect to be tricky. Run each through the prompt and look at the answers.
- Choose inputs that span your categories
- Include a deliberately ambiguous case
- Read every output, not just the count
The Check
Did the model return only the label, in the expected format? Did the obvious cases come out right? If the format is off, fix the output instruction. If easy cases are wrong, your definitions need work. This early check catches most problems cheaply, before they scale.
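The small-batch run can be sketched as a loop that records, for each input, the raw answer, whether it matches the expected format, and whether it matches what you expected. Everything here is illustrative: `fake_classify` is a keyword-based stand-in for a real model call so the example runs offline, and the sample texts and labels are made up.

```python
from typing import Callable, Optional

ALLOWED = {"billing", "bug_report", "feature_request", "other"}


def fake_classify(text: str) -> str:
    """Stand-in for a real model call so the sketch runs offline (hypothetical)."""
    lowered = text.lower()
    if "charged" in lowered or "invoice" in lowered:
        return "billing"
    if "crash" in lowered:
        return "Category: bug_report"  # deliberately breaks the output rule
    return "other"


def run_small_batch(
    samples: list[tuple[str, Optional[str]]],
    classify: Callable[[str], str],
    allowed: set[str],
) -> list[dict]:
    rows = []
    for text, expected in samples:
        raw = classify(text)
        rows.append({
            "text": text,
            "raw": raw,
            "format_ok": raw in allowed,
            # None means "no expected label", used for the ambiguous case.
            "correct": None if expected is None else raw == expected,
        })
    return rows


SAMPLES = [
    ("I was charged twice this month.", "billing"),
    ("The app crashes when I open settings.", "bug_report"),
    ("Please add dark mode.", "feature_request"),
    ("Love it, but the invoice page is confusing.", None),  # deliberately tricky
]

rows = run_small_batch(SAMPLES, fake_classify, ALLOWED)
for row in rows:
    print(row)  # read every row, not just a summary count
```

In practice you would pass your own function in place of `fake_classify`, one that sends the prompt to the model and returns the raw response text.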
Step 4: Constrain and Clean the Output
Make the output reliably machine-readable so you can use it at scale.
The Action
If you saw any stray explanations or formatting variation, tighten the output rule — specify the exact label spelling and that nothing else should appear. For programmatic use, ask for structured output like a JSON field with the label.
- Pin the exact allowed label values
- Forbid commentary or hedging
- Use structured output for automated pipelines
The Check
Run the small batch again and confirm every output is a clean, parseable label from your list. Unconstrained output is a top failure mode, covered in depth in Eight Quiet Ways Zero-Shot Classifiers Go Wrong.
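For the structured-output route, a strict parser makes "clean, parseable label from your list" a testable property. The sketch below assumes you have asked the model to respond with a JSON object of the form {"label": "..."}. That shape, the label values, and the `parse_label` helper are assumptions for illustration.

```python
import json
from typing import Optional

# Hypothetical label list; the spelling here is the only spelling accepted.
ALLOWED_LABELS = ("billing", "bug_report", "feature_request", "other")


def parse_label(raw: str, allowed: tuple = ALLOWED_LABELS) -> Optional[str]:
    """Return the label if raw is exactly {"label": <allowed value>}, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # commentary, bare words, or broken JSON
    if not isinstance(data, dict) or set(data) != {"label"}:
        return None  # extra fields such as an explanation are rejected
    label = data["label"]
    return label if label in allowed else None


for raw in ['{"label": "billing"}', 'Sure! {"label": "billing"}', '{"label": "Billing"}']:
    print(repr(raw), "->", parse_label(raw))
```

Returning `None` instead of guessing keeps malformed outputs visible, so you can count them and decide whether to retry or send them to review.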
Step 5: Build a Validation Set
You cannot claim the classifier works until you have measured it. This step creates the measuring stick.
The Action
Label a few hundred representative inputs by hand with the correct category. This is your ground truth. Spread the examples across all categories so each one is tested.
- Hand-label a few hundred varied inputs
- Cover every category, including "other"
- Keep this set fixed so results are comparable over time
The Check
Confirm your set includes examples of every category and some genuinely hard cases. A validation set that only contains easy inputs will overstate your accuracy.
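The coverage part of this check is easy to automate. The sketch below counts hand-labeled examples per category and flags categories that are thin or labels that are not in your list. The `coverage_report` helper, the sample data, and the `min_per_category` threshold are illustrative assumptions; the right minimum depends on your task. Whether the set contains genuinely hard cases is still a judgment call.

```python
from collections import Counter

# Hypothetical category list from Step 1.
CATEGORIES = ["billing", "bug_report", "feature_request", "other"]


def coverage_report(labeled, categories, min_per_category=20):
    """Summarize how many hand-labeled examples each category has."""
    counts = Counter(label for _, label in labeled)
    return {
        "counts": {c: counts.get(c, 0) for c in categories},
        # Categories with fewer examples than the chosen minimum.
        "thin": {c: counts.get(c, 0) for c in categories
                 if counts.get(c, 0) < min_per_category},
        # Labels that are not in the category list, usually typos.
        "unknown_labels": sorted(set(counts) - set(categories)),
    }


labeled = [
    ("I was charged twice.", "billing"),
    ("Refund my last invoice.", "billing"),
    ("The app crashes on launch.", "bug_report"),
    ("asdf", "othr"),  # typo in the label, should be caught
]

report = coverage_report(labeled, CATEGORIES, min_per_category=2)
print(report)
```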
Step 6: Measure and Fix
Run the classifier against the validation set and act on what you find.
The Action
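Classify the whole validation set and compare the predictions to your labels. Compute accuracy per category (of the inputs you labeled with a category, the fraction the classifier also assigned to it), not just overall. Look at which categories get confused with which others.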
- Compute per-category accuracy
- Examine the specific confusions
- Tighten the definitions of confused categories and re-run
The Check
Are the per-category numbers acceptable for your use? If one category is weak, sharpen its definition or add a clarifying example, then re-measure. This loop is the heart of the disciplined approach in What Reliable Zero-Shot Classifiers Have in Common.
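A minimal sketch of the measurement, using only the standard library. Here per-category accuracy means: of the items you labeled with a category, the fraction the classifier also assigned to it. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def per_category_accuracy(truth, predicted):
    """Fraction of each true category that the classifier labeled correctly."""
    totals, hits = Counter(), Counter()
    for t, p in zip(truth, predicted):
        totals[t] += 1
        hits[t] += int(t == p)
    return {label: hits[label] / totals[label] for label in totals}


def top_confusions(truth, predicted, n=3):
    """Most frequent (true label, predicted label) pairs among the errors."""
    pairs = Counter((t, p) for t, p in zip(truth, predicted) if t != p)
    return pairs.most_common(n)


truth = ["billing", "billing", "bug_report", "bug_report", "bug_report", "other"]
predicted = ["billing", "other", "bug_report", "feature_request", "feature_request", "other"]

print(per_category_accuracy(truth, predicted))
print(top_confusions(truth, predicted))
```

In this toy data the overall accuracy is 50 percent, but the breakdown shows the real story: "bug_report" is right only a third of the time and is being mistaken for "feature_request", so those two definitions are the ones to sharpen.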
Step 7: Deploy With Guardrails
Move from a tested prompt to something you can run on real volume safely.
The Action
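Pin the model and prompt version, use low-randomness settings for stable output, log inputs and outputs, and route uncertain results to human review where the stakes justify it. Uncertain here means "other" results, outputs that fail the format check, and low-confidence results if your setup produces a confidence signal.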
- Version-pin model and prompt together
- Log everything for auditing
- Route uncertain cases to a human
The Check
Confirm you can reproduce the same output for the same input and that you have visibility into what the classifier is doing in production. The complete production picture is laid out in the end-to-end walkthrough of classifying with no labeled data.
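The guardrails can be sketched as three small helpers: a prompt version derived from the prompt text, a routing rule, and a log record that ties each result to the pinned versions. All names here are hypothetical, including the model identifier. This sketch treats an unparseable output (`None`) and "other" as the uncertain cases, which is an assumption. How to request low randomness (for example, a temperature setting) depends on your provider and is not shown.

```python
import hashlib
import json
import time

MODEL_ID = "example-model-v1"  # hypothetical pinned model identifier
PROMPT_TEMPLATE = (
    "Classify the following text into exactly one category.\n"
    "Text:\n{text}\n"
    "Respond with only the category name."
)


def prompt_version(template: str) -> str:
    """Short hash of the prompt text, so any edit yields a new version."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def route(label):
    """Send unparseable (None) or 'other' results to a human; automate the rest."""
    return "human_review" if label is None or label == "other" else "auto"


def log_record(text, raw_output, label) -> str:
    """One JSON line per classification, suitable for an append-only log."""
    return json.dumps({
        "ts": time.time(),
        "model": MODEL_ID,
        "prompt_version": prompt_version(PROMPT_TEMPLATE),
        "input": text,
        "raw_output": raw_output,
        "label": label,
        "route": route(label),
    })


def is_reproducible(classify, text, runs=3) -> bool:
    """True if repeated calls on the same input return the same output."""
    return len({classify(text) for _ in range(runs)}) == 1


print(log_record("I was charged twice.", "billing", "billing"))
```

Hashing the prompt text is one simple way to pin it: the log then shows exactly which model and prompt produced each label, even after the prompt has been edited.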
Step 8: Set Up Ongoing Monitoring
A deployed classifier is not finished; it needs to be watched, because the text it sees will change over time.
The Action
Schedule a periodic re-measurement: pull a fresh sample of recent inputs, hand-label them, and run them through the classifier to check whether accuracy has held. Track the size of the "other" bucket over time, since a growing bucket signals that new kinds of input are arriving that your categories do not cover.
- Re-measure accuracy against fresh samples on a schedule
- Watch the "other" bucket as a drift indicator
- Keep a log of inputs and outputs to investigate problems
The Check
Confirm you have a recurring process, not a one-time check, and that someone owns it. A classifier that was accurate at launch can quietly degrade as the input distribution shifts, and the only way to catch that is to keep measuring. The disciplined version of this monitoring is part of What Reliable Zero-Shot Classifiers Have in Common.
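Tracking the "other" bucket can be as simple as computing its share per period from the logs and flagging periods that rise above a baseline. The sketch below is illustrative: the period keys, the baseline, and the `tolerance` value are assumptions you would set from your own launch measurements.

```python
def other_share(labels) -> float:
    """Fraction of results that landed in the 'other' bucket."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "other") / len(labels)


def drift_flags(labels_by_period, baseline_share, tolerance=0.05):
    """Periods whose 'other' share exceeds the baseline by more than tolerance."""
    return [
        period
        for period, labels in labels_by_period.items()
        if other_share(labels) > baseline_share + tolerance
    ]


# Hypothetical weekly label logs.
weekly = {
    "week-1": ["billing"] * 9 + ["other"] * 1,
    "week-2": ["billing"] * 8 + ["other"] * 2,
    "week-3": ["billing"] * 6 + ["other"] * 4,
}

for period, labels in weekly.items():
    print(period, other_share(labels))
print(drift_flags(weekly, baseline_share=0.10))
```

A flagged period is a reason to pull a fresh sample and hand-label it, not proof of a problem. The bucket share says nothing about errors inside the named categories, which is why the scheduled re-measurement is still needed.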
Frequently Asked Questions
How long does this whole procedure take?
For a straightforward task with clear categories, you can get through drafting and small-batch testing in under an hour. Building the validation set is the most time-consuming part, but it is also what makes the result trustworthy. Budget more time there and less everywhere else.
Can I skip the validation set if the early tests look good?
You can, but you will be shipping a classifier you cannot vouch for. The small-batch test catches obvious breakage; only the validation set tells you the real accuracy. For anything beyond a throwaway experiment, build the set.
What do I do when accuracy is stuck on one category?
Look at what it gets confused with. Usually the two definitions overlap, or the category is genuinely subtle. Sharpen the definition first; if that is not enough, add a clarifying example for that category specifically, which moves you toward few-shot for just the hard case.
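As a sketch of what adding a clarifying example to a single category might look like, the helper below appends examples only to the categories that need them and leaves the rest as plain definitions. The names, definitions, and example text are hypothetical.

```python
# Hypothetical categories and one clarifying example for the weak category.
CATEGORIES = {
    "bug_report": "The customer describes the product behaving incorrectly or crashing.",
    "feature_request": "The customer asks for functionality the product does not have.",
    "other": "The text fits none of the categories above.",
}
CLARIFYING_EXAMPLES = {
    "feature_request": ["It would be great if exports supported CSV."],
}


def format_category(name, definition, examples=()):
    """Render one category line, with optional examples indented beneath it."""
    line = f"- {name}: {definition}"
    for example in examples:
        line += f'\n  Example: "{example}"'
    return line


label_block = "\n".join(
    format_category(name, definition, CLARIFYING_EXAMPLES.get(name, ()))
    for name, definition in CATEGORIES.items()
)
print(label_block)
```

After a change like this, re-run the full validation set rather than only the weak category, since an example for one label can shift predictions for its neighbors.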
Should I classify into one category or allow several?
Decide this at Step 1. If an input can genuinely belong to multiple categories, design for multiple labels and instruct accordingly. If it should belong to exactly one, enforce that in the output rule. Mixing the two assumptions mid-build causes confusion.
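Whichever you choose, the output check can enforce it. The sketch below assumes the model is asked for a JSON object of the form {"labels": [...]} and uses a single `multi` flag to switch between the two designs. The response shape, the label values, and `parse_labels` are assumptions for illustration.

```python
import json
from typing import Optional

ALLOWED_LABELS = ("billing", "bug_report", "feature_request", "other")  # hypothetical


def parse_labels(raw: str, allowed=ALLOWED_LABELS, multi: bool = False) -> Optional[list]:
    """Return the list of labels if it satisfies the chosen design, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != {"labels"}:
        return None
    labels = data["labels"]
    if not isinstance(labels, list) or not labels:
        return None
    if any(not isinstance(label, str) or label not in allowed for label in labels):
        return None
    if len(set(labels)) != len(labels):
        return None  # duplicates are rejected
    if not multi and len(labels) != 1:
        return None  # single-label design: exactly one label
    return labels


raw = '{"labels": ["billing", "bug_report"]}'
print(parse_labels(raw, multi=True))
print(parse_labels(raw, multi=False))
```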
Key Takeaways
- Follow the steps in order; each removes an error source the next step relies on
- Define distinct, one-sentence categories with an explicit "other" before writing any prompt
- Test on a small varied batch first, then constrain output to a clean parseable label
- A hand-labeled validation set covering every category is what proves the classifier actually works
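- Deploy with version pinning, logging, low randomness, and human review for uncertain cases
- Re-measure against fresh hand-labeled samples on a schedule, and watch the "other" bucket for drift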