Sorting Text by Description Alone, One Step at a Time

There is a difference between understanding zero-shot classification and actually building a classifier that works. This article is about the second thing. It is a step-by-step procedure you can follow start to finish, in order, to produce a classifier that sorts your text into your categories with measurable accuracy. No theory dumps, no detours: just do this, then this, then this.

The procedure assumes you have a classification task in mind: a pile of text and a set of categories you want each item sorted into. Whether you are sorting support tickets into types, feedback into themes, or documents into topics, the steps are the same. Each step has a concrete action and a way to tell whether you did it right before moving on.

Work through them in sequence. Skipping ahead is the most common reason classifiers come out unreliable, because each step removes a source of error that the next step depends on.

Step 1: Write Down Your Categories

Before touching a model, list the categories on paper. This forces clarity you will otherwise skip.

The Action

Write each category name followed by a one-sentence definition of what belongs in it. Add an "other" category for text that fits nowhere. If you cannot define a category in one clear sentence, it is too fuzzy and needs splitting or merging.

List every category name
Write a one-line definition for each
Add an explicit "other" or "none" category
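
As a minimal sketch, the category list can be kept as plain data and checked mechanically before any prompt is written. The support-ticket categories below are hypothetical, and check_categories is an illustrative helper, not part of any library. Its one-sentence test is deliberately crude, and it cannot detect overlapping definitions, which still need a human read.

```python
# Hypothetical categories for a support-ticket task; names and definitions are illustrative.
CATEGORIES = {
    "billing": "Questions or problems about charges, invoices, or refunds.",
    "bug": "Reports that the product behaves incorrectly or crashes.",
    "feature_request": "Requests for functionality the product does not have.",
    "other": "Anything that fits none of the categories above.",
}


def check_categories(categories):
    """Return a list of problems found in the category definitions."""
    problems = []
    if "other" not in categories and "none" not in categories:
        problems.append('missing an explicit "other" or "none" category')
    for name, definition in categories.items():
        text = definition.strip()
        if not text:
            problems.append(f"{name}: empty definition")
        # Crude one-sentence check: a period followed by a space suggests a second sentence.
        elif text.rstrip(".").count(". ") > 0:
            problems.append(f"{name}: definition runs longer than one sentence")
    return problems


print(check_categories(CATEGORIES))  # an empty list means no problems were found
```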

The Check

Read your definitions and ask whether any two overlap. If "complaint" and "negative feedback" could both apply to the same text, redefine them until they are distinct. Distinct categories are the precondition for everything that follows, as explained in the from-scratch introduction to zero-shot classification.

Step 2: Draft the Prompt

Now turn the categories into an instruction the model can follow.

The Action

Write a prompt with four parts in order: the task ("Classify the following text into exactly one category"), the labeled list with definitions, a placeholder for the input, and a strict output instruction ("Respond with only the category name").

State the task plainly
Include the labels with their definitions
End with a tight output format rule
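
One way to assemble the four parts is a small template function. This is a sketch only: build_prompt and the example categories are hypothetical, and the exact wording of the task and output rule should be adapted to your own task.

```python
def build_prompt(categories, text):
    """Assemble the prompt parts in order: task, labels with definitions, input, output rule."""
    label_lines = "\n".join(
        f"- {name}: {definition}" for name, definition in categories.items()
    )
    return (
        "Classify the following text into exactly one category.\n\n"
        f"Categories:\n{label_lines}\n\n"
        f"Text:\n{text}\n\n"
        "Respond with only the category name."
    )


# Hypothetical categories, for illustration only.
example_categories = {
    "billing": "Questions or problems about charges, invoices, or refunds.",
    "bug": "Reports that the product behaves incorrectly or crashes.",
    "other": "Anything that fits none of the categories above.",
}

print(build_prompt(example_categories, "I was charged twice this month."))
```

Keeping the prompt in one function means the same text is used in testing, validation, and production, which makes later results comparable.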

The Check

Read the prompt as if you were the model. Is it obvious what to do, what the options are, and how to answer? If anything is ambiguous, tighten it now. Ambiguity here becomes errors later.

Step 3: Test on a Handful of Inputs

Do not classify everything yet. Run a small batch first.

The Action

Pick five to ten varied inputs, including at least one you expect to be tricky. Run each through the prompt and look at the answers.

Choose inputs that span your categories
Include a deliberately ambiguous case
Read every output, not just the count

The Check

Did the model return only the label, in the expected format? Did the obvious cases come out right? If the format is off, fix the output instruction. If easy cases are wrong, your definitions need work. This early check catches most problems cheaply, before they scale.
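
The small-batch check can be sketched as a loop that keeps the raw output next to each input, so every answer is read rather than just counted. Here stub_classify is a stand-in for a real model call and misbehaves on purpose for one input; all names and inputs are hypothetical.

```python
def run_small_batch(inputs, classify, labels):
    """Classify each input and record whether the raw output is exactly one allowed label."""
    rows = []
    for text in inputs:
        raw = classify(text)
        rows.append({"text": text, "raw": raw, "format_ok": raw in labels})
    return rows


def stub_classify(text):
    # Stand-in for a real model call; replace with your own client code.
    if "refund" in text:
        return "billing"
    if "crash" in text:
        return "Category: bug"  # deliberate format violation: extra words
    return "other"


LABELS = {"billing", "bug", "other"}
batch = [
    "I want a refund for last month",
    "The app crashes on login",
    "Hello there",
]

for row in run_small_batch(batch, stub_classify, LABELS):
    print(row["format_ok"], repr(row["raw"]), "<-", row["text"])
```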

Step 4: Constrain and Clean the Output

Make the output reliably machine-readable so you can use it at scale.

The Action

If you saw any stray explanations or formatting variation, tighten the output rule: specify the exact label spelling and state that nothing else should appear. For programmatic use, ask for structured output, such as a JSON object with a single field that holds the label.

Pin the exact allowed label values
Forbid commentary or hedging
Use structured output for automated pipelines
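
A minimal sketch of the cleaning side, assuming the prompt asks for a JSON object with a single "label" field. That field name and the allowed values below are assumptions made for illustration. Anything that does not parse to an allowed label is rejected rather than guessed at.

```python
import json

# Hypothetical allowed label values; pin these to match your prompt exactly.
ALLOWED = {"billing", "bug", "feature_request", "other"}


def parse_label(raw, allowed=ALLOWED):
    """Return the label from a reply like {"label": "bug"}, or None if the reply is invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all, e.g. a bare word or a sentence
    if not isinstance(data, dict):
        return None  # valid JSON but not an object
    label = data.get("label")
    if isinstance(label, str) and label in allowed:
        return label
    return None  # missing field, wrong type, or a label outside the pinned list


print(parse_label('{"label": "bug"}'))
print(parse_label("I think this is a bug."))
```

Returning None instead of raising keeps the caller in control: invalid outputs can be counted, retried, or sent to review.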

The Check

Run the small batch again and confirm every output is a clean, parseable label from your list. Unconstrained output is a top failure mode, covered in depth in Eight Quiet Ways Zero-Shot Classifiers Go Wrong.

Step 5: Build a Validation Set

You cannot claim the classifier works until you have measured it. This step creates the measuring stick.

The Action

Hand-label a few hundred representative inputs with the correct category yourself. This is your ground truth. Spread it across all categories so each one is tested.

Hand-label a few hundred varied inputs
Cover every category, including "other"
Keep this set fixed so results are comparable over time

The Check

Confirm your set includes examples of every category and some genuinely hard cases. A validation set that only contains easy inputs will overstate your accuracy.
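
The coverage part of this check can be automated. The sketch below assumes the validation set is a list of (text, label) pairs; coverage_report and the sample data are hypothetical. It cannot tell whether the set contains hard cases, which still requires judgment.

```python
from collections import Counter


def coverage_report(validation_set, labels, min_per_label=1):
    """Count examples per label and list labels that are under-covered or unknown."""
    counts = Counter(label for _, label in validation_set)
    missing = [label for label in labels if counts[label] < min_per_label]
    unknown = sorted(set(counts) - set(labels))  # likely typos made while hand-labeling
    return counts, missing, unknown


# Tiny illustrative sample; a real set would hold a few hundred items.
labels = ["billing", "bug", "feature_request", "other"]
validation_set = [
    ("Refund please", "billing"),
    ("App crashes on start", "bug"),
    ("Hi", "other"),
    ("Charged twice", "billing"),
    ("Button does nothing", "bugs"),  # deliberate labeling typo
]

counts, missing, unknown = coverage_report(validation_set, labels)
print("missing:", missing)
print("unknown:", unknown)
```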

Step 6: Measure and Fix

Run the classifier against the validation set and act on what you find.

The Action

Classify the whole validation set and compare to your labels. Compute accuracy per category, not just overall. Look at which categories get confused with which others.

Compute per-category accuracy
Examine the specific confusions
Tighten the definitions of confused categories and re-run
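
Both measurements can be sketched with the standard library alone. In this sketch, per-category accuracy means the fraction of items hand-labeled with a category that the classifier also assigned to it; the function names and sample labels are hypothetical.

```python
from collections import Counter


def per_category_accuracy(gold, predicted):
    """For each hand-assigned label, the fraction of its items the classifier got right."""
    totals = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: correct[label] / totals[label] for label in totals}


def confusions(gold, predicted):
    """Count each (true label, predicted label) pair where the two disagree."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


# Illustrative data only.
gold = ["billing", "billing", "bug", "bug", "other"]
predicted = ["billing", "bug", "bug", "bug", "other"]

print(per_category_accuracy(gold, predicted))
print(confusions(gold, predicted).most_common())
```

The confusion pairs point at the definitions to tighten: a frequent ("billing", "bug") pair suggests those two definitions overlap.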

The Check

Are the per-category numbers acceptable for your use? If one category is weak, sharpen its definition or add a clarifying example, then re-measure. This loop is the heart of the disciplined approach in What Reliable Zero-Shot Classifiers Have in Common.

Step 7: Deploy With Guardrails

Move from a tested prompt to something you can run on real volume safely.

The Action

Pin the model and prompt version, use low-randomness settings for stable output, log inputs and outputs, and route low-confidence or "other" results to human review where the stakes justify it.

Version-pin model and prompt together
Log everything for auditing
Route uncertain cases to a human
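
A minimal wrapper showing three of these guardrails together: version tags attached to every result, a log line per call, and a review flag. The version strings, the classify callable, and the routing rule are all assumptions for illustration. How to set low randomness depends on your model API, so it is left out here.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("classifier")

MODEL_VERSION = "example-model-2024-01"  # hypothetical pinned model identifier
PROMPT_VERSION = "v3"  # hypothetical prompt version tag


def classify_with_guardrails(text, classify, allowed):
    """Classify one input, log the call, and flag results that need human review."""
    label = classify(text)
    # Route to review when the output is off-list or falls in the catch-all bucket.
    needs_review = label not in allowed or label == "other"
    record = {
        "model_version": MODEL_VERSION,
        "prompt_version": PROMPT_VERSION,
        "input": text,
        "output": label,
        "needs_review": needs_review,
    }
    log.info(json.dumps(record))
    return record


allowed = {"billing", "bug", "other"}
result = classify_with_guardrails("Where is my invoice?", lambda t: "billing", allowed)
print(result["output"], result["needs_review"])
```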

The Check

Confirm you can reproduce the same output for the same input and that you have visibility into what the classifier is doing in production. The complete production picture is laid out in the end-to-end walkthrough of classifying with no labeled data.

Step 8: Set Up Ongoing Monitoring

A deployed classifier is not finished; it needs to be watched, because the text it sees will change over time.

The Action

Schedule a periodic re-measurement: pull a fresh sample of recent inputs, hand-label them, and run them through the classifier to check whether accuracy has held. Track the size of the "other" bucket over time, since a growing bucket signals that new kinds of input are arriving that your categories do not cover.

Re-measure accuracy against fresh samples on a schedule
Watch the "other" bucket as a drift indicator
Keep a log of inputs and outputs to investigate problems
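
The "other" bucket check can be sketched as a comparison between a baseline period and a recent one. The 10-percentage-point threshold below is an arbitrary assumption, not a recommendation, and the function names are hypothetical.

```python
def other_share(labels):
    """Fraction of classified items that landed in the "other" bucket."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "other") / len(labels)


def drift_alert(baseline_labels, recent_labels, max_increase=0.10):
    """True when the "other" share grew by more than max_increase since the baseline."""
    return other_share(recent_labels) - other_share(baseline_labels) > max_increase


# Illustrative label logs from two periods.
baseline = ["billing"] * 5 + ["bug"] * 4 + ["other"] * 1
recent = ["billing"] * 4 + ["bug"] * 3 + ["other"] * 3

print(other_share(baseline), other_share(recent), drift_alert(baseline, recent))
```

An alert here is a prompt to look at the recent "other" items and decide whether a new category is needed. It does not replace re-measuring accuracy on freshly hand-labeled samples.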

The Check

Confirm you have a recurring process, not a one-time check, and that someone owns it. A classifier that was accurate at launch can quietly degrade as the input distribution shifts, and the only way to catch that is to keep measuring. The disciplined version of this monitoring is part of What Reliable Zero-Shot Classifiers Have in Common.

Frequently Asked Questions

How long does this whole procedure take?

For a straightforward task with clear categories, you can get through drafting and small-batch testing in under an hour. Building the validation set is the most time-consuming part, but it is also what makes the result trustworthy. Budget more time there and less everywhere else.

Can I skip the validation set if the early tests look good?

You can, but you will be shipping a classifier you cannot vouch for. The small-batch test catches obvious breakage; only the validation set tells you the real accuracy. For anything beyond a throwaway experiment, build the set.

What do I do when accuracy is stuck on one category?

Look at what it gets confused with. Usually the two definitions overlap, or the category is genuinely subtle. Sharpen the definition first; if that is not enough, add a clarifying example for that category specifically, which moves you toward few-shot prompting for just the hard case.

Should I classify into one category or allow several?

Decide this at Step 1. If an input can genuinely belong to multiple categories, design for multiple labels and instruct accordingly. If it should belong to exactly one, enforce that in the output rule. Mixing the two assumptions mid-build causes confusion.

Key Takeaways

Follow the steps in order; each removes an error source the next step relies on
Define distinct, one-sentence categories with an explicit "other" before writing any prompt
Test on a small varied batch first, then constrain output to a clean parseable label
A hand-labeled validation set covering every category is what proves the classifier actually works
Deploy with version pinning, logging, low randomness, and human review for uncertain cases

Agency Script Editorial
