AGENCYSCRIPT
Sorting Text by Description Alone, One Step at a Time

Agency Script Editorial

Editorial Team

March 13, 2022 · 6 min read
Tags: zero-shot classification prompting, zero-shot classification prompting how to, zero-shot classification prompting guide, prompt engineering

There is a difference between understanding zero-shot classification and actually building a zero-shot classifier that works. This article is about the second. It is a step-by-step procedure you can follow from start to finish to produce a classifier that sorts your text into your categories with measurable accuracy. No theory dumps, no detours: just do this, then this, then this.

The procedure assumes you have a classification task in mind: a pile of text and a set of categories you want each item sorted into. Support tickets into types, feedback into themes, documents into topics — the steps are the same regardless. Each step has a concrete action and a way to tell whether you did it right before moving on.

Work through them in sequence. Skipping ahead is the most common reason classifiers come out unreliable, because each step removes a source of error that the next step depends on.

Step 1: Write Down Your Categories

Before touching a model, list the categories on paper. This forces clarity you will otherwise skip.

The Action

Write each category name followed by a one-sentence definition of what belongs in it. Add an "other" category for text that fits nowhere. If you cannot define a category in one clear sentence, it is too fuzzy and needs splitting or merging.

  • List every category name
  • Write a one-line definition for each
  • Add an explicit "other" or "none" category
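
The list above can also be kept as a small data structure so that later steps reuse the exact names and definitions. The sketch below is one possible way to do that. The category names, the definitions, and the helper `check_categories` are hypothetical examples for a support-ticket task. The one-sentence test is only a rough heuristic, and it cannot detect overlapping definitions, which still need a human read.

```python
# Hypothetical categories for a support-ticket task; replace with your own.
CATEGORIES = {
    "billing": "Questions or problems about charges, invoices, or refunds.",
    "bug": "Reports that the product behaves incorrectly or crashes.",
    "how_to": "Requests for instructions on using an existing feature.",
    "other": "Anything that does not fit the categories above.",
}


def check_categories(categories: dict[str, str]) -> list[str]:
    """Return a list of problems found in the category list (empty if none)."""
    problems = []
    if not ({"other", "none"} & set(categories)):
        problems.append('missing an explicit "other" or "none" category')
    for name, definition in categories.items():
        text = definition.strip()
        if not text:
            problems.append(f"{name}: definition is empty")
        # Rough one-sentence heuristic: at most one sentence-ending mark.
        elif sum(text.count(mark) for mark in ".!?") > 1:
            problems.append(f"{name}: definition looks longer than one sentence")
    return problems


print(check_categories(CATEGORIES))
```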

The Check

Read your definitions and ask whether any two overlap. If "complaint" and "negative feedback" could both apply to the same text, redefine them until they are distinct. Distinct categories are the precondition for everything that follows, as explained in the from-scratch introduction to zero-shot classification.

Step 2: Draft the Prompt

Now turn the categories into an instruction the model can follow.

The Action

Write a prompt with four parts in order: the task ("Classify the following text into exactly one category"), the labeled list with definitions, a placeholder for the input, and a strict output instruction ("Respond with only the category name").

  • State the task plainly
  • Include the labels with their definitions
  • End with a tight output format rule
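
As an illustration, the four parts can be assembled by a short function. This is a minimal sketch, not a prescribed template: the function name `build_prompt`, the section headers "Categories:" and "Text:", and the sample categories are assumptions. The task and output sentences are the ones quoted above.

```python
def build_prompt(categories: dict[str, str], text: str) -> str:
    """Assemble the four parts in order: task, labels, input, output rule."""
    label_lines = "\n".join(
        f"- {name}: {definition}" for name, definition in categories.items()
    )
    return (
        "Classify the following text into exactly one category.\n\n"
        f"Categories:\n{label_lines}\n\n"
        f"Text:\n{text}\n\n"
        "Respond with only the category name."
    )


# Hypothetical categories, used only to show the resulting prompt.
example_categories = {
    "billing": "Questions or problems about charges, invoices, or refunds.",
    "other": "Anything that does not fit the categories above.",
}
prompt = build_prompt(example_categories, "I was charged twice this month.")
print(prompt)
```

Printing the prompt and reading it end to end is a convenient way to run the check described next.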

The Check

Read the prompt as if you were the model. Is it obvious what to do, what the options are, and how to answer? If anything is ambiguous, tighten it now. Ambiguity here becomes errors later.

Step 3: Test on a Handful of Inputs

Do not classify everything yet. Run a small batch first.

The Action

Pick five to ten varied inputs, including at least one you expect to be tricky. Run each through the prompt and look at the answers.

  • Choose inputs that span your categories
  • Include a deliberately ambiguous case
  • Read every output, not just the count
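
A small-batch run can be as simple as a loop that prints every raw answer next to its input. In the sketch below, `run_small_batch` is a hypothetical helper and `fake_classify` is a stand-in for a real model call, included only so the example runs without any API. The sample inputs are invented.

```python
from typing import Callable


def run_small_batch(
    inputs: list[str], classify: Callable[[str], str]
) -> list[tuple[str, str]]:
    """Run each input through the classifier and print every raw output."""
    results = []
    for text in inputs:
        raw = classify(text)
        results.append((text, raw))
        # repr() makes stray whitespace, quotes, or explanations visible.
        print(f"{text[:60]!r} -> {raw!r}")
    return results


def fake_classify(text: str) -> str:
    """Stand-in for a model call; swap in your own client here."""
    return "billing" if "charge" in text.lower() else "other"


batch = [
    "I was charged twice this month.",
    "The export button does nothing.",
    "Love the product, but the invoice confused me.",  # deliberately ambiguous
]
results = run_small_batch(batch, fake_classify)
```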

The Check

Did the model return only the label, in the expected format? Did the obvious cases come out right? If the format is off, fix the output instruction. If easy cases are wrong, your definitions need work. This early check catches most problems cheaply, before they scale.

Step 4: Constrain and Clean the Output

Make the output reliably machine-readable so you can use it at scale.

The Action

If you saw any stray explanations or formatting variation, tighten the output rule — specify the exact label spelling and that nothing else should appear. For programmatic use, ask for structured output like a JSON field with the label.

  • Pin the exact allowed label values
  • Forbid commentary or hedging
  • Use structured output for automated pipelines
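
On the receiving side, a strict parser keeps malformed answers out of the pipeline. The sketch below assumes either a bare label or a JSON object with a single `label` field. That field name, the label set, and the helper `parse_label` are assumptions made for illustration. It tolerates only case, surrounding whitespace, quotes, and a trailing period, and rejects everything else.

```python
import json
from typing import Optional

ALLOWED_LABELS = {"billing", "bug", "how_to", "other"}  # hypothetical labels


def parse_label(raw: str, allowed: set[str] = ALLOWED_LABELS) -> Optional[str]:
    """Return a valid label from the model's raw reply, or None if unusable."""
    text = raw.strip()
    # Accept structured output of the assumed form {"label": "..."}.
    if text.startswith("{"):
        try:
            text = str(json.loads(text).get("label", ""))
        except json.JSONDecodeError:
            return None
    candidate = text.strip().strip("\"'.").lower()
    # Anything not in the pinned list counts as a parse failure.
    return candidate if candidate in allowed else None


print(parse_label(" Billing.\n"))
print(parse_label('{"label": "bug"}'))
print(parse_label("I think this is billing"))
```

Returning `None` instead of guessing lets the caller count parse failures and decide what to do with them.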

The Check

Run the small batch again and confirm every output is a clean, parseable label from your list. Unconstrained output is a top failure mode, covered in depth in Eight Quiet Ways Zero-Shot Classifiers Go Wrong.

Step 5: Build a Validation Set

You cannot claim the classifier works until you have measured it. This step creates the measuring stick.

The Action

Hand-label a few hundred representative inputs with the correct category yourself. This is your ground truth. Spread it across all categories so each one is tested.

  • Hand-label a few hundred varied inputs
  • Cover every category, including "other"
  • Keep this set fixed so results are comparable over time
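
Coverage of the hand-labeled set is easy to verify with a count per category. The sketch below is one way to do it. The helper name `coverage_gaps` and the default minimum of 20 examples per category are assumptions, not figures from this procedure, so pick a threshold that suits your task.

```python
from collections import Counter


def coverage_gaps(
    labeled: list[tuple[str, str]],
    categories: list[str],
    min_per_category: int = 20,  # assumed threshold; adjust for your task
) -> dict[str, int]:
    """Return each under-represented category with its current example count."""
    counts = Counter(label for _, label in labeled)
    return {
        name: counts.get(name, 0)
        for name in categories
        if counts.get(name, 0) < min_per_category
    }


# Tiny invented sample of (text, hand-assigned label) pairs.
sample = [
    ("I was charged twice.", "billing"),
    ("Refund not received.", "billing"),
    ("Invoice shows wrong tax.", "billing"),
    ("App crashes on login.", "bug"),
]
gaps = coverage_gaps(sample, ["billing", "bug", "other"], min_per_category=2)
print(gaps)
```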

The Check

Confirm your set includes examples of every category and some genuinely hard cases. A validation set that only contains easy inputs will overstate your accuracy.

Step 6: Measure and Fix

Run the classifier against the validation set and act on what you find.

The Action

Classify the whole validation set and compare to your labels. Compute accuracy per category, not just overall. Look at which categories get confused with which others.

  • Compute per-category accuracy
  • Examine the specific confusions
  • Tighten the definitions of confused categories and re-run
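
The measurements above need nothing beyond the standard library. In this sketch, "accuracy per category" is taken to mean the share of items with a given true label that were predicted correctly, which is one reasonable reading of the step. The function names and the sample labels are hypothetical.

```python
from collections import Counter


def per_category_accuracy(truth: list[str], predicted: list[str]) -> dict[str, float]:
    """Share of items with each true label that the classifier got right."""
    totals, correct = Counter(), Counter()
    for true_label, pred_label in zip(truth, predicted):
        totals[true_label] += 1
        if true_label == pred_label:
            correct[true_label] += 1
    return {label: correct[label] / totals[label] for label in totals}


def top_confusions(truth: list[str], predicted: list[str], n: int = 5):
    """Most frequent (true label, predicted label) pairs among the errors."""
    errors = Counter(
        (t, p) for t, p in zip(truth, predicted) if t != p
    )
    return errors.most_common(n)


# Invented example: one billing item was misread as a bug.
truth = ["billing", "billing", "bug", "bug", "other"]
predicted = ["billing", "bug", "bug", "bug", "other"]

accuracy = per_category_accuracy(truth, predicted)
confusions = top_confusions(truth, predicted)
print(accuracy)
print(confusions)
```

The confusion pairs point directly at the definitions to tighten: here, the boundary between "billing" and "bug".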

The Check

Are the per-category numbers acceptable for your use? If one category is weak, sharpen its definition or add a clarifying example, then re-measure. This loop is the heart of the disciplined approach in What Reliable Zero-Shot Classifiers Have in Common.

Step 7: Deploy With Guardrails

Move from a tested prompt to something you can run on real volume safely.

The Action

Pin the model and prompt version, use low-randomness settings for stable output, log inputs and outputs, and route low-confidence or "other" results to human review where the stakes justify it.

  • Version-pin model and prompt together
  • Log everything for auditing
  • Route uncertain cases to a human
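
One way to wire these guardrails together is a pinned configuration object plus a routing function that logs every decision. Everything named in this sketch is hypothetical: the model identifier, the prompt version string, and the rule that treats unparseable, unknown, or "other" labels as uncertain. It does not call any real model API.

```python
import json
import logging
from dataclasses import asdict, dataclass
from typing import Optional

logger = logging.getLogger("classifier")


@dataclass(frozen=True)
class ClassifierConfig:
    """Model and prompt pinned together so every result is traceable."""
    model: str = "example-model-2024-01-01"  # hypothetical pinned model ID
    prompt_version: str = "v3"               # hypothetical prompt version
    temperature: float = 0.0                 # low randomness for stable output


def route(text: str, label: Optional[str], config: ClassifierConfig,
          allowed: set[str]) -> str:
    """Log the result and return 'auto' or 'human_review'."""
    # Assumed rule: unparseable, unknown, or "other" labels are uncertain.
    uncertain = label is None or label not in allowed or label == "other"
    destination = "human_review" if uncertain else "auto"
    logger.info(json.dumps({
        "input": text,
        "label": label,
        "destination": destination,
        **asdict(config),
    }))
    return destination


config = ClassifierConfig()
allowed = {"billing", "bug", "other"}
print(route("I was charged twice.", "billing", config, allowed))
print(route("Hello?", "other", config, allowed))
```

Writing the model and prompt version into each log record means any later audit can tell which configuration produced a given label.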

The Check

Confirm you can reproduce the same output for the same input and that you have visibility into what the classifier is doing in production. The complete production picture is laid out in the end-to-end walkthrough of classifying with no labeled data.

Step 8: Set Up Ongoing Monitoring

A deployed classifier is not finished; it needs to be watched, because the text it sees will change over time.

The Action

Schedule a periodic re-measurement: pull a fresh sample of recent inputs, hand-label them, and run them through the classifier to check whether accuracy has held. Track the size of the "other" bucket over time, since a growing bucket signals that new kinds of input are arriving that your categories do not cover.

  • Re-measure accuracy against fresh samples on a schedule
  • Watch the "other" bucket as a drift indicator
  • Keep a log of inputs and outputs to investigate problems
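
Tracking the "other" bucket can be done from the logged labels alone. The sketch below assumes each log record can be reduced to a `(period, label)` pair. The period strings, the helper names, and the 0.05 tolerance are illustrative assumptions rather than recommended values.

```python
from collections import Counter


def other_share_by_period(
    records: list[tuple[str, str]], other_label: str = "other"
) -> dict[str, float]:
    """Fraction of items labeled 'other' in each period, in sorted period order."""
    totals, others = Counter(), Counter()
    for period, label in records:
        totals[period] += 1
        if label == other_label:
            others[period] += 1
    return {period: others[period] / totals[period] for period in sorted(totals)}


def is_growing(shares: dict[str, float], tolerance: float = 0.05) -> bool:
    """True if the latest share exceeds the earliest by more than the tolerance."""
    values = list(shares.values())
    return len(values) >= 2 and values[-1] - values[0] > tolerance


# Invented log extract: (period, predicted label).
records = [
    ("2024-01", "billing"), ("2024-01", "other"),
    ("2024-01", "bug"), ("2024-01", "bug"),
    ("2024-02", "other"), ("2024-02", "other"),
    ("2024-02", "bug"), ("2024-02", "billing"),
]
shares = other_share_by_period(records)
print(shares)
print(is_growing(shares))
```

This covers only the drift signal. The accuracy re-measurement still requires fresh hand labels.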

The Check

Confirm you have a recurring process, not a one-time check, and that someone owns it. A classifier that was accurate at launch can quietly degrade as the input distribution shifts, and the only way to catch that is to keep measuring. The disciplined version of this monitoring is part of What Reliable Zero-Shot Classifiers Have in Common.

Frequently Asked Questions

How long does this whole procedure take?

For a straightforward task with clear categories, you can get through drafting and small-batch testing in under an hour. Building the validation set is the most time-consuming part, but it is also what makes the result trustworthy. Budget more time there and less everywhere else.

Can I skip the validation set if the early tests look good?

You can, but you will be shipping a classifier you cannot vouch for. The small-batch test catches obvious breakage; only the validation set tells you the real accuracy. For anything beyond a throwaway experiment, build the set.

What do I do when accuracy is stuck on one category?

Look at what it gets confused with. Usually the two definitions overlap, or the category is genuinely subtle. Sharpen the definition first; if that is not enough, add a clarifying example for that category specifically, which moves you toward few-shot for just the hard case.

Should I classify into one category or allow several?

Decide this at Step 1. If an input can genuinely belong to multiple categories, design for multiple labels and instruct accordingly. If it should belong to exactly one, enforce that in the output rule. Mixing the two assumptions mid-build causes confusion.
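
Whichever assumption you choose, it can be enforced in one place when parsing the output. This sketch assumes multi-label answers arrive as a comma-separated list, which is one possible output rule among many. The helper `parse_labels` and the label set are hypothetical.

```python
from typing import Optional


def parse_labels(raw: str, allowed: set[str], multi: bool) -> Optional[list[str]]:
    """Parse a comma-separated reply; enforce exactly one label unless multi."""
    parts = [part.strip().lower() for part in raw.split(",") if part.strip()]
    if not parts or any(part not in allowed for part in parts):
        return None
    if not multi and len(parts) != 1:
        return None
    # Deduplicate and sort so equal answers compare equal.
    return sorted(set(parts))


allowed = {"billing", "bug", "other"}  # hypothetical labels
print(parse_labels("billing, bug", allowed, multi=True))
print(parse_labels("billing, bug", allowed, multi=False))
```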

Key Takeaways

  • Follow the steps in order; each removes an error source the next step relies on
  • Define distinct, one-sentence categories with an explicit "other" before writing any prompt
  • Test on a small varied batch first, then constrain output to a clean parseable label
  • A hand-labeled validation set covering every category is what proves the classifier actually works
  • Deploy with version pinning, logging, low randomness, and human review for uncertain cases
