A Credible First Untrained Classifier in One Sitting

The appeal of zero-shot classification is that you can go from an idea to a working classifier in an afternoon, with no labeled data and no training. The trap is that you can also go from an idea to a plausible-looking classifier that quietly mislabels half your data, also in an afternoon. The difference between those two outcomes is a small amount of discipline applied in the right order.

This walkthrough takes you from nothing to a first real result, the kind you can defend to a colleague rather than just demo. It covers the prerequisites you genuinely need, the build steps in sequence, and the validation step that separates a working classifier from a hopeful one. It is deliberately minimal: the fastest credible path, not the most elaborate.

Credible is the operative word. Anyone can get a classifier to return labels. The goal here is a classifier whose error rate you actually know, because a number you can defend is worth more than a demo that looks impressive and falls apart on real data.

Before You Start

The prerequisites that matter

You need three things: a set of categories you can describe clearly, access to a capable language model, and a small batch of real example texts to test against. You do not need labeled training data, a machine learning background, or specialized infrastructure. That is the whole point.

The one prerequisite people skip

You need to confirm a human can do the task from the text alone. Read ten of your real examples and label them yourself. If you cannot decide confidently, the model will not either, and no prompt will rescue a task where the signal is missing. This check takes minutes and saves hours.

Clearly describable, mutually exclusive categories
Access to a capable model
A small batch of real examples
Confirmation that a human can label from the text alone

Step One: Define Your Categories

Write real descriptions

For each category, write one or two sentences stating what belongs. A bare label name is not enough; the model needs a boundary to reason about. Make sure no two categories can both legitimately apply to the same text, because overlap is a leading cause of poor results.

Start small

Begin with a handful of categories, not a dozen. Long lists strain the model's attention and introduce ordering bias. You can always split categories later once the basic pipeline works. The framework in Naming the Stages That Turn Raw Labels Into Reliable Sorting treats this Define step as the foundation everything else inherits.

Step Two: Write the Prompt

The minimal structure

Your prompt needs four parts: a brief role, the category list with descriptions, an instruction to return exactly one label from that list, and the text to classify. That is enough for a first result. Constraining output to the exact label set prevents the model from inventing categories.
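As a rough illustration, the four parts can be assembled with plain string formatting. This is a minimal sketch, not a prescribed template: the function name `build_prompt`, the role wording, and the three example categories are all assumptions made up for this example.

```python
def build_prompt(categories: dict[str, str], text: str) -> str:
    """Assemble the four parts: role, category list, output instruction, text."""
    # One line per category: the label plus its one- or two-sentence description.
    category_lines = "\n".join(
        f"- {label}: {description}" for label, description in categories.items()
    )
    # The exact label set the model is allowed to return.
    allowed = ", ".join(categories)
    return (
        "You are a careful text classifier.\n\n"
        "Categories:\n"
        f"{category_lines}\n\n"
        f"Return exactly one label from this list and nothing else: {allowed}\n\n"
        "Text to classify:\n"
        f"{text}"
    )


# Hypothetical categories, for illustration only.
CATEGORIES = {
    "billing": "Questions or complaints about charges, invoices, or refunds.",
    "technical": "Reports that the product is broken or behaving unexpectedly.",
    "other": "Anything that fits neither of the categories above.",
}

prompt = build_prompt(CATEGORIES, "I was charged twice this month.")
```

Keeping the categories in a dictionary means the descriptions and the allowed-label list are generated from the same source, so they cannot drift apart when you edit one of them.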

Add a rationale if categories are subtle

If your categories are at all ambiguous, ask the model to give a one-line reason before its label. This grounds the decision in the text and lifts accuracy on hard cases, a pattern that recurs throughout Classifying Support Tickets Without a Single Labeled Example. For easy categories at high volume, skip it to save tokens.
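If you do ask for a reason before the label, the response has to be split back into its two parts. The sketch below assumes a two-line output format ("Reason:" then "Label:") that you would request in the prompt; that format, the instruction text, and the function name `parse_reasoned_output` are illustrative assumptions rather than a standard.

```python
from typing import Optional

# Hypothetical instruction to append to the prompt when a rationale is wanted.
RATIONALE_INSTRUCTION = (
    "First write one line starting with 'Reason:' that cites the text. "
    "Then write one line starting with 'Label:' followed by exactly one allowed label."
)


def parse_reasoned_output(raw: str, allowed: set[str]) -> tuple[str, Optional[str]]:
    """Split a 'Reason: ... / Label: ...' response into (reason, label).

    The label is None when it is missing or not in the allowed set.
    """
    reason: str = ""
    label: Optional[str] = None
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        key = key.strip().lower()
        if key == "reason":
            reason = value.strip()
        elif key == "label":
            candidate = value.strip().lower()
            label = candidate if candidate in allowed else None
    return reason, label


reason, label = parse_reasoned_output(
    "Reason: The customer mentions a double charge.\nLabel: Billing",
    {"billing", "technical", "other"},
)
```

Returning None for an unrecognized label, rather than passing it through, keeps invented categories out of downstream data even when the rationale is present.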

Step Three: Run It on Real Examples

Start with your test batch

Run the prompt over the small batch of real examples you gathered. Read the outputs yourself. At this stage you are looking for obvious failures, such as invented labels or systematic confusion between two categories, not a precise accuracy number.

Fix the obvious problems first

If the model invents labels, tighten the output constraint. If it confuses two categories, sharpen their descriptions to contrast them. These two fixes resolve most early problems and cost nothing but a prompt edit.
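Reading outputs by eye is the main check at this stage, but a quick tally can make invented labels stand out. The following is a minimal sketch; the function name `summarize_outputs` and the sample outputs are hypothetical.

```python
from collections import Counter


def summarize_outputs(outputs: list[str], allowed: set[str]) -> tuple[Counter, Counter]:
    """Count how often each allowed label and each invented label appears."""
    valid: Counter = Counter()
    invented: Counter = Counter()
    for raw in outputs:
        label = raw.strip()  # ignore stray whitespace around the label
        if label in allowed:
            valid[label] += 1
        else:
            invented[label] += 1
    return valid, invented


# Hypothetical raw model outputs from a small test batch.
valid, invented = summarize_outputs(
    ["billing", "technical", "Refund request", "billing "],
    {"billing", "technical", "other"},
)
```

A non-empty `invented` counter suggests the output constraint needs tightening. A `valid` counter heavily skewed toward one label is not proof of a problem, but it is a hint about which category descriptions to reread first.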

Step Four: Validate Before You Trust It

Build a small audit sample

Hand-label a few hundred real examples purely for measurement, and never reuse them as examples in the prompt. Compare the model's labels to yours and compute precision and recall per category, not just overall accuracy. This is the step that makes the result credible, and it is detailed in Reading the Signal When Your Classifier Never Saw Training Data.
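Per-category precision and recall can be computed directly from two parallel lists, your hand labels and the model's labels. This is a plain-Python sketch using the standard definitions; the function name `per_category_metrics` and the five-item example are assumptions for illustration, and a real audit sample would be far larger.

```python
from typing import Optional


def per_category_metrics(
    gold: list[str], predicted: list[str]
) -> dict[str, dict[str, Optional[float]]]:
    """Precision, recall, and support for every category seen in either list."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    metrics: dict[str, dict[str, Optional[float]]] = {}
    for category in sorted(set(gold) | set(predicted)):
        tp = sum(1 for g, p in pairs if g == category and p == category)
        fp = sum(1 for g, p in pairs if g != category and p == category)
        fn = sum(1 for g, p in pairs if g == category and p != category)
        metrics[category] = {
            # Precision: of the items the model gave this label, how many were right.
            "precision": tp / (tp + fp) if tp + fp else None,
            # Recall: of the items that truly belong here, how many the model found.
            "recall": tp / (tp + fn) if tp + fn else None,
            # Support: how many hand-labeled items belong to this category.
            "support": tp + fn,
        }
    return metrics


# Tiny hypothetical audit sample.
gold = ["billing", "billing", "technical", "technical", "other"]
predicted = ["billing", "technical", "technical", "technical", "other"]
metrics = per_category_metrics(gold, predicted)
```

In this toy sample overall accuracy is 4 out of 5, yet recall for "billing" is only 0.5, which is the kind of gap that a single accuracy figure hides.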

Decide based on the numbers

If a category underperforms, loop back and sharpen its description. If it stays weak after that, consider adding a few examples, which moves you to few-shot, as covered in Deciding Among No Labels, Few Labels, and Fine-Tuning. Otherwise, you have a working classifier whose error rate you can defend.
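One way to make "underperforms" concrete is to pick a floor for precision and recall and list every category that falls below it. The sketch below assumes you choose the floor yourself; the value 0.85 and the metric numbers are placeholders for illustration, not recommendations or measurements.

```python
def weak_categories(metrics: dict[str, dict[str, float]], floor: float) -> list[str]:
    """Return categories whose precision or recall falls below the floor."""
    return sorted(
        name
        for name, scores in metrics.items()
        if min(scores["precision"], scores["recall"]) < floor
    )


# Placeholder numbers, not real measurements.
example_metrics = {
    "billing": {"precision": 0.95, "recall": 0.90},
    "technical": {"precision": 0.70, "recall": 0.92},
    "other": {"precision": 0.88, "recall": 0.86},
}

to_revisit = weak_categories(example_metrics, floor=0.85)
```

Categories in `to_revisit` are the ones whose descriptions to sharpen first. If the same names keep appearing after a rewrite and a fresh audit, that is the point at which adding a few examples becomes worth considering.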

Common First-Attempt Mistakes

Treating plausible as correct

The most seductive trap is glancing at outputs that look reasonable and declaring success. Plausible labels and correct labels are different, and only an audit sample tells them apart. Resist the urge to ship on the strength of a good-looking demo, because the gap between the two is exactly where client complaints live.

Starting with too many categories

Beginners often define a dozen categories at once, which strains the model and introduces ordering bias. Start with a handful, prove the pipeline works, and split categories later. A small working classifier beats a large broken one every time.

Forgetting to constrain the output

Without an explicit instruction to return only an allowed label, the model invents new ones and your data needs cleaning. Constraining output to the exact set is a one-line fix that prevents an entire class of downstream mess. Make it part of your first prompt, not a later patch.
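The prompt instruction is the first line of defense, and a small check on the way out is a cheap second one. This sketch assumes light normalization (whitespace, case, surrounding quotes or a trailing period) is acceptable for your labels; the function name `enforce_label` is hypothetical.

```python
from typing import Optional


def enforce_label(raw: str, allowed: list[str]) -> Optional[str]:
    """Map a raw model output onto an allowed label, or None if it does not match."""
    # Tolerate surrounding whitespace, quotes, and a trailing period.
    cleaned = raw.strip().strip(".\"'").lower()
    for label in allowed:
        if cleaned == label.lower():
            return label  # return the canonical spelling from the allowed list
    return None  # anything else is treated as an invented label


ALLOWED = ["billing", "technical", "other"]  # hypothetical label set
```

For example, `enforce_label(' "Billing". ', ALLOWED)` yields `"billing"`, while `enforce_label("refund request", ALLOWED)` yields `None`, which you can log or send back for a second attempt instead of letting it into your data.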

Audit before you trust; plausible is not correct
Begin with few categories and split later
Constrain output to the exact allowed labels from the start

Where to Go After Your First Result

Hardening for production

A first result is a prototype, not a production system. Before it drives real decisions, add a path for low-confidence cases, schedule a re-audit to catch drift, and monitor cost against volume. The full pre-launch review lives in Pre-Flight Items Before You Trust a Labelless Classifier.
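Monitoring cost against volume comes down to simple arithmetic once you know your token counts and your provider's rates. The sketch below assumes per-million-token pricing with separate input and output rates; every number in the example is a placeholder to be replaced with your own measurements and your provider's actual prices.

```python
def estimate_monthly_cost(
    items_per_month: int,
    input_tokens_per_item: int,
    output_tokens_per_item: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate monthly spend from volume, token counts, and per-million-token prices."""
    input_cost = items_per_month * input_tokens_per_item * input_price_per_million / 1_000_000
    output_cost = items_per_month * output_tokens_per_item * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Placeholder figures only: 100,000 items, 400 input and 5 output tokens each,
# priced at 1.00 and 4.00 currency units per million tokens.
cost = estimate_monthly_cost(100_000, 400, 5, 1.00, 4.00)
```

With these placeholder figures the input side dominates, because a label-only response is a handful of tokens. Asking for a rationale raises the output token count per item, which is one reason to skip it for easy categories at high volume.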

Scaling the structure

As the classifier grows, the loose steps here harden into a repeatable pipeline with named stages, which makes debugging a controlled experiment rather than guesswork. That structure is the subject of Naming the Stages That Turn Raw Labels Into Reliable Sorting. Adopting it early means you never outgrow your own process.

Knowing when to escalate

If validation shows a category that stays weak after you have sharpened its description, that is the honest signal to add examples or reconsider the approach. Knowing when to leave zero-shot behind is part of using it well, and your audit numbers are what tell you the moment has come.

Frequently Asked Questions

How long does the first working result take?

For a small project, an afternoon. Writing the prompt takes minutes; the validation step, hand-labeling an audit sample, is the longest part at a couple of hours. That validation time is exactly what separates a credible result from a hopeful one.

Do I need to know machine learning?

No. Zero-shot classification needs clear thinking about categories and a willingness to measure, not a machine learning background. The skills that matter are writing precise category descriptions and reading per-category metrics honestly.

Which model should I start with?

Start with a capable general model via a raw API, the simplest possible setup. Match model strength to category difficulty: easy categories run fine on smaller models, while subtle ones benefit from stronger ones. The tool survey covers when to graduate beyond a raw API.

What is the most common beginner mistake?

Skipping validation. The outputs look plausible, the deadline looms, and the classifier ships without anyone knowing its real error rate. Always hand-label an audit sample before you trust the result, no matter how good the outputs look at a glance.

Key Takeaways

You can reach a credible first result in an afternoon with no labeled training data, just clear categories and a capable model.
Confirm a human can label your examples from the text alone before building; missing signal cannot be fixed by any prompt.
Write real category descriptions, keep the list small, and constrain output to the exact allowed labels.
Add a one-line rationale for subtle categories to ground decisions, and skip it for easy high-volume tasks to save tokens.
Validate against a hand-labeled audit sample with per-category metrics; this step is what makes the result defensible.

Agency Script Editorial
