Classifying Support Tickets Without a Single Labeled Example

Zero-shot classification prompting asks a language model to sort text into categories it was never explicitly trained to recognize, using nothing but a well-written instruction and the label names themselves. No training data, no fine-tuning, no labeled examples. You describe the categories in plain language, hand the model a piece of text, and it returns a label. When it works, it feels like magic. When it fails, the failure is usually invisible until it has already polluted a downstream report.

The honest way to understand this technique is through specific cases rather than abstract claims. The same prompt structure that classifies support tickets flawlessly can collapse on product reviews because the category boundaries are fuzzier. The difference is rarely the model. It is the clarity of the categories, the framing of the instruction, and whether the text actually contains the signal you are asking the model to detect.

Below are seven scenarios drawn from realistic agency and operations work. Each one shows what the prompt looked like, what the model did, and the single factor that explains the outcome. Treat them as a diagnostic library: when your own classifier misbehaves, one of these patterns is usually the culprit.

Sorting Inbound Support Tickets by Department

What the prompt did

A help desk routes incoming messages to billing, technical support, or account management. The prompt named the three departments, gave a one-sentence description of each, and instructed the model to return exactly one label. With clear, mutually exclusive categories and short, intent-heavy tickets, accuracy on a hand-checked sample sat comfortably above ninety percent.

Why it worked

Support tickets are written by people trying to be understood. They state their problem directly, which means the classification signal is dense and near the surface. The categories were also genuinely distinct, so the model rarely had to guess between two plausible answers.

Distinct, non-overlapping categories
Short text with explicit intent
A description per label, not just a bare name
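As a rough illustration of the three ingredients above, here is a minimal sketch of how such a routing prompt might be assembled. The department descriptions, the function names build_routing_prompt and parse_label, and the exact instruction wording are all assumptions made for this example, not the prompt used in the scenario.

```python
from typing import Iterable, Optional

# Hypothetical one-sentence descriptions; replace with wording that fits your help desk.
DEPARTMENTS = {
    "billing": "Charges, invoices, refunds, and payment methods.",
    "technical_support": "Bugs, errors, outages, and how-to questions about the product.",
    "account_management": "Plan changes, seats, contracts, and account ownership.",
}


def build_routing_prompt(ticket: str, departments: dict = DEPARTMENTS) -> str:
    """Assemble a zero-shot prompt: a description per label, exactly one label out."""
    label_lines = "\n".join(f"- {name}: {desc}" for name, desc in departments.items())
    return (
        "Classify the support ticket into exactly one department.\n\n"
        f"Departments:\n{label_lines}\n\n"
        f"Ticket:\n{ticket.strip()}\n\n"
        "Answer with the department name only."
    )


def parse_label(raw: str, allowed: Iterable[str]) -> Optional[str]:
    """Return the label only if the reply is exactly one allowed name, else None."""
    cleaned = raw.strip().lower()
    return cleaned if cleaned in set(allowed) else None


if __name__ == "__main__":
    print(build_routing_prompt("I was charged twice for my March invoice."))
```

The parser is deliberately strict: a reply that hedges between two departments comes back as None, which is easier to spot in a report than a silently wrong label.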

Tagging Product Reviews by Sentiment Nuance

What the prompt did

The same structure was pointed at product reviews, asking for one of five labels: delighted, satisfied, neutral, frustrated, or angry. Accuracy dropped sharply. The model collapsed delighted and satisfied together and frequently mislabeled sarcasm as positive.

Why it struggled

Five sentiment grades are not mutually exclusive in any crisp way. The boundary between satisfied and delighted is a matter of degree, and reasonable humans disagree. The lesson is not that zero-shot fails at sentiment. It is that over-fine categories invite disagreement the model cannot resolve without examples. Collapsing to three labels restored reliable performance. This is the same boundary-clarity problem that shows up across Naming the Stages That Turn Raw Labels Into Reliable Sorting.
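One way to see why collapsing helps is to compare how often two human annotators agree before and after the merge. The sketch below assumes a particular five-to-three mapping and uses five invented annotations purely for illustration; neither comes from the scenario itself.

```python
# Assumed mapping from the five fine-grained grades to three coarse labels.
FIVE_TO_THREE = {
    "delighted": "positive",
    "satisfied": "positive",
    "neutral": "neutral",
    "frustrated": "negative",
    "angry": "negative",
}


def collapse(label: str) -> str:
    """Map a fine-grained sentiment grade to its coarse label."""
    return FIVE_TO_THREE[label.strip().lower()]


def agreement(a: list, b: list) -> float:
    """Fraction of items on which two label lists agree."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Invented annotations for five reviews, for illustration only.
    annotator_a = ["delighted", "satisfied", "neutral", "frustrated", "angry"]
    annotator_b = ["satisfied", "satisfied", "neutral", "angry", "angry"]
    print(agreement(annotator_a, annotator_b))
    print(agreement([collapse(x) for x in annotator_a],
                    [collapse(x) for x in annotator_b]))
```

If human agreement jumps after the merge, the fine-grained boundary was a matter of degree, and the model was being asked to settle a dispute that people could not settle either.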

Flagging Policy Violations in User-Generated Content

What the prompt did

A moderation pass asked the model to label posts as compliant or violating, with a list of what counted as a violation. It caught obvious cases but missed coded language and let through borderline content that depended on context the prompt never supplied.

Why partial success is still useful

Even at imperfect recall, the classifier worked as a triage layer. It cleared the unambiguous traffic and escalated everything uncertain to a human. The takeaway: zero-shot classification earns its keep as a first filter even when it cannot be the final word. Pair it with the measurement discipline in Reading the Signal When Your Classifier Never Saw Training Data so you know exactly what slips through.
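A triage layer of this kind can be sketched in a few lines. The version below assumes the prompt offers the model a third answer, unsure, and treats any reply outside the allowed set as uncertain too. The classify argument is a hypothetical callable standing in for whatever model call you use, and the queue names are invented for the example.

```python
from typing import Callable

# Assumed instruction appended to the moderation prompt so the model can abstain.
ABSTAIN_INSTRUCTION = (
    "Answer with exactly one word: compliant, violating, or unsure. "
    "Use unsure whenever the decision depends on context you do not have."
)


def triage(post: str, classify: Callable[[str], str]) -> str:
    """Return 'auto_clear', 'auto_flag', or 'human_review' for one post."""
    answer = classify(post).strip().lower()
    if answer == "compliant":
        return "auto_clear"
    if answer == "violating":
        return "auto_flag"
    # 'unsure', empty replies, and anything off-list all escalate to a person.
    return "human_review"


if __name__ == "__main__":
    # Stub model for demonstration only; a real classify() would call your model.
    stub = lambda post: "unsure" if "?" in post else "compliant"
    print(triage("Great product, thanks!", stub))
    print(triage("you know what to do with them, right?", stub))
```

The design choice worth noting is the default branch: anything the code does not recognize goes to a human rather than being cleared.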

Routing Sales Leads by Buying Stage

What the prompt did

Leads were classified as awareness, consideration, or decision stage from the text of their inbound inquiry. Performance was middling until the prompt added a forced rationale: the model had to state one sentence of evidence before committing to a label.

Why the rationale helped

Asking for evidence before the answer pushes the model to ground its decision in the actual text rather than a surface guess. This single change moved accuracy up noticeably on ambiguous leads, at the cost of more tokens and slightly slower responses.

Require a short rationale before the label
Accept higher latency for higher accuracy on hard cases
Keep the rationale out of the final stored output if you only need the label
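The three points above could be implemented along these lines. This is a sketch under assumptions: the Evidence and Label line format, the instruction wording, and the function name split_reply are illustrative choices, not the format used in the scenario.

```python
import re
from typing import Optional, Tuple

STAGES = ("awareness", "consideration", "decision")

# Assumed output format: evidence first, label last, each on its own line.
RATIONALE_INSTRUCTION = (
    "First write one sentence of evidence taken from the inquiry, prefixed with "
    "'Evidence:'. Then, on a new line, write 'Label:' followed by exactly one of: "
    + ", ".join(STAGES) + "."
)

_LABEL_LINE = re.compile(r"^label:\s*(\w+)\s*\.?\s*$", re.IGNORECASE | re.MULTILINE)


def split_reply(reply: str) -> Tuple[Optional[str], str]:
    """Return (label, rationale); label is None if it is missing or off-list."""
    matches = _LABEL_LINE.findall(reply)
    label = matches[-1].lower() if matches else None
    if label not in STAGES:
        label = None
    # Strip the label line and the 'Evidence:' prefix so only the sentence remains.
    rationale = _LABEL_LINE.sub("", reply).replace("Evidence:", "", 1).strip()
    return label, rationale


if __name__ == "__main__":
    label, rationale = split_reply(
        "Evidence: They ask for a quote for 40 seats.\nLabel: decision"
    )
    print(label)      # store this
    print(rationale)  # log or discard this
```

Splitting the reply this way lets you store only the label while still keeping the rationale available for spot checks on the ambiguous leads.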

Categorizing Internal Documents for a Knowledge Base

What the prompt did

Documents were sorted into roughly twelve topic categories. With that many labels, the model began drifting toward whichever categories appeared first in the list and occasionally invented labels that were not offered.

Why label count and ordering matter

Long category lists strain the model's attention and introduce position bias. Randomizing label order across runs and constraining output to the exact allowed set fixed the invented-label problem. When you genuinely need many categories, a two-stage approach, sorting into broad buckets first and then refining, beats one giant flat list.
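A minimal sketch of the first two fixes, shuffling label order per run and rejecting anything outside the allowed set, might look like this. The majority vote across runs is an added assumption of this example rather than something the scenario describes, and call_model is a hypothetical callable that takes a prompt and returns the model's text.

```python
import random
from collections import Counter
from typing import Callable, Optional, Sequence


def shuffled_prompt(text: str, labels: Sequence[str], rng: random.Random) -> str:
    """Build a prompt whose label list is in a fresh random order."""
    order = list(labels)
    rng.shuffle(order)
    return (
        "Choose exactly one label from this list: " + ", ".join(order)
        + f"\n\nDocument:\n{text}\n\nLabel:"
    )


def classify_with_shuffles(
    text: str,
    labels: Sequence[str],
    call_model: Callable[[str], str],
    runs: int = 3,
    seed: int = 0,
) -> Optional[str]:
    """Ask several times with reshuffled labels; keep only answers from the allowed set."""
    rng = random.Random(seed)
    allowed = set(labels)
    votes: Counter = Counter()
    for _ in range(runs):
        answer = call_model(shuffled_prompt(text, labels, rng)).strip()
        if answer in allowed:  # invented labels are dropped, not stored
            votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Returning None when every run produces an off-list answer keeps invented labels out of the knowledge base and gives you something concrete to count.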

Detecting Language for Multilingual Routing

What the prompt did

A global support queue needed to route messages to the correct regional team by language before any content classification happened. The prompt asked the model to identify the primary language of each message from a list of supported languages and return the team code. This is one of the cleanest possible zero-shot tasks, and accuracy was near-perfect across the major languages.

Why it was nearly flawless

Language identification has an unambiguous ground truth and a signal that saturates the text. There is no boundary dispute to resolve, no judgment call about degree. The only failures came from very short messages, such as a one-word reply like "thanks", where there was genuinely not enough signal to decide. The lesson is that the cleaner the ground truth, the better zero-shot performs, which is why it is worth asking whether your task has a crisp answer before you build.

Unambiguous ground truth produces near-perfect results
Failures cluster on extremely short inputs with little signal
Mixed-language messages need a tie-break rule in the prompt
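The last two points suggest guarding the model call rather than trusting it on every input. In the sketch below the team codes, the three-word minimum, the fallback queue, and the tie-break wording are all assumptions. It also maps language to team code in ordinary code instead of asking the model for the code directly, which is a design choice of this example; detect is a hypothetical callable wrapping the model.

```python
from typing import Callable

# Hypothetical team codes and thresholds; adjust to your own queue.
TEAM_BY_LANGUAGE = {
    "english": "EN-SUPPORT",
    "spanish": "ES-SUPPORT",
    "german": "DE-SUPPORT",
}
FALLBACK_TEAM = "TRIAGE"
MIN_WORDS = 3  # whitespace word count; unsuitable for languages written without spaces


def build_language_prompt(message: str) -> str:
    """Prompt with an explicit tie-break rule for mixed-language messages."""
    return (
        "Identify the primary language of the message. Choose one of: "
        + ", ".join(TEAM_BY_LANGUAGE)
        + ". If the message mixes languages, choose the language that makes up "
        "most of the text.\n\n"
        f"Message:\n{message}\n\nLanguage:"
    )


def route(message: str, detect: Callable[[str], str]) -> str:
    """Return a team code, sending low-signal or unrecognized cases to the fallback."""
    if len(message.split()) < MIN_WORDS:
        return FALLBACK_TEAM  # too short to decide, skip the model entirely
    language = detect(build_language_prompt(message)).strip().lower()
    return TEAM_BY_LANGUAGE.get(language, FALLBACK_TEAM)
```

Keeping the language-to-team mapping in code means a new regional team is a one-line change and the model can never emit a team code that does not exist.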

Extracting Intent From Voice Transcripts

What the prompt did

Call-center transcripts were classified by caller intent: cancel, upgrade, complaint, or question. Transcripts are messier than written text, full of filler, false starts, and transcription errors. The team expected this to break zero-shot, and at first it did, with accuracy dragged down by the noise.

What made it recover

The fix was to instruct the model to focus on the caller's final stated goal and ignore conversational filler. Adding that single guiding sentence lifted accuracy substantially, because it told the model which part of the noisy text actually carried the signal. The broader takeaway is that when text is noisy, the prompt's job is to point the model at the signal rather than leaving it to guess. This connects to the rationale technique covered in Your Fastest Credible Path to a Working Untrained Classifier, where grounding the decision in specific evidence improves results on messy inputs.

Why noisy inputs are still workable

Noise does not doom zero-shot; vague instructions do. A prompt that tells the model where to look survives a surprising amount of transcription garbage, which makes this approach viable even for imperfect speech-to-text pipelines.
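To make the pointing instruction concrete, here is a small sketch of a prompt builder with the guiding sentence as a switch, so the same transcripts can be run with and without it. The wording of FOCUS_INSTRUCTION and the name build_intent_prompt are assumptions for this example, not the team's actual prompt.

```python
INTENTS = ("cancel", "upgrade", "complaint", "question")

# Assumed wording of the single guiding sentence.
FOCUS_INSTRUCTION = (
    "Focus on the caller's final stated goal. Ignore filler words, false starts, "
    "and likely transcription errors."
)


def build_intent_prompt(transcript: str, focus: bool = True) -> str:
    """Build the intent prompt; focus=False gives the baseline for comparison."""
    parts = ["Classify the caller's intent as exactly one of: " + ", ".join(INTENTS) + "."]
    if focus:
        parts.append(FOCUS_INSTRUCTION)
    parts.append("Transcript:\n" + transcript.strip())
    parts.append("Intent:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    noisy = "um yeah hi so I was uh I wanted to ask about, no actually, I want to cancel"
    print(build_intent_prompt(noisy))
```

Running both variants over the same hand-checked transcripts is the simplest way to confirm the sentence is doing the work you think it is.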

Reading the Pattern Across All Seven Cases

The common thread

Across every scenario, two conditions predicted success better than anything else: whether the categories had a crisp, agreed boundary, and whether the signal lived in the text. Department routing and language detection had both and excelled. Sentiment grades lacked a crisp boundary and struggled until collapsed. Noisy transcripts had the signal buried, and a pointing instruction recovered it.

What this means for your own builds

Before writing a prompt, run the two-question test. Can a human label these from the text alone, and do reasonable people agree on the categories? If yes, zero-shot is a strong bet. If no, expect to either restructure the categories or add examples. The measurement discipline in Reading the Signal When Your Classifier Never Saw Training Data will tell you which case you are actually in rather than which one you hoped for.

Frequently Asked Questions

Do these examples require a large, expensive model?

The clean cases ran acceptably on mid-tier models. The harder ones, nuanced sentiment and policy edge cases, improved meaningfully on stronger models. Match model size to category difficulty rather than defaulting to the largest option for every task.

How many categories is too many for zero-shot?

There is no fixed ceiling, but reliability tends to degrade past eight to ten flat labels. Beyond that, structure the problem hierarchically so each individual decision stays small and the model never weighs a dozen options at once.
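A hierarchical setup can be sketched as two small decisions in sequence. The taxonomy below is invented, and pick is a hypothetical callable that wraps a model call and returns one of the options it was given.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Invented two-level taxonomy for illustration.
TAXONOMY: Dict[str, List[str]] = {
    "people": ["hiring", "benefits", "performance"],
    "engineering": ["architecture", "runbooks", "security"],
    "finance": ["budgeting", "expenses", "procurement"],
}


def classify_two_stage(
    text: str,
    taxonomy: Dict[str, List[str]],
    pick: Callable[[str, List[str]], str],
) -> Optional[Tuple[str, str]]:
    """Pick a broad bucket first, then a leaf within it; None if either step goes off-list."""
    bucket = pick(text, list(taxonomy))
    if bucket not in taxonomy:
        return None
    leaf = pick(text, taxonomy[bucket])
    if leaf not in taxonomy[bucket]:
        return None
    return bucket, leaf
```

With this shape the model weighs three options at each step instead of nine at once, at the cost of a second call per document.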

What is the fastest way to tell if zero-shot will work for my data?

Hand-label fifty examples yourself, run the prompt, and compare. Fifty is enough to expose obvious failure patterns within an hour and costs almost nothing. If you cannot agree with yourself on the labels, the model will not either.
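The comparison itself needs very little code. This sketch assumes you already have two parallel lists, your hand labels and the model's predictions; the function name evaluate and the sample data are illustrative only.

```python
from collections import Counter
from typing import List, Tuple


def evaluate(gold: List[str], predicted: List[str]) -> Tuple[float, list]:
    """Return accuracy plus the most common (gold, predicted) confusions."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    accuracy = 1 - sum(errors.values()) / len(gold)
    return accuracy, errors.most_common()


if __name__ == "__main__":
    # Tiny invented sample; in practice use your fifty hand-labeled examples.
    gold = ["billing", "billing", "technical_support", "account_management"]
    predicted = ["billing", "technical_support", "technical_support", "account_management"]
    accuracy, confusions = evaluate(gold, predicted)
    print(accuracy)
    for (g, p), count in confusions:
        print(f"{g} mislabeled as {p}: {count}")
```

The confusion pairs matter more than the headline number: a single pair that dominates the errors usually points to two categories whose descriptions overlap.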

Should I show the model example texts despite calling this zero-shot?

If you add examples, you have moved to few-shot prompting, which is a legitimate and often better choice. Zero-shot is the baseline you reach for when you have no labeled data at all or want the simplest possible starting point.

Key Takeaways

Zero-shot classification succeeds when categories are distinct and the signal sits near the surface of short, intent-rich text.
Over-fine categories, like five sentiment grades, invite disagreement the model cannot resolve without examples; collapse them.
Imperfect classifiers still earn their place as triage layers that escalate uncertain cases to humans.
Forcing a one-sentence rationale before the label grounds decisions and lifts accuracy on ambiguous inputs.
Long, fixed-order label lists cause position bias and invented labels; randomize order and structure large taxonomies hierarchically.

Agency Script Editorial
