Delivering Zero-Shot and Few-Shot Learning Solutions: The Agency Advantage

A legal technology startup came to a four-person AI agency in New York with a classification problem: they needed to categorize incoming legal documents into 47 document types — contracts, depositions, motions, briefs, and 43 others. Traditional supervised learning would have required thousands of labeled examples per category. The client had about 200 labeled documents total, unevenly distributed across categories. Some categories had 15 examples. Others had zero.

The agency's previous approach would have been to tell the client "you need more labeled data" and scope a three-month data labeling project before any model development. Instead, they built a zero-shot classification system using a large language model. The system took a document and a list of category descriptions and predicted the most appropriate category — without any labeled training examples. Out of the box, it hit 71% accuracy across all 47 categories.

Then they added few-shot learning. For the 20 categories where the client had at least 10 labeled examples, they used those examples to guide the model. Accuracy on those categories jumped to 84%. For the remaining 27 categories with fewer examples, the zero-shot approach held at 71%. The system went into production in five weeks — not five months.

The client's CTO described it as "magic." It was not magic. It was a fundamental shift in how AI agencies can deliver value: instead of starting with "how much labeled data do you have?" you start with "what problem do you need solved?" and figure out the data requirements after.

The Zero-Shot and Few-Shot Revolution for Agency Work

Zero-shot and few-shot learning fundamentally change the economics of AI agency delivery. Here is why:

Data labeling was your biggest cost center. For traditional supervised learning, you needed thousands to millions of labeled examples. Getting those labels required weeks or months of data preparation. With zero-shot and few-shot approaches, you can deliver working systems with minimal or no labeled data.

Time to value collapses. Instead of a 6-month project (3 months labeling, 2 months training, 1 month deployment), you deliver a working prototype in days and a production system in weeks.

More problems become solvable. Clients with niche classification needs, rare event detection requirements, or rapidly changing categories could not use traditional ML because they could never accumulate enough labeled data. Zero-shot and few-shot approaches make these problems tractable.

You can serve smaller clients. A startup with 500 support tickets cannot train a traditional text classifier. But they can use zero-shot classification to automatically route those tickets today.

Proofs of concept close faster. When you can demonstrate a working solution in a one-week sprint instead of a three-month data labeling initiative, clients commit faster and with less risk.

Understanding the Approaches

Zero-Shot Learning

What it is: Making predictions on categories the model has never been explicitly trained on, using only a description of the category.

How it works in practice: You provide the model with an input (text or an image) and a set of candidate labels with descriptions. The model uses its pre-trained knowledge to match the input to the most appropriate label.

Example in natural language:

Input: "The defendant filed a motion requesting the court to compel the plaintiff to produce financial records."
Candidate labels: ["Motion to Compel", "Motion to Dismiss", "Summary Judgment", "Discovery Request", "Complaint"]
Output: "Motion to Compel" (confidence: 0.89)

The model was never trained on legal documents specifically. It uses its general language understanding to make the classification.
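A minimal sketch of how the zero-shot call above might be assembled, assuming a generic text-in, text-out LLM client. The function names, the `complete` callable, the prompt wording, and the second label description are all illustrative assumptions, not the agency's actual implementation.

```python
from typing import Callable, Dict, Optional

def build_zero_shot_prompt(document: str, labels: Dict[str, str]) -> str:
    """Assemble a classification prompt from label names and descriptions."""
    lines = ["Classify the document into exactly one of these categories:", ""]
    for name, description in labels.items():
        lines.append(f"- {name}: {description}")
    lines += ["", "Document:", "---", document, "---", "",
              "Answer with the category name only."]
    return "\n".join(lines)

def parse_label(response: str, labels: Dict[str, str]) -> Optional[str]:
    """Map the model's free-text reply onto a known label, or None."""
    cleaned = response.strip().strip('".').lower()
    for name in labels:
        if name.lower() == cleaned:
            return name
    return None  # unrecognized answer: a candidate for human review

def classify(document: str, labels: Dict[str, str],
             complete: Callable[[str], str]) -> Optional[str]:
    """`complete` is a placeholder for whatever LLM client the project uses."""
    return parse_label(complete(build_zero_shot_prompt(document, labels)), labels)

# Illustrative label descriptions (wording is an assumption, not client data).
LABELS = {
    "Motion to Compel": "A motion asking the court to force the opposing "
                        "party to comply with a discovery request or court order.",
    "Motion to Dismiss": "A motion asking the court to throw out a case or claim.",
}

def fake_llm(prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    return '"Motion to Compel."'

print(classify("The defendant filed a motion requesting the court to compel "
               "the plaintiff to produce financial records.", LABELS, fake_llm))
```

This sketch returns only a label. Producing a confidence score like the one in the example would need an extra step, such as token log-probabilities where the provider exposes them, which is left out here.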

When zero-shot works well:

Text classification where category names are descriptive
Sentiment analysis with custom sentiment scales
Topic categorization
Intent detection for new, previously undefined intents
Image classification using CLIP-style models with descriptive labels

When zero-shot struggles:

Highly technical domains where the category distinctions require specialized knowledge
Categories that are semantically similar (distinguishing between 10 types of legal motions is harder than distinguishing a motion from a contract)
Numerical prediction tasks
Tasks requiring reasoning over structured data

Few-Shot Learning

What it is: Making predictions using only a handful of labeled examples per category — typically 1 to 50 examples.

How it works in practice: You provide the model with a few examples of each category, then ask it to classify new inputs based on those examples. The examples serve as a "template" for what each category looks like.

Two main implementation approaches:

In-context learning (prompt-based): Include examples directly in the prompt to a large language model. The model uses the examples as context to classify new inputs.

Prompt structure:

"Here are examples of 'Motion to Compel': [example 1], [example 2], [example 3]"
"Here are examples of 'Motion to Dismiss': [example 1], [example 2], [example 3]"
"Classify this document: [new document]"
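The prompt structure above could be generated from a dictionary of labeled examples. The following is a sketch under assumptions: the `###` section markers, the numbering, and the sample texts are hypothetical choices, not a required format.

```python
from typing import Dict, List

def build_few_shot_prompt(examples: Dict[str, List[str]], document: str) -> str:
    """Lay out labeled examples per category, then the document to classify."""
    sections = []
    for label, texts in examples.items():
        numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(texts, start=1))
        sections.append(f"### Examples of '{label}'\n{numbered}")
    sections.append(f"### Classify this document\n{document}")
    sections.append("Answer with one category name from the examples above.")
    return "\n\n".join(sections)

# Hypothetical example texts, shortened for illustration.
EXAMPLES = {
    "Motion to Compel": [
        "Plaintiff moves to compel production of the requested emails.",
        "Defendant asks the court to order responses to interrogatories.",
    ],
    "Motion to Dismiss": [
        "Defendant moves to dismiss the complaint for failure to state a claim.",
    ],
}

print(build_few_shot_prompt(EXAMPLES, "Movant requests an order compelling "
                                      "the deposition of the named witness."))
```

Keeping the layout in one function makes it easy to hold formatting constant while the example set changes.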

Metric learning (embedding-based): Encode examples into an embedding space, then classify new inputs based on their proximity to the example embeddings. This is faster at inference time and scales better to many categories.

When few-shot works well:

When you have 5-50 examples per category (the sweet spot)
When categories are semantically distinct
When examples are representative of the category variation
When the task is classification, not generation

When to prefer few-shot over zero-shot:

When zero-shot accuracy is below your threshold
When you have examples available (even a handful)
When category names alone are insufficient to distinguish categories
When you need higher precision on specific categories

Delivery Patterns for Zero-Shot and Few-Shot Systems

Pattern 1: The Zero-Shot Prototype to Few-Shot Production Pipeline

This is the most common delivery pattern for agency work.

Week 1: Zero-shot prototype.

Set up the base model (typically a large language model like GPT-4, Claude, or an open-source alternative)
Define the label taxonomy with the client
Write descriptive label definitions
Run zero-shot classification on a sample of the client's data
Measure accuracy and identify weak categories

Week 2: Few-shot enhancement.

For categories where zero-shot underperforms, collect 10-20 examples from the client
Implement few-shot prompting or embedding-based classification
Measure the improvement
Identify categories that still underperform

Week 3: Hybrid system.

Use few-shot for categories with sufficient examples
Use zero-shot for categories without examples
Add confidence thresholds — route low-confidence predictions to human review
Implement the feedback loop: human-reviewed predictions become new few-shot examples
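The confidence threshold and feedback loop from Week 3 can be sketched in a few lines. The 0.75 cutoff, the cap of 20 examples per label, and the names `route` and `ExampleStore` are assumptions for illustration; real values would be tuned on the client's data.

```python
from collections import defaultdict
from typing import DefaultDict, List

REVIEW_THRESHOLD = 0.75  # hypothetical cutoff, to be tuned per client

def route(confidence: float, threshold: float = REVIEW_THRESHOLD) -> str:
    """Decide whether a prediction is accepted or sent to a reviewer."""
    return "auto_accept" if confidence >= threshold else "human_review"

class ExampleStore:
    """Holds human-confirmed examples that feed back into few-shot prompts."""

    def __init__(self, max_per_label: int = 20) -> None:
        self.max_per_label = max_per_label
        self._examples: DefaultDict[str, List[str]] = defaultdict(list)

    def add_reviewed(self, text: str, reviewed_label: str) -> None:
        """Store a reviewed document under the label the human chose."""
        bucket = self._examples[reviewed_label]
        if text not in bucket and len(bucket) < self.max_per_label:
            bucket.append(text)

    def examples_for(self, label: str) -> List[str]:
        return list(self._examples.get(label, []))

store = ExampleStore()
prediction = ("Motion to Dismiss", 0.52)  # (label, confidence) from the model
if route(prediction[1]) == "human_review":
    # The reviewer corrects the label; the corrected pair becomes an example.
    store.add_reviewed("Movant seeks an order compelling discovery.",
                       "Motion to Compel")
print(store.examples_for("Motion to Compel"))
```

Capping the examples per label keeps prompt length, and therefore per-call cost, bounded as reviews accumulate.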

Weeks 4-5: Production deployment.

Build the serving infrastructure
Implement monitoring (accuracy tracking, confidence distribution, category distribution)
Set up the human-in-the-loop review interface
Deploy and validate

Total delivery time: 5 weeks. Compare that to 5 months for a traditional supervised learning approach.

Pattern 2: The Embedding-Based Few-Shot System

Best for: classification tasks with many categories and moderate accuracy requirements.

Encode all few-shot examples using a sentence transformer or similar embedding model
Store example embeddings in a vector database (Pinecone, Weaviate, Qdrant)
For new inputs, encode the input and find the nearest example embeddings
Classify based on the majority category of the K nearest examples
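The four steps above amount to k-nearest-neighbor voting over embeddings. Here is a dependency-free sketch: the 3-dimensional vectors are toy placeholders for what a sentence transformer would produce, and the in-memory list stands in for a vector database.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (embedding, category)

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_classify(query: Sequence[float], examples: List[Example],
                 k: int = 3) -> Tuple[str, List[Example]]:
    """Return the majority category of the k most similar examples."""
    ranked = sorted(examples, key=lambda ex: cosine(query, ex[0]), reverse=True)
    neighbors = ranked[:k]
    # Ties go to the label that appears first among the nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors

# Toy embeddings; real ones would have hundreds of dimensions.
EXAMPLES: List[Example] = [
    ([0.9, 0.1, 0.0], "Contract"),
    ([0.8, 0.2, 0.1], "Contract"),
    ([0.1, 0.9, 0.1], "Deposition"),
    ([0.0, 0.8, 0.3], "Deposition"),
    ([0.1, 0.2, 0.9], "Motion"),
]

label, neighbors = knn_classify([0.85, 0.15, 0.05], EXAMPLES, k=3)
print(label, [category for _, category in neighbors])
```

Returning the neighbors alongside the label is what makes the prediction explainable: they are the examples the decision was based on.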

Advantages:

Fast inference (embedding + vector search is much cheaper than LLM inference)
Easy to add new categories (just add examples, no retraining)
Scales to thousands of categories
Interpretable — you can show the user which examples the prediction is based on

Disadvantages:

Lower accuracy than LLM-based approaches for complex tasks
Requires good embeddings for the domain
Sensitive to the quality and representativeness of examples

Pattern 3: The Progressive Refinement Pipeline

Best for: clients who will accumulate labeled data over time.

Phase 1 (Weeks 1-3): Zero-shot deployment.

Deploy a zero-shot system for immediate value
Every prediction is logged with its confidence score
Low-confidence predictions are routed to human review

Phase 2 (Months 1-3): Few-shot enhancement.

As human-reviewed predictions accumulate, use them as few-shot examples
Accuracy improves automatically as the example set grows
Monitor which categories benefit most from additional examples

Phase 3 (Months 3-6): Fine-tuned model.

Once sufficient labeled data has accumulated (typically 500+ examples per category), fine-tune a domain-specific model
The fine-tuned model replaces the few-shot system for high-volume categories
Few-shot continues for low-volume and new categories

Phase 4 (Ongoing): Hybrid system.

Fine-tuned models handle the common, well-represented categories
Few-shot handles uncommon and new categories
Zero-shot handles never-before-seen categories
The system automatically promotes categories from zero-shot to few-shot to fine-tuned as data accumulates
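The promotion rule can be expressed as a function of how many reviewed examples each category has. In this sketch the 500-example cutoff comes from Phase 3 above, while treating any category with at least one example as few-shot is an assumption; a real system might require more before switching.

```python
from typing import Dict

FINE_TUNE_MIN = 500  # "500+ examples per category" from Phase 3
FEW_SHOT_MIN = 1     # assumed: any confirmed example enables few-shot

def tier_for(example_count: int) -> str:
    """Pick the serving tier for a category from its labeled-example count."""
    if example_count >= FINE_TUNE_MIN:
        return "fine-tuned"
    if example_count >= FEW_SHOT_MIN:
        return "few-shot"
    return "zero-shot"

def plan_tiers(counts: Dict[str, int]) -> Dict[str, str]:
    """Map every category to the tier that should serve it right now."""
    return {label: tier_for(count) for label, count in counts.items()}

# Hypothetical counts for three categories.
print(plan_tiers({"Contract": 1200, "Deposition": 14, "Writ": 0}))
```

Re-running `plan_tiers` on a schedule, as review counts grow, is one simple way to make the promotion automatic.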

This pattern is excellent for agency recurring revenue. Each phase is a separate engagement, and the client sees continuous improvement that justifies ongoing investment.

Practical Implementation Considerations

Prompt Engineering for Zero-Shot and Few-Shot

Label descriptions matter enormously for zero-shot. "Motion to Compel" is okay. "A legal motion filed by one party to force the opposing party to comply with a discovery request or court order" is much better. Invest time in writing precise, descriptive label definitions.

Example selection matters enormously for few-shot. The examples you choose shape the model's understanding of each category. Select examples that:

Represent the diversity within the category (short and long, formal and informal)
Are clearly within the category (avoid borderline examples that could belong to multiple categories)
Cover the most common variations

Example ordering can affect results. Language models can show recency bias, giving the last few examples in the prompt disproportionate weight. Randomize example order across predictions to reduce this effect.
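One way to randomize order while keeping runs reproducible is to seed the shuffle from the input document itself. This is a sketch of that idea; deriving the seed from a SHA-256 hash is a design assumption, and the example pairs are hypothetical.

```python
import hashlib
import random
from typing import List, Tuple

LabeledExample = Tuple[str, str]  # (text, label)

def order_for(document: str, examples: List[LabeledExample]) -> List[LabeledExample]:
    """Shuffle examples with a seed derived from the document being classified.

    Different documents see different orders, but re-running the same
    document reproduces the same prompt, which helps when auditing.
    """
    digest = hashlib.sha256(document.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    shuffled = list(examples)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled

EXAMPLES: List[LabeledExample] = [
    ("Plaintiff moves to compel production.", "Motion to Compel"),
    ("Defendant moves to dismiss the complaint.", "Motion to Dismiss"),
    ("This agreement is made between the parties.", "Contract"),
    ("Q. Please state your name for the record.", "Deposition"),
]

print([label for _, label in order_for("first document", EXAMPLES)])
print([label for _, label in order_for("second document", EXAMPLES)])
```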

Prompt formatting affects accuracy. Structured prompts with clear delimiters between examples and consistent formatting outperform unstructured prompts. Use explicit separators, numbered examples, and consistent label formatting.

Cost Management

LLM-based zero-shot and few-shot systems incur per-prediction API costs. For high-volume applications, these costs add up.

Cost optimization strategies:

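Caching. If the same or very similar inputs appear frequently, cache the predictions. A simple hash-based cache, which matches exact repeats after normalization, can reduce API calls by 30-60% for many applications with repetitive inputs; catching near-duplicates requires a similarity-based lookup instead.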
Tiered classification. Use a cheap, fast model (embedding-based) as a first pass. Only route uncertain predictions to the expensive LLM.
Batch processing. For non-real-time applications, batch predictions and send them in bulk. Many API providers offer lower per-token costs for batch requests.
Model distillation. Use the LLM's predictions to train a smaller, cheaper model. Once the smaller model reaches acceptable accuracy, switch to it for production inference.
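The first two strategies combine naturally: check a cache, then try the cheap model, and call the LLM only when the cheap model is unsure. A sketch, assuming both classifiers return a `(label, confidence)` pair; the 0.8 cutoff and the stand-in classifiers are hypothetical.

```python
import hashlib
from typing import Callable, Dict, Tuple

Prediction = Tuple[str, float]  # (label, confidence)

class CachedTieredClassifier:
    """Exact-match cache in front of a cheap model with an LLM fallback."""

    def __init__(self, cheap: Callable[[str], Prediction],
                 expensive: Callable[[str], Prediction],
                 min_confidence: float = 0.8) -> None:
        self.cheap = cheap
        self.expensive = expensive
        self.min_confidence = min_confidence
        self._cache: Dict[str, Prediction] = {}
        self.expensive_calls = 0  # counter for cost monitoring

    @staticmethod
    def _key(text: str) -> str:
        # Normalize case and whitespace so trivial variants share a key.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def classify(self, text: str) -> Prediction:
        key = self._key(text)
        if key in self._cache:
            return self._cache[key]
        prediction = self.cheap(text)
        if prediction[1] < self.min_confidence:
            self.expensive_calls += 1
            prediction = self.expensive(text)
        self._cache[key] = prediction
        return prediction

# Stand-ins for an embedding classifier and an LLM call.
def cheap_model(text: str) -> Prediction:
    return ("Contract", 0.95) if "agreement" in text.lower() else ("Motion", 0.4)

def llm_model(text: str) -> Prediction:
    return ("Motion to Compel", 0.9)

clf = CachedTieredClassifier(cheap_model, llm_model)
print(clf.classify("This Agreement is made between the parties."))
print(clf.classify("Motion to compel production"))
print(clf.classify("  motion to COMPEL   production "))  # served from cache
print("LLM calls:", clf.expensive_calls)
```

Note that this cache only catches repeats that are identical after normalization; near-duplicates would still reach the models.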

Monitoring and Evaluation

Track accuracy by category. Overall accuracy can mask poor performance on specific categories. Monitor per-category precision and recall.
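A small sketch of per-category precision and recall computed from audited `(true label, predicted label)` pairs. The sample data is made up to show how 90% overall accuracy can hide a category with 50% recall.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def per_category_metrics(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Compute precision and recall per label from (truth, predicted) pairs."""
    tp: Counter = Counter()
    fp: Counter = Counter()
    fn: Counter = Counter()
    for truth, predicted in pairs:
        if truth == predicted:
            tp[truth] += 1
        else:
            fp[predicted] += 1  # predicted label was wrong for this item
            fn[truth] += 1      # true label was missed
    report = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted_total = tp[label] + fp[label]
        actual_total = tp[label] + fn[label]
        report[label] = {
            "precision": tp[label] / predicted_total if predicted_total else 0.0,
            "recall": tp[label] / actual_total if actual_total else 0.0,
        }
    return report

# Hypothetical audit sample: 9 of 10 correct overall.
PAIRS: List[Tuple[str, str]] = (
    [("Contract", "Contract")] * 8
    + [("Motion", "Contract")]
    + [("Motion", "Motion")]
)

accuracy = sum(t == p for t, p in PAIRS) / len(PAIRS)
print("overall accuracy:", accuracy)
for label, scores in per_category_metrics(PAIRS).items():
    print(label, scores)
```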

Track confidence distributions. If the model's confidence scores are drifting downward, it may be encountering inputs that are increasingly different from what it expects.

Track human override rates. If human reviewers frequently override the model's predictions for a specific category, the model is not performing well on that category. Use the overrides as new examples to improve.

Run periodic accuracy audits. Sample predictions, have humans label them independently, and compare. This catches gradual degradation that automated metrics might miss.

Pricing Zero-Shot and Few-Shot Projects

Do not price these projects based on the time they take to build. A zero-shot system that takes two weeks to deploy solves the same business problem as a traditional system that takes five months. Price on value, not effort.

Typical pricing:

Zero-shot prototype (1-2 weeks): $15,000 - $30,000
Few-shot production system (3-5 weeks): $40,000 - $80,000
Progressive refinement (ongoing): $5,000 - $10,000 per month
Monitoring and optimization: $3,000 - $6,000 per month

The ongoing costs include API fees. Either pass these through to the client at cost plus a management fee, or include them in the monthly retainer with a volume cap.

Value framing: "This system categorizes 10,000 documents per month that would otherwise require 2 full-time paralegals at $65,000 each, or $130,000 per year in salary. The system costs $8,000 per month including API fees and maintenance, or $96,000 per year. That is roughly a 26% cost reduction with faster turnaround."

Your Next Step

Identify one client project where labeled data is the bottleneck: either you do not have enough labeled data to train a traditional model, or the time to label data is delaying the project. Build a zero-shot prototype using an LLM API. Evaluate it on whatever labeled data you do have. If it hits 70% or better accuracy, you have a viable starting point that can be deployed in weeks and improved over time. If it falls short of that, try adding 10-20 examples per category for the underperforming classes. The improvement from zero-shot to few-shot is often dramatic enough to cross the viability threshold.

Agency Script Editorial
