Delivering Zero-Shot and Few-Shot Learning Solutions: The Agency Advantage
A legal technology startup came to a four-person AI agency in New York with a classification problem: they needed to categorize incoming legal documents into 47 document types: contracts, depositions, motions, briefs, and 43 others. Traditional supervised learning would have required thousands of labeled examples per category. The client had about 200 labeled documents total, unevenly distributed across categories. Some categories had 15 examples. Others had zero.
The agency's previous approach would have been to tell the client "you need more labeled data" and scope a three-month data labeling project before any model development. Instead, they built a zero-shot classification system using a large language model. The system took a document and a list of category descriptions and predicted the most appropriate category, without any labeled training examples. Out of the box, it hit 71% accuracy across all 47 categories.
Then they added few-shot learning. For the 20 categories where the client had at least 10 labeled examples, they used those examples to guide the model. Accuracy on those categories jumped to 84%. For the remaining 27 categories with fewer examples, the zero-shot approach held at 71%. The system went into production in five weeks, not five months.
The client's CTO described it as "magic." It was not magic. It was a fundamental shift in how AI agencies can deliver value: instead of starting with "how much labeled data do you have?" you start with "what problem do you need solved?" and figure out the data requirements after.
The Zero-Shot and Few-Shot Revolution for Agency Work
Zero-shot and few-shot learning fundamentally change the economics of AI agency delivery. Here is why:
Data labeling was your biggest cost center. For traditional supervised learning, you needed thousands to millions of labeled examples. Getting those labels required weeks or months of data preparation. With zero-shot and few-shot approaches, you can deliver working systems with minimal or no labeled data.
Time to value collapses. Instead of a 6-month project (3 months labeling, 2 months training, 1 month deployment), you deliver a working prototype in days and a production system in weeks.
More problems become solvable. Clients with niche classification needs, rare event detection requirements, or rapidly changing categories could not use traditional ML because they could never accumulate enough labeled data. Zero-shot and few-shot approaches make these problems tractable.
You can serve smaller clients. A startup with 500 support tickets cannot train a traditional text classifier. But they can use zero-shot classification to automatically route those tickets today.
Proofs of concept close faster. When you can demonstrate a working solution in a one-week sprint instead of a three-month data labeling initiative, clients commit faster and with less risk.
Understanding the Approaches
Zero-Shot Learning
What it is: Making predictions on categories the model has never been explicitly trained on, using only a description of the category.
How it works in practice: You provide the model with an input (text, image) and a set of candidate labels with descriptions. The model uses its pre-trained knowledge to match the input to the most appropriate label.
Example in natural language:
Input: "The defendant filed a motion requesting the court to compel the plaintiff to produce financial records."
Candidate labels: ["Motion to Compel", "Motion to Dismiss", "Summary Judgment", "Discovery Request", "Complaint"]
Output: "Motion to Compel" (confidence: 0.89)
The model was never fine-tuned on labeled legal documents. It uses the general language understanding it acquired in pre-training to make the classification.
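The flow in this example can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `complete` is a hypothetical wrapper around whichever LLM API you use (it takes a prompt string and returns the model's text reply), `fake_llm` is a stand-in so the sketch runs offline, and the "Motion to Dismiss" description is an illustrative placeholder.

```python
from typing import Callable, Dict


def build_zero_shot_prompt(document: str, labels: Dict[str, str]) -> str:
    """Build a prompt that lists each candidate label with its description."""
    label_lines = "\n".join(f"- {name}: {desc}" for name, desc in labels.items())
    return (
        "Classify the document into exactly one of the categories below.\n"
        "Reply with the category name only.\n\n"
        f"Categories:\n{label_lines}\n\n"
        f"Document:\n{document}\n\n"
        "Category:"
    )


def classify_zero_shot(
    document: str,
    labels: Dict[str, str],
    complete: Callable[[str], str],
) -> str:
    """Return the predicted label, or 'UNKNOWN' if the reply is not a known label.

    `complete` is a hypothetical prompt -> reply function, not a real library call.
    """
    reply = complete(build_zero_shot_prompt(document, labels)).strip()
    # Match case-insensitively so "motion to compel" still maps to the label.
    by_lower = {name.lower(): name for name in labels}
    return by_lower.get(reply.lower(), "UNKNOWN")


LABELS = {
    "Motion to Compel": (
        "A legal motion filed by one party to force the opposing party "
        "to comply with a discovery request or court order."
    ),
    # Illustrative placeholder description, not taken from the case study.
    "Motion to Dismiss": "A motion asking the court to dismiss the case.",
}


def fake_llm(prompt: str) -> str:
    # Stand-in for a real API call so the sketch runs offline.
    return "motion to compel"


document = (
    "The defendant filed a motion requesting the court to compel "
    "the plaintiff to produce financial records."
)
print(classify_zero_shot(document, LABELS, fake_llm))
```

Returning "UNKNOWN" for any reply that is not one of the candidate labels is a deliberate choice here: it keeps malformed model output from silently entering downstream systems.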
When zero-shot works well:
- Text classification where category names are descriptive
- Sentiment analysis with custom sentiment scales
- Topic categorization
- Intent detection for new, previously undefined intents
- Image classification using CLIP-style models with descriptive labels
When zero-shot struggles:
- Highly technical domains where the category distinctions require specialized knowledge
- Categories that are semantically similar (distinguishing between 10 types of legal motions is harder than distinguishing a motion from a contract)
- Numerical prediction tasks
- Tasks requiring reasoning over structured data
Few-Shot Learning
What it is: Making predictions using only a handful of labeled examples per category, typically 1 to 50.
How it works in practice: You provide the model with a few examples of each category, then ask it to classify new inputs based on those examples. The examples serve as a "template" for what each category looks like.
Two main implementation approaches:
In-context learning (prompt-based): Include examples directly in the prompt to a large language model. The model uses the examples as context to classify new inputs.
Prompt structure:
- "Here are examples of 'Motion to Compel': [example 1], [example 2], [example 3]"
- "Here are examples of 'Motion to Dismiss': [example 1], [example 2], [example 3]"
- "Classify this document: [new document]"
Metric learning (embedding-based): Encode examples into an embedding space, then classify new inputs based on their proximity to the example embeddings. This is faster at inference time and scales better to many categories.
When few-shot works well:
- When you have 5-50 examples per category (the sweet spot)
- When categories are semantically distinct
- When examples are representative of the category variation
- When the task is classification, not generation
When to prefer few-shot over zero-shot:
- When zero-shot accuracy is below your threshold
- When you have examples available (even a handful)
- When category names alone are insufficient to distinguish categories
- When you need higher precision on specific categories
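The in-context prompt structure described above can be assembled mechanically from a dictionary of examples. The sketch below assumes examples are stored as plain strings keyed by category name; the function name and the sample snippets are made up for illustration.

```python
from typing import Dict, List


def build_few_shot_prompt(examples: Dict[str, List[str]], document: str) -> str:
    """Group labeled examples by category, then append the document to classify."""
    sections = []
    for label, texts in examples.items():
        numbered = "\n".join(f"  {i}. {text}" for i, text in enumerate(texts, 1))
        sections.append(f"Here are examples of '{label}':\n{numbered}")
    sections.append(f"Classify this document:\n{document}")
    sections.append("Answer with one category name from the examples above.")
    return "\n\n".join(sections)


# Made-up snippets standing in for real client documents.
EXAMPLE_BANK = {
    "Motion to Compel": [
        "Plaintiff moves to compel production of bank statements.",
        "Defendant moves to compel answers to interrogatories.",
    ],
    "Motion to Dismiss": [
        "Defendant moves to dismiss for lack of jurisdiction.",
    ],
}

print(build_few_shot_prompt(EXAMPLE_BANK, "Plaintiff moves to compel a deposition."))
```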
Delivery Patterns for Zero-Shot and Few-Shot Systems
Pattern 1: The Zero-Shot Prototype to Few-Shot Production Pipeline
This is the most common delivery pattern for agency work.
Week 1: Zero-shot prototype.
- Set up the base model (typically a large language model like GPT-4, Claude, or an open-source alternative)
- Define the label taxonomy with the client
- Write descriptive label definitions
- Run zero-shot classification on a sample of the client's data
- Measure accuracy and identify weak categories
Week 2: Few-shot enhancement.
- For categories where zero-shot underperforms, collect 10-20 examples from the client
- Implement few-shot prompting or embedding-based classification
- Measure the improvement
- Identify categories that still underperform
Week 3: Hybrid system.
- Use few-shot for categories with sufficient examples
- Use zero-shot for categories without examples
- Add confidence thresholds: route low-confidence predictions to human review
- Implement the feedback loop: human-reviewed predictions become new few-shot examples
Week 4-5: Production deployment.
- Build the serving infrastructure
- Implement monitoring (accuracy tracking, confidence distribution, category distribution)
- Set up the human-in-the-loop review interface
- Deploy and validate
Total delivery time: 5 weeks. Compare that to 5 months for a traditional supervised learning approach.
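The confidence threshold and feedback loop from Week 3 might look like the sketch below. The class name, the 0.75 default threshold, and the in-memory queue are all assumptions for illustration; a production system would persist both the queue and the example bank.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReviewLoop:
    threshold: float = 0.75  # assumed cutoff; tune it on the client's data
    example_bank: Dict[str, List[str]] = field(default_factory=dict)
    review_queue: List[str] = field(default_factory=list)

    def route(self, text: str, label: str, confidence: float) -> str:
        """Accept confident predictions; queue the rest for a human."""
        if confidence >= self.threshold:
            return label
        self.review_queue.append(text)
        return "NEEDS_REVIEW"

    def record_review(self, text: str, human_label: str) -> None:
        """Store a human-reviewed document as a new few-shot example."""
        if text in self.review_queue:
            self.review_queue.remove(text)
        self.example_bank.setdefault(human_label, []).append(text)


loop = ReviewLoop(threshold=0.8)
print(loop.route("Routine brief text", "Brief", 0.93))
print(loop.route("Ambiguous filing text", "Brief", 0.41))
loop.record_review("Ambiguous filing text", "Deposition")
print(loop.example_bank)
```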
Pattern 2: The Embedding-Based Few-Shot System
Best for: classification tasks with many categories and moderate accuracy requirements.
- Encode all few-shot examples using a sentence transformer or similar embedding model
- Store example embeddings in a vector database (Pinecone, Weaviate, Qdrant)
- For new inputs, encode the input and find the nearest example embeddings
- Classify based on the majority category of the K nearest examples
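The four steps above can be sketched end to end with the standard library. Two stand-ins are assumed here: `embed` is a toy bag-of-words vector in place of a real sentence-embedding model, and a plain Python list replaces the vector database. Only the nearest-neighbor voting logic is meant to carry over.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; swap in a real embedding model in practice."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(
    text: str, examples: List[Tuple[str, str]], k: int = 3
) -> Tuple[str, List[str]]:
    """Return the majority label among the k nearest examples, plus those examples."""
    query = embed(text)
    ranked = sorted(examples, key=lambda ex: cosine(query, embed(ex[0])), reverse=True)
    nearest = ranked[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], [t for t, _ in nearest]


# Made-up (text, label) pairs standing in for stored few-shot examples.
EXAMPLES = [
    ("motion to compel production of financial records", "Motion to Compel"),
    ("motion to compel responses to interrogatories", "Motion to Compel"),
    ("motion to dismiss for lack of jurisdiction", "Motion to Dismiss"),
    ("motion to dismiss for failure to state a claim", "Motion to Dismiss"),
]

label, neighbors = knn_classify(
    "plaintiff moves to compel production of records", EXAMPLES, k=3
)
print(label)
print(neighbors)
```

Returning the neighbors alongside the label is what makes the prediction explainable: the reviewer can see exactly which stored examples drove the vote.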
Advantages:
- Fast inference (embedding + vector search is much cheaper than LLM inference)
- Easy to add new categories (just add examples, no retraining)
- Scales to thousands of categories
- Interpretable: you can show the user which examples the prediction is based on
Disadvantages:
- Lower accuracy than LLM-based approaches for complex tasks
- Requires good embeddings for the domain
- Sensitive to the quality and representativeness of examples
Pattern 3: The Progressive Refinement Pipeline
Best for: clients who will accumulate labeled data over time.
Phase 1 (Week 1-3): Zero-shot deployment.
- Deploy a zero-shot system for immediate value
- Every prediction is logged with its confidence score
- Low-confidence predictions are routed to human review
Phase 2 (Month 1-3): Few-shot enhancement.
- As human-reviewed predictions accumulate, use them as few-shot examples
- Accuracy improves automatically as the example set grows
- Monitor which categories benefit most from additional examples
Phase 3 (Month 3-6): Fine-tuned model.
- Once sufficient labeled data has accumulated (typically 500+ examples per category), fine-tune a domain-specific model
- The fine-tuned model replaces the few-shot system for high-volume categories
- Few-shot continues for low-volume and new categories
Phase 4 (Ongoing): Hybrid system.
- Fine-tuned models handle the common, well-represented categories
- Few-shot handles uncommon and new categories
- Zero-shot handles never-before-seen categories
- The system automatically promotes categories from zero-shot to few-shot to fine-tuned as data accumulates
This pattern is excellent for agency recurring revenue. Each phase is a separate engagement, and the client sees continuous improvement that justifies ongoing investment.
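The automatic promotion in Phase 4 reduces to a threshold check on the number of labeled examples per category. In this sketch the 500-example fine-tuning threshold comes from Phase 3 above, while the 5-example few-shot minimum is an assumption; both function names are hypothetical.

```python
from typing import Dict

FEW_SHOT_MIN = 5     # assumed minimum before few-shot is worth using
FINE_TUNE_MIN = 500  # from Phase 3: 500+ examples per category


def choose_tier(n_examples: int) -> str:
    """Pick the serving approach for one category from its labeled-example count."""
    if n_examples >= FINE_TUNE_MIN:
        return "fine-tuned"
    if n_examples >= FEW_SHOT_MIN:
        return "few-shot"
    return "zero-shot"


def plan_tiers(counts: Dict[str, int]) -> Dict[str, str]:
    """Map every category to a tier; rerun as reviewed examples accumulate."""
    return {label: choose_tier(n) for label, n in counts.items()}


# Made-up counts for illustration.
print(plan_tiers({"Contract": 1200, "Deposition": 40, "Amicus Brief": 0}))
```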
Practical Implementation Considerations
Prompt Engineering for Zero-Shot and Few-Shot
Label descriptions matter enormously for zero-shot. "Motion to Compel" is okay. "A legal motion filed by one party to force the opposing party to comply with a discovery request or court order" is much better. Invest time in writing precise, descriptive label definitions.
Example selection matters enormously for few-shot. The examples you choose shape the model's understanding of each category. Select examples that:
- Represent the diversity within the category (short and long, formal and informal, clear-cut and borderline)
- Are clearly within the category (avoid borderline examples that could belong to multiple categories)
- Cover the most common variations
Example ordering can affect results. Language models can weight the last few examples in a prompt more heavily than earlier ones (recency bias). Randomize example order across predictions to reduce this effect.
Prompt formatting affects accuracy. Structured prompts with clear delimiters between examples and consistent formatting outperform unstructured prompts. Use explicit separators, numbered examples, and consistent label formatting.
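One way to combine randomized ordering with consistent delimiters is sketched below. The separator strings and the function name are arbitrary choices for illustration; what matters is that every example is rendered the same way and the order changes between predictions.

```python
import random
from typing import List, Tuple


def format_examples(examples: List[Tuple[str, str]], rng: random.Random) -> str:
    """Shuffle (text, label) pairs and render them with explicit delimiters."""
    shuffled = examples[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    blocks = [
        f"### Example {i}\nText: {text}\nLabel: {label}"
        for i, (text, label) in enumerate(shuffled, 1)
    ]
    return "\n---\n".join(blocks)


# Made-up examples; a fresh Random() per prediction gives a new order each time.
pairs = [
    ("Plaintiff moves to compel production.", "Motion to Compel"),
    ("Defendant moves to dismiss the complaint.", "Motion to Dismiss"),
    ("Transcript of witness testimony.", "Deposition"),
]
print(format_examples(pairs, random.Random()))
```

Passing the random generator in as a parameter lets you seed it in tests for reproducible prompts while leaving it unseeded in production.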
Cost Management
LLM-based zero-shot and few-shot systems incur per-prediction API costs. For high-volume applications, these costs add up.
Cost optimization strategies:
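- Caching. If the same inputs appear frequently, cache the predictions. A simple hash-based cache catches exact repeats (after normalizing case and whitespace); near-duplicates need a similarity lookup instead. On workloads with many repeated inputs, caching can reduce API calls by 30-60%.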
- Tiered classification. Use a cheap, fast model (embedding-based) as a first pass. Only route uncertain predictions to the expensive LLM.
- Batch processing. For non-real-time applications, batch predictions and send them in bulk. Many API providers offer lower per-token costs for batch requests.
- Model distillation. Use the LLM's predictions to train a smaller, cheaper model. Once the smaller model reaches acceptable accuracy, switch to it for production inference.
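The caching and tiered-classification strategies above compose naturally. In this sketch, `cheap` and `expensive` are hypothetical callables that return a `(label, confidence)` pair (for example, an embedding classifier and an LLM wrapper), and the 0.8 threshold is an assumption to be tuned.

```python
import hashlib
from typing import Callable, Dict, Tuple

Classifier = Callable[[str], Tuple[str, float]]  # text -> (label, confidence)


def cache_key(text: str) -> str:
    """Hash case- and whitespace-normalized text so trivial variants share a key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class TieredClassifier:
    def __init__(self, cheap: Classifier, expensive: Classifier, threshold: float = 0.8):
        self.cheap = cheap
        self.expensive = expensive
        self.threshold = threshold
        self.cache: Dict[str, str] = {}
        self.expensive_calls = 0  # track how often the costly model is used

    def classify(self, text: str) -> str:
        key = cache_key(text)
        if key in self.cache:
            return self.cache[key]  # cache hit: no model call at all
        label, confidence = self.cheap(text)
        if confidence < self.threshold:
            # Only uncertain inputs pay for the expensive model.
            label, _ = self.expensive(text)
            self.expensive_calls += 1
        self.cache[key] = label
        return label


def cheap_stub(text: str) -> Tuple[str, float]:
    # Stand-in for an embedding-based classifier.
    return ("Contract", 0.95) if "agreement" in text.lower() else ("Contract", 0.30)


def expensive_stub(text: str) -> Tuple[str, float]:
    # Stand-in for an LLM API call.
    return ("Motion to Compel", 0.90)


clf = TieredClassifier(cheap_stub, expensive_stub)
print(clf.classify("This Agreement is made between the parties."))
print(clf.classify("Motion requesting production of records."))
print(clf.expensive_calls)
```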
Monitoring and Evaluation
Track accuracy by category. Overall accuracy can mask poor performance on specific categories. Monitor per-category precision and recall.
Track confidence distributions. If the model's confidence scores are drifting downward, it may be encountering inputs that are increasingly different from what it expects.
Track human override rates. If human reviewers frequently override the model's predictions for a specific category, the model is not performing well on that category. Use the overrides as new examples to improve.
Run periodic accuracy audits. Sample predictions, have humans label them independently, and compare. This catches gradual degradation that automated metrics might miss.
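Per-category precision and recall can be computed directly from audit samples. The sketch below assumes the audit produces `(predicted, actual)` label pairs, where "actual" is the independent human label; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_category_metrics(pairs: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Compute precision and recall per label from (predicted, actual) pairs."""
    true_pos, predicted, actual = Counter(), Counter(), Counter()
    for pred, truth in pairs:
        predicted[pred] += 1
        actual[truth] += 1
        if pred == truth:
            true_pos[pred] += 1
    metrics = {}
    for label in set(predicted) | set(actual):
        metrics[label] = {
            # Precision: of everything predicted as this label, how much was right.
            "precision": true_pos[label] / predicted[label] if predicted[label] else 0.0,
            # Recall: of everything truly this label, how much was found.
            "recall": true_pos[label] / actual[label] if actual[label] else 0.0,
        }
    return metrics


# Made-up audit sample for illustration.
audit = [
    ("Contract", "Contract"),
    ("Contract", "Brief"),
    ("Brief", "Brief"),
    ("Brief", "Brief"),
]
for label, scores in sorted(per_category_metrics(audit).items()):
    print(label, scores)
```

When the "actual" labels come from reviewer corrections, one minus a category's precision is the override rate for that category's predictions.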
Pricing Zero-Shot and Few-Shot Projects
Do not price these projects based on the time they take to build. A zero-shot system that takes two weeks to deploy solves the same business problem as a traditional system that takes five months. Price on value, not effort.
Typical pricing:
- Zero-shot prototype (1-2 weeks): $15,000 - $30,000
- Few-shot production system (3-5 weeks): $40,000 - $80,000
- Progressive refinement (ongoing): $5,000 - $10,000 per month
- Monitoring and optimization: $3,000 - $6,000 per month
The ongoing costs include API fees. Either pass these through to the client at cost plus a management fee, or include them in the monthly retainer with a volume cap.
Value framing: "This system categorizes 10,000 documents per month that would otherwise require 2 full-time paralegals at $65,000 each, or $130,000 per year. The system costs $8,000 per month including API fees and maintenance, or $96,000 per year. That is roughly a 26% cost reduction on salary alone, with faster turnaround."
Your Next Step
Identify one client project where labeled data is the bottleneck: either you do not have enough labeled data to train a traditional model, or the time to label data is delaying the project. Build a zero-shot prototype using an LLM API. Evaluate it on whatever labeled data you do have. If it hits 70% or better accuracy, you have a viable starting point that can be deployed in weeks and improved over time. If it falls short of that, try adding 10-20 examples per category for the underperforming classes. The improvement from zero-shot to few-shot is often dramatic enough to cross the viability threshold.