A legal AI agency in San Francisco was hired by a mid-size law firm to build a contract analysis system that could review non-disclosure agreements, identify non-standard clauses, and flag potential risks. Their initial approach used GPT-4 with carefully engineered prompts and few-shot examples. It worked well โ 89% accuracy on clause identification โ but the per-query cost was $0.12 and the latency was 4-8 seconds per contract, making it impractical for the firm's volume of 2,400 contracts per month. The agency fine-tuned a Llama 3 8B model on 6,200 annotated contracts from the firm's historical files. The fine-tuned model achieved 93% accuracy โ outperforming GPT-4 on this specific task โ with a per-query cost of $0.006 and latency under 800 milliseconds. The model ran on a single A10G GPU that cost $0.30 per hour. The firm saved an estimated $280,000 annually in analysis costs while getting faster and more accurate results.
Fine-tuning large language models takes a general-purpose model and adapts it to excel at a specific task or domain by training it on domain-specific data. For AI agencies, fine-tuning is the technique that bridges the gap between generic LLM capabilities and production-grade domain performance โ delivering models that are more accurate, faster, cheaper, and more controllable than prompting general-purpose APIs for specialized tasks.
When to Fine-Tune vs. When to Prompt
The Decision Framework
Fine-tuning is not always the right choice. Sometimes prompt engineering with a large model is sufficient. The decision depends on five factors.
Fine-tune when:
- The task requires consistent, structured output in a specific format
- Domain-specific terminology or knowledge significantly affects accuracy
- Per-query cost matters because volume is high (thousands of queries per day)
- Latency matters because users or systems are waiting for responses
- Data privacy requires running the model on your own infrastructure
- The task is well-defined and the model needs to perform the same type of analysis repeatedly
Stick with prompting when:
- The task varies significantly from query to query (highly creative or open-ended)
- You have fewer than 500 training examples
- The domain changes rapidly and retraining would be needed frequently
- The development timeline is very short (days, not weeks)
- The task requires the model's broadest general knowledge
Hybrid approach:
- Fine-tune a smaller model for the core task (structured extraction, classification, domain-specific generation)
- Use a larger prompted model for edge cases, quality assurance, or tasks that require broader reasoning
- Route queries to the appropriate model based on complexity or confidence
Cost-Benefit Analysis
Fine-tuning costs:
- Training data preparation and annotation: $5,000-50,000 depending on volume and complexity
- Compute for training: $100-5,000 per training run (depends on model size and dataset size)
- Infrastructure for serving: $500-5,000 per month (depends on throughput requirements)
- Ongoing maintenance and retraining: $2,000-10,000 per quarter
Fine-tuning savings (compared to prompting a frontier model):
- Per-query cost reduction: Typically 10-50x cheaper than GPT-4/Claude for the same task
- Latency reduction: 2-10x faster response times
- Quality improvement: 3-15% accuracy improvement on domain-specific tasks
- Control improvement: Consistent output format, predictable behavior, no API dependency
Break-even point: Fine-tuning typically pays for itself within 1-3 months for applications processing more than 1,000 queries per day.
Choosing a Base Model
Model Size Selection
7-8B parameter models (Llama 3 8B, Mistral 7B, Gemma 7B):
- Run on a single A10G or A100 GPU
- Fine-tune with 16-24GB VRAM using LoRA/QLoRA
- Excellent for focused, well-defined tasks (classification, extraction, structured generation)
- Inference: 50-200 tokens per second on A10G
- Best for: Most agency fine-tuning projects
13-14B parameter models (Llama 3.1 13B variants):
- Run on a single A100 80GB GPU
- Stronger reasoning and generation quality than 7B models
- Fine-tune with 40-80GB VRAM
- Inference: 30-100 tokens per second on A100
- Best for: Tasks requiring stronger reasoning or longer generation
70B parameter models (Llama 3.1 70B):
- Require multi-GPU serving (2-4 A100 GPUs)
- Closest to frontier model quality in open-source
- Fine-tuning requires 4-8 GPUs
- Inference: 10-40 tokens per second on 4x A100
- Best for: Tasks where quality is paramount and infrastructure cost is justified
Recommendation for most agency projects: Start with a 7-8B model. If it does not meet accuracy targets after thorough fine-tuning, scale up to 13B. Only move to 70B if the 13B model is still insufficient โ the infrastructure and cost differences are significant.
Base Model Selection
Llama 3 / Llama 3.1 (Meta): The default choice for fine-tuning. Strong base quality, permissive license (Meta's community license), extensive fine-tuning documentation and tooling.
Mistral 7B / Mixtral 8x7B (Mistral AI): Competitive quality with Llama 3, excellent for multilingual applications. Mixtral provides mixture-of-experts architecture for better quality at similar inference cost.
Gemma 2 (Google): Strong quality, good for applications in the Google ecosystem. More restrictive license than Llama.
Qwen 2.5 (Alibaba): Excellent multilingual capabilities, particularly for Asian languages. Strong code understanding.
Phi-3 (Microsoft): Small models (3.8B) with surprisingly strong capabilities. Best choice when inference cost is the primary constraint.
Training Data Preparation
Data Format
Fine-tuning data is typically formatted as input-output pairs that demonstrate the desired model behavior.
Instruction tuning format:
Each training example consists of:
- An instruction describing the task
- An input providing the specific data or context
- An output showing the desired response
Conversation format:
For models that need to engage in multi-turn interactions:
- A sequence of user messages and assistant responses
- The model learns to generate the assistant responses given the conversation history
Quality over quantity:
1,000 high-quality, diverse, accurately labeled examples typically produce better results than 10,000 noisy examples. Invest in data quality.
Data Collection Strategies
Expert annotation:
- Have domain experts create training examples that demonstrate ideal model behavior
- For each example, include not just the correct output but also the reasoning that produced it
- Target 500-2,000 expert-created examples for initial fine-tuning
Historical data mining:
- Extract input-output pairs from the client's historical workflows (analyst reports, document reviews, support ticket resolutions)
- Clean and standardize the format
- Have experts validate a sample to ensure quality
LLM-assisted data generation:
- Use a frontier model (GPT-4, Claude) to generate draft training examples
- Have human experts review and correct each example
- This approach is 3-5x faster than creating examples from scratch
- Verify that the generated examples are diverse and cover edge cases
Active learning for data selection:
- Train an initial model on a small dataset
- Use the model to process unlabeled data
- Select examples where the model is most uncertain or incorrect
- Have experts label these examples
- Retrain with the expanded dataset
Data Quality Assurance
Consistency checks:
- Review all training examples for consistency โ the same input should always produce the same (or equivalent) output
- Remove duplicates and near-duplicates
- Check for contradictions between examples
Edge case coverage:
- Ensure the training data includes edge cases the model will encounter in production
- Include examples of inputs where the correct behavior is to say "I don't know" or "This input is outside my scope"
- Include examples of malformed or ambiguous inputs with correct handling
Data decontamination:
- Ensure no test set examples appear in the training data
- If using LLM-generated training data, verify that the examples are not memorized from the LLM's training data
Fine-Tuning Techniques
LoRA (Low-Rank Adaptation)
LoRA is the standard fine-tuning technique for production LLM projects. Instead of updating all model parameters, LoRA adds small trainable matrices to specific layers, dramatically reducing the number of parameters trained and the memory required.
LoRA configuration:
- Rank (r): The rank of the low-rank matrices. Higher rank captures more complex adaptations but uses more memory. Start with r=16, increase to r=32 or r=64 if accuracy is insufficient.
- Alpha: Scaling factor for the LoRA updates. Common setting: alpha = 2 * r.
- Target modules: Which layers to apply LoRA to. For most LLMs, apply to the attention query, key, value, and output projection layers. Adding the MLP layers can help for more complex adaptations.
- Dropout: Apply dropout (0.05-0.1) to LoRA layers for regularization.
LoRA advantages:
- Fine-tunes with 10-100x less GPU memory than full fine-tuning
- Training is 5-10x faster than full fine-tuning
- LoRA adapters can be merged with the base model for zero-overhead inference
- Multiple LoRA adapters can be trained for different tasks and swapped at serving time
QLoRA (Quantized LoRA)
QLoRA combines LoRA with 4-bit quantization of the base model, enabling fine-tuning of large models on consumer-grade GPUs.
When to use QLoRA:
- You need to fine-tune on hardware with limited VRAM (24GB or less)
- The base model is too large to fit in memory for standard LoRA
- You are willing to accept a slight accuracy reduction (typically 0.5-2%) for significantly reduced memory requirements
QLoRA configuration:
- Quantize the base model to 4-bit NormalFloat (NF4) precision
- Apply LoRA adapters on top of the quantized model
- Use double quantization to further reduce memory usage
- Compute in BF16 for numerical stability
Training Configuration
Hyperparameters for LLM fine-tuning:
- Learning rate: 1e-4 to 3e-4 for LoRA fine-tuning. Lower than pre-training because we want to adapt, not overwrite.
- Batch size: 4-16 depending on GPU memory. Use gradient accumulation to simulate larger batch sizes.
- Epochs: 2-5 for most fine-tuning tasks. More epochs risk overfitting to the training data. Monitor validation loss closely.
- Warmup: 3-10% of total training steps
- Weight decay: 0.01
- Max sequence length: Set to the longest expected input + output. Pad shorter sequences. Truncate sequences that exceed the model's context window.
- Scheduler: Cosine annealing with warmup
Overfitting prevention:
- Monitor validation loss after each epoch โ stop training when validation loss starts increasing
- Use early stopping with patience of 1-2 epochs
- Apply LoRA dropout (0.05-0.1)
- Ensure training data diversity โ no more than 3-5 examples of any single pattern
- If overfitting persists, reduce LoRA rank or collect more diverse training data
Training Frameworks
Hugging Face TRL (Transformer Reinforcement Learning): The most popular framework for LLM fine-tuning. Supports SFT (supervised fine-tuning), DPO (direct preference optimization), and RLHF. Integrates with the Hugging Face ecosystem.
Axolotl: Configuration-driven fine-tuning framework that simplifies the training setup. Good for teams that want to fine-tune without writing custom training code.
LLaMA-Factory: Comprehensive fine-tuning framework supporting multiple training methods (full fine-tuning, LoRA, QLoRA, RLHF, DPO) with a web UI for configuration.
Unsloth: Optimized fine-tuning library that provides 2x faster training and 60% less memory usage through custom CUDA kernels. Excellent for resource-constrained environments.
Evaluation
Task-Specific Evaluation
Design evaluation metrics specific to the fine-tuned model's task.
For classification tasks:
- Accuracy, precision, recall, F1 per class
- Confusion matrix
- Compare to the base model and to the prompted frontier model
For extraction tasks:
- Exact match rate (extracted value matches ground truth exactly)
- Partial match rate (extracted value overlaps with ground truth)
- Per-field accuracy
For generation tasks:
- Human evaluation on a 5-point scale (accuracy, relevance, completeness, format compliance)
- Automated metrics (ROUGE, BERTScore) as development proxies
- A/B comparison against the base model and the prompted frontier model
Regression Testing
After fine-tuning, verify that the model has not lost important general capabilities.
Regression tests:
- Run the fine-tuned model on a set of general-knowledge questions and verify it still provides reasonable answers
- Test on edge cases outside the training data distribution and verify the model does not hallucinate or produce nonsensical output
- Compare the fine-tuned model's general capabilities to the base model using a standard benchmark
Human Evaluation Protocol
For production deployment decisions, human evaluation is essential.
Evaluation protocol:
- Select 100-200 test examples not included in training
- Generate outputs from the fine-tuned model, the base model, and the prompted frontier model
- Present outputs to domain experts without revealing which model produced each output
- Have experts rate each output on accuracy, completeness, and format compliance
- Compute win rates: how often does the fine-tuned model produce the best output?
Deployment criteria:
- Fine-tuned model must achieve win rate above 60% against the base model
- Fine-tuned model must achieve win rate above 40% against the prompted frontier model (acceptable if the cost and latency advantages justify the quality difference)
- No critical failures (factually incorrect outputs on high-stakes inputs)
Production Deployment
Serving Infrastructure
vLLM: The standard serving framework for production LLM inference. Provides continuous batching, PagedAttention for efficient memory management, and high throughput. Supports LoRA adapter loading and switching.
Text Generation Inference (TGI): Hugging Face's serving solution. Good quality, strong community support, integrates with the HF ecosystem.
TensorRT-LLM: NVIDIA's optimized inference engine. Provides the highest throughput on NVIDIA GPUs through aggressive kernel optimization and quantization.
Deployment patterns:
- Single-model serving: Deploy the fine-tuned model on dedicated GPU instances. Simplest setup, suitable for single-client deployments.
- Multi-LoRA serving: Deploy the base model once and load different LoRA adapters per request. Efficient for agencies serving multiple clients with different fine-tuned models on shared infrastructure.
- Autoscaling: Scale GPU instances based on request rate. Use minimum instances for baseline traffic and scale up for peak loads.
Model Versioning and Rollback
- Store each fine-tuned model version (LoRA adapter + base model reference) in the model registry
- Deploy new versions behind a canary release (5-10% of traffic initially)
- Monitor quality metrics during canary phase
- Roll back if quality metrics degrade
- Keep the previous version loaded and ready for instant rollback
Monitoring
Quality monitoring:
- Log all inputs and outputs for quality review
- Sample 2-5% of production outputs for human evaluation
- Track the distribution of output lengths, formats, and confidence indicators
- Monitor for hallucination signals (outputs that are inconsistent with the input)
Performance monitoring:
- Inference latency (p50, p95, p99)
- Throughput (tokens per second, requests per second)
- GPU utilization and memory usage
- Request queue depth (indicates capacity issues)
Ongoing Maintenance
Retraining Schedule
- Monthly: Evaluate model performance on new test data. If accuracy has degraded, investigate and retrain.
- Quarterly: Collect new training data from production (human-reviewed outputs, new edge cases) and retrain.
- On-demand: Retrain when the client's domain changes (new document types, new terminology, new requirements).
Training Data Growth
Build a feedback loop that continuously improves the training dataset.
- Capture human corrections to model outputs as new training examples
- Mine production outputs for edge cases and errors
- Periodically retrain on the expanded dataset
- Track training data size and model accuracy over time โ accuracy should improve with each retraining cycle
Your Next Step
Take the task you are considering for fine-tuning. Create 50 high-quality input-output examples that demonstrate exactly the model behavior you want. Split them 40/10 (training/evaluation). Fine-tune a 7B model (Llama 3 8B via QLoRA) on the 40 examples. Evaluate on the 10 held-out examples and compare to the base model and to GPT-4 with a detailed prompt. This experiment takes half a day and answers the most important question: does fine-tuning provide meaningful improvement for this specific task? If 50 examples produce noticeable improvement, 500-2,000 examples will produce substantial improvement. If 50 examples show no improvement, the task may not be a good fit for fine-tuning, or the examples may not be demonstrating the right behavior. Either way, you have learned something essential before committing to a full fine-tuning project.