AGENCYSCRIPT
CoursesEnterpriseBlog
๐Ÿ‘‘FoundersSign inJoin Waitlist
AGENCYSCRIPT

Governed Certification Framework

The operating system for AI-enabled agency building. Certify judgment under constraint. Standards over scale. Governance over shortcuts.

Stay informed

Governance updates, certification insights, and industry standards.

Products

  • Platform
  • Certification
  • Launch Program
  • Vault
  • The Book

Certification

  • Foundation (AS-F)
  • Operator (AS-O)
  • Architect (AS-A)
  • Principal (AS-P)

Resources

  • Blog
  • Verify Credential
  • Enterprise
  • Partners
  • Pricing

Company

  • About
  • Contact
  • Careers
  • Press
ยฉ 2026 Agency Script, Inc.ยท
Privacy PolicyTerms of ServiceCertification AgreementSecurity

Standards over scale. Judgment over volume. Governance over shortcuts.

On This Page

When to Fine-Tune vs. When to PromptThe Decision FrameworkCost-Benefit AnalysisChoosing a Base ModelModel Size SelectionBase Model SelectionTraining Data PreparationData FormatData Collection StrategiesData Quality AssuranceFine-Tuning TechniquesLoRA (Low-Rank Adaptation)QLoRA (Quantized LoRA)Training ConfigurationTraining FrameworksEvaluationTask-Specific EvaluationRegression TestingHuman Evaluation ProtocolProduction DeploymentServing InfrastructureModel Versioning and RollbackMonitoringOngoing MaintenanceRetraining ScheduleTraining Data GrowthYour Next Step
Home/Blog/Fine-Tuning LLMs for Enterprise Use Cases โ€” From Base Models to Domain-Specific Production Systems
Delivery

Fine-Tuning LLMs for Enterprise Use Cases โ€” From Base Models to Domain-Specific Production Systems

A

Agency Script Editorial

Editorial Team

ยทMarch 20, 2026ยท12 min read
llm fine-tuninglarge language modelsdomain adaptationenterprise ai

A legal AI agency in San Francisco was hired by a mid-size law firm to build a contract analysis system that could review non-disclosure agreements, identify non-standard clauses, and flag potential risks. Their initial approach used GPT-4 with carefully engineered prompts and few-shot examples. It worked well โ€” 89% accuracy on clause identification โ€” but the per-query cost was $0.12 and the latency was 4-8 seconds per contract, making it impractical for the firm's volume of 2,400 contracts per month. The agency fine-tuned a Llama 3 8B model on 6,200 annotated contracts from the firm's historical files. The fine-tuned model achieved 93% accuracy โ€” outperforming GPT-4 on this specific task โ€” with a per-query cost of $0.006 and latency under 800 milliseconds. The model ran on a single A10G GPU that cost $0.30 per hour. The firm saved an estimated $280,000 annually in analysis costs while getting faster and more accurate results.

Fine-tuning large language models takes a general-purpose model and adapts it to excel at a specific task or domain by training it on domain-specific data. For AI agencies, fine-tuning is the technique that bridges the gap between generic LLM capabilities and production-grade domain performance โ€” delivering models that are more accurate, faster, cheaper, and more controllable than prompting general-purpose APIs for specialized tasks.

When to Fine-Tune vs. When to Prompt

The Decision Framework

Fine-tuning is not always the right choice. Sometimes prompt engineering with a large model is sufficient. The decision depends on five factors.

Fine-tune when:

  • The task requires consistent, structured output in a specific format
  • Domain-specific terminology or knowledge significantly affects accuracy
  • Per-query cost matters because volume is high (thousands of queries per day)
  • Latency matters because users or systems are waiting for responses
  • Data privacy requires running the model on your own infrastructure
  • The task is well-defined and the model needs to perform the same type of analysis repeatedly

Stick with prompting when:

  • The task varies significantly from query to query (highly creative or open-ended)
  • You have fewer than 500 training examples
  • The domain changes rapidly and retraining would be needed frequently
  • The development timeline is very short (days, not weeks)
  • The task requires the model's broadest general knowledge

Hybrid approach:

  • Fine-tune a smaller model for the core task (structured extraction, classification, domain-specific generation)
  • Use a larger prompted model for edge cases, quality assurance, or tasks that require broader reasoning
  • Route queries to the appropriate model based on complexity or confidence

Cost-Benefit Analysis

Fine-tuning costs:

  • Training data preparation and annotation: $5,000-50,000 depending on volume and complexity
  • Compute for training: $100-5,000 per training run (depends on model size and dataset size)
  • Infrastructure for serving: $500-5,000 per month (depends on throughput requirements)
  • Ongoing maintenance and retraining: $2,000-10,000 per quarter

Fine-tuning savings (compared to prompting a frontier model):

  • Per-query cost reduction: Typically 10-50x cheaper than GPT-4/Claude for the same task
  • Latency reduction: 2-10x faster response times
  • Quality improvement: 3-15% accuracy improvement on domain-specific tasks
  • Control improvement: Consistent output format, predictable behavior, no API dependency

Break-even point: Fine-tuning typically pays for itself within 1-3 months for applications processing more than 1,000 queries per day.

Choosing a Base Model

Model Size Selection

7-8B parameter models (Llama 3 8B, Mistral 7B, Gemma 7B):

  • Run on a single A10G or A100 GPU
  • Fine-tune with 16-24GB VRAM using LoRA/QLoRA
  • Excellent for focused, well-defined tasks (classification, extraction, structured generation)
  • Inference: 50-200 tokens per second on A10G
  • Best for: Most agency fine-tuning projects

13-14B parameter models (Llama 3.1 13B variants):

  • Run on a single A100 80GB GPU
  • Stronger reasoning and generation quality than 7B models
  • Fine-tune with 40-80GB VRAM
  • Inference: 30-100 tokens per second on A100
  • Best for: Tasks requiring stronger reasoning or longer generation

70B parameter models (Llama 3.1 70B):

  • Require multi-GPU serving (2-4 A100 GPUs)
  • Closest to frontier model quality in open-source
  • Fine-tuning requires 4-8 GPUs
  • Inference: 10-40 tokens per second on 4x A100
  • Best for: Tasks where quality is paramount and infrastructure cost is justified

Recommendation for most agency projects: Start with a 7-8B model. If it does not meet accuracy targets after thorough fine-tuning, scale up to 13B. Only move to 70B if the 13B model is still insufficient โ€” the infrastructure and cost differences are significant.

Base Model Selection

Llama 3 / Llama 3.1 (Meta): The default choice for fine-tuning. Strong base quality, permissive license (Meta's community license), extensive fine-tuning documentation and tooling.

Mistral 7B / Mixtral 8x7B (Mistral AI): Competitive quality with Llama 3, excellent for multilingual applications. Mixtral provides mixture-of-experts architecture for better quality at similar inference cost.

Gemma 2 (Google): Strong quality, good for applications in the Google ecosystem. More restrictive license than Llama.

Qwen 2.5 (Alibaba): Excellent multilingual capabilities, particularly for Asian languages. Strong code understanding.

Phi-3 (Microsoft): Small models (3.8B) with surprisingly strong capabilities. Best choice when inference cost is the primary constraint.

Training Data Preparation

Data Format

Fine-tuning data is typically formatted as input-output pairs that demonstrate the desired model behavior.

Instruction tuning format:

Each training example consists of:

  • An instruction describing the task
  • An input providing the specific data or context
  • An output showing the desired response

Conversation format:

For models that need to engage in multi-turn interactions:

  • A sequence of user messages and assistant responses
  • The model learns to generate the assistant responses given the conversation history

Quality over quantity:

1,000 high-quality, diverse, accurately labeled examples typically produce better results than 10,000 noisy examples. Invest in data quality.

Data Collection Strategies

Expert annotation:

  • Have domain experts create training examples that demonstrate ideal model behavior
  • For each example, include not just the correct output but also the reasoning that produced it
  • Target 500-2,000 expert-created examples for initial fine-tuning

Historical data mining:

  • Extract input-output pairs from the client's historical workflows (analyst reports, document reviews, support ticket resolutions)
  • Clean and standardize the format
  • Have experts validate a sample to ensure quality

LLM-assisted data generation:

  • Use a frontier model (GPT-4, Claude) to generate draft training examples
  • Have human experts review and correct each example
  • This approach is 3-5x faster than creating examples from scratch
  • Verify that the generated examples are diverse and cover edge cases

Active learning for data selection:

  • Train an initial model on a small dataset
  • Use the model to process unlabeled data
  • Select examples where the model is most uncertain or incorrect
  • Have experts label these examples
  • Retrain with the expanded dataset

Data Quality Assurance

Consistency checks:

  • Review all training examples for consistency โ€” the same input should always produce the same (or equivalent) output
  • Remove duplicates and near-duplicates
  • Check for contradictions between examples

Edge case coverage:

  • Ensure the training data includes edge cases the model will encounter in production
  • Include examples of inputs where the correct behavior is to say "I don't know" or "This input is outside my scope"
  • Include examples of malformed or ambiguous inputs with correct handling

Data decontamination:

  • Ensure no test set examples appear in the training data
  • If using LLM-generated training data, verify that the examples are not memorized from the LLM's training data

Fine-Tuning Techniques

LoRA (Low-Rank Adaptation)

LoRA is the standard fine-tuning technique for production LLM projects. Instead of updating all model parameters, LoRA adds small trainable matrices to specific layers, dramatically reducing the number of parameters trained and the memory required.

LoRA configuration:

  • Rank (r): The rank of the low-rank matrices. Higher rank captures more complex adaptations but uses more memory. Start with r=16, increase to r=32 or r=64 if accuracy is insufficient.
  • Alpha: Scaling factor for the LoRA updates. Common setting: alpha = 2 * r.
  • Target modules: Which layers to apply LoRA to. For most LLMs, apply to the attention query, key, value, and output projection layers. Adding the MLP layers can help for more complex adaptations.
  • Dropout: Apply dropout (0.05-0.1) to LoRA layers for regularization.

LoRA advantages:

  • Fine-tunes with 10-100x less GPU memory than full fine-tuning
  • Training is 5-10x faster than full fine-tuning
  • LoRA adapters can be merged with the base model for zero-overhead inference
  • Multiple LoRA adapters can be trained for different tasks and swapped at serving time

QLoRA (Quantized LoRA)

QLoRA combines LoRA with 4-bit quantization of the base model, enabling fine-tuning of large models on consumer-grade GPUs.

When to use QLoRA:

  • You need to fine-tune on hardware with limited VRAM (24GB or less)
  • The base model is too large to fit in memory for standard LoRA
  • You are willing to accept a slight accuracy reduction (typically 0.5-2%) for significantly reduced memory requirements

QLoRA configuration:

  • Quantize the base model to 4-bit NormalFloat (NF4) precision
  • Apply LoRA adapters on top of the quantized model
  • Use double quantization to further reduce memory usage
  • Compute in BF16 for numerical stability

Training Configuration

Hyperparameters for LLM fine-tuning:

  • Learning rate: 1e-4 to 3e-4 for LoRA fine-tuning. Lower than pre-training because we want to adapt, not overwrite.
  • Batch size: 4-16 depending on GPU memory. Use gradient accumulation to simulate larger batch sizes.
  • Epochs: 2-5 for most fine-tuning tasks. More epochs risk overfitting to the training data. Monitor validation loss closely.
  • Warmup: 3-10% of total training steps
  • Weight decay: 0.01
  • Max sequence length: Set to the longest expected input + output. Pad shorter sequences. Truncate sequences that exceed the model's context window.
  • Scheduler: Cosine annealing with warmup

Overfitting prevention:

  • Monitor validation loss after each epoch โ€” stop training when validation loss starts increasing
  • Use early stopping with patience of 1-2 epochs
  • Apply LoRA dropout (0.05-0.1)
  • Ensure training data diversity โ€” no more than 3-5 examples of any single pattern
  • If overfitting persists, reduce LoRA rank or collect more diverse training data

Training Frameworks

Hugging Face TRL (Transformer Reinforcement Learning): The most popular framework for LLM fine-tuning. Supports SFT (supervised fine-tuning), DPO (direct preference optimization), and RLHF. Integrates with the Hugging Face ecosystem.

Axolotl: Configuration-driven fine-tuning framework that simplifies the training setup. Good for teams that want to fine-tune without writing custom training code.

LLaMA-Factory: Comprehensive fine-tuning framework supporting multiple training methods (full fine-tuning, LoRA, QLoRA, RLHF, DPO) with a web UI for configuration.

Unsloth: Optimized fine-tuning library that provides 2x faster training and 60% less memory usage through custom CUDA kernels. Excellent for resource-constrained environments.

Evaluation

Task-Specific Evaluation

Design evaluation metrics specific to the fine-tuned model's task.

For classification tasks:

  • Accuracy, precision, recall, F1 per class
  • Confusion matrix
  • Compare to the base model and to the prompted frontier model

For extraction tasks:

  • Exact match rate (extracted value matches ground truth exactly)
  • Partial match rate (extracted value overlaps with ground truth)
  • Per-field accuracy

For generation tasks:

  • Human evaluation on a 5-point scale (accuracy, relevance, completeness, format compliance)
  • Automated metrics (ROUGE, BERTScore) as development proxies
  • A/B comparison against the base model and the prompted frontier model

Regression Testing

After fine-tuning, verify that the model has not lost important general capabilities.

Regression tests:

  • Run the fine-tuned model on a set of general-knowledge questions and verify it still provides reasonable answers
  • Test on edge cases outside the training data distribution and verify the model does not hallucinate or produce nonsensical output
  • Compare the fine-tuned model's general capabilities to the base model using a standard benchmark

Human Evaluation Protocol

For production deployment decisions, human evaluation is essential.

Evaluation protocol:

  1. Select 100-200 test examples not included in training
  2. Generate outputs from the fine-tuned model, the base model, and the prompted frontier model
  3. Present outputs to domain experts without revealing which model produced each output
  4. Have experts rate each output on accuracy, completeness, and format compliance
  5. Compute win rates: how often does the fine-tuned model produce the best output?

Deployment criteria:

  • Fine-tuned model must achieve win rate above 60% against the base model
  • Fine-tuned model must achieve win rate above 40% against the prompted frontier model (acceptable if the cost and latency advantages justify the quality difference)
  • No critical failures (factually incorrect outputs on high-stakes inputs)

Production Deployment

Serving Infrastructure

vLLM: The standard serving framework for production LLM inference. Provides continuous batching, PagedAttention for efficient memory management, and high throughput. Supports LoRA adapter loading and switching.

Text Generation Inference (TGI): Hugging Face's serving solution. Good quality, strong community support, integrates with the HF ecosystem.

TensorRT-LLM: NVIDIA's optimized inference engine. Provides the highest throughput on NVIDIA GPUs through aggressive kernel optimization and quantization.

Deployment patterns:

  • Single-model serving: Deploy the fine-tuned model on dedicated GPU instances. Simplest setup, suitable for single-client deployments.
  • Multi-LoRA serving: Deploy the base model once and load different LoRA adapters per request. Efficient for agencies serving multiple clients with different fine-tuned models on shared infrastructure.
  • Autoscaling: Scale GPU instances based on request rate. Use minimum instances for baseline traffic and scale up for peak loads.

Model Versioning and Rollback

  • Store each fine-tuned model version (LoRA adapter + base model reference) in the model registry
  • Deploy new versions behind a canary release (5-10% of traffic initially)
  • Monitor quality metrics during canary phase
  • Roll back if quality metrics degrade
  • Keep the previous version loaded and ready for instant rollback

Monitoring

Quality monitoring:

  • Log all inputs and outputs for quality review
  • Sample 2-5% of production outputs for human evaluation
  • Track the distribution of output lengths, formats, and confidence indicators
  • Monitor for hallucination signals (outputs that are inconsistent with the input)

Performance monitoring:

  • Inference latency (p50, p95, p99)
  • Throughput (tokens per second, requests per second)
  • GPU utilization and memory usage
  • Request queue depth (indicates capacity issues)

Ongoing Maintenance

Retraining Schedule

  • Monthly: Evaluate model performance on new test data. If accuracy has degraded, investigate and retrain.
  • Quarterly: Collect new training data from production (human-reviewed outputs, new edge cases) and retrain.
  • On-demand: Retrain when the client's domain changes (new document types, new terminology, new requirements).

Training Data Growth

Build a feedback loop that continuously improves the training dataset.

  • Capture human corrections to model outputs as new training examples
  • Mine production outputs for edge cases and errors
  • Periodically retrain on the expanded dataset
  • Track training data size and model accuracy over time โ€” accuracy should improve with each retraining cycle

Your Next Step

Take the task you are considering for fine-tuning. Create 50 high-quality input-output examples that demonstrate exactly the model behavior you want. Split them 40/10 (training/evaluation). Fine-tune a 7B model (Llama 3 8B via QLoRA) on the 40 examples. Evaluate on the 10 held-out examples and compare to the base model and to GPT-4 with a detailed prompt. This experiment takes half a day and answers the most important question: does fine-tuning provide meaningful improvement for this specific task? If 50 examples produce noticeable improvement, 500-2,000 examples will produce substantial improvement. If 50 examples show no improvement, the task may not be a good fit for fine-tuning, or the examples may not be demonstrating the right behavior. Either way, you have learned something essential before committing to a full fine-tuning project.

Search Articles

Categories

OperationsSalesDeliveryGovernance

Popular Tags

prompt engineeringai fundamentalsai toolsthe difference between AIMLagency operationsagency growthenterprise sales

Share Article

A

Agency Script Editorial

Editorial Team

The Agency Script editorial team delivers operational insights on AI delivery, certification, and governance for modern agency operators.

Related Articles

Delivery

Real-Time Stream Processing for AI Applications: The Complete Delivery Guide

When your client's AI model needs predictions in milliseconds instead of minutes, batch processing is not an option. Here is how to deliver production-grade stream processing for AI workloads.

A
Agency Script Editorial
March 21, 2026ยท14 min read
Delivery

Delivering Survival Analysis for Customer Retention: The AI Agency Playbook

A SaaS company knew their churn rate was 18 percent annually but could not predict when specific customers would leave. Survival analysis gave them a 90-day early warning system that saved $2.1 million in ARR.

A
Agency Script Editorial
March 21, 2026ยท13 min read
Delivery

Building Synthetic Data Generation Pipelines โ€” Creating Training Data When Real Data Is Scarce, Sensitive, or Biased

A healthcare AI company generated 500,000 synthetic patient records that preserved statistical patterns while eliminating privacy risk, cutting their model development timeline by 60%. Here is how to build synthetic data pipelines.

A
Agency Script Editorial
March 21, 2026ยท12 min read

Ready to certify your AI capability?

Join the professionals building governed, repeatable AI delivery systems.

Explore Certification