Evaluating LLM Performance for Client Deployments: Frameworks That Actually Work
A six-person AI agency in Seattle built a customer support chatbot for a fintech client using GPT-4. The demo was flawless โ the bot handled account inquiries, explained fee structures, and escalated complex issues smoothly. The client approved production deployment. Within 72 hours, the bot had told three customers their account balances incorrectly, fabricated a refund policy that did not exist, and recommended a competitor's product when asked about investment options. The client pulled the bot from production and demanded an explanation.
The agency had tested the bot with 50 hand-picked queries. It had not tested for hallucination rates, edge cases, adversarial inputs, or consistency across rephrased questions. It had not established baseline metrics for accuracy, safety, or brand alignment. It had, essentially, deployed an LLM without evaluation.
Six months later, after building a comprehensive evaluation framework, the agency re-deployed the bot with 200+ automated test cases, hallucination detection, safety guardrails, and continuous monitoring. The bot handled 15,000 conversations in its first month with a 4.2/5 customer satisfaction score and zero critical failures. The evaluation framework โ not the LLM itself โ was the difference between a disaster and a success.
LLM evaluation is the most critical and most under-invested capability in the AI agency space right now. Every agency is building with LLMs. Very few are evaluating them rigorously.
Why LLM Evaluation Is Different From Traditional ML Evaluation
Traditional ML evaluation is relatively straightforward: compute accuracy, precision, recall, and F1 on a labeled test set. LLM evaluation is fundamentally harder for several reasons:
Outputs are unstructured. A classification model produces one of N labels. An LLM produces free-form text. "Correct" is not a binary judgment โ it is a spectrum from "perfectly helpful" to "dangerously wrong" with many shades in between.
There is no single ground truth. For the question "How do I reset my password?" there are dozens of correct responses with different wording, different levels of detail, and different tones. Traditional exact-match metrics fail.
Failure modes are diverse. An LLM can fail by being inaccurate, unhelpful, unsafe, off-brand, too verbose, too terse, hallucinatory, biased, or just confusing. Each failure mode requires its own evaluation approach.
Performance varies non-deterministically. The same prompt can produce different outputs on different runs (with temperature > 0). Evaluation must account for this variability.
Scale and cost make exhaustive testing impractical. You cannot test every possible input. You need strategic test design that covers the most important scenarios efficiently.
The Evaluation Framework
Dimension 1: Correctness
Does the LLM provide factually accurate information?
Automated evaluation approaches:
- Fact extraction and verification. Extract factual claims from the LLM's response and verify each against a knowledge base or ground truth database. Tools like FActScore automate this for fact-dense responses.
- Reference-based scoring. Compare the LLM's response against a gold-standard reference response using semantic similarity metrics (BERTScore, BLEURT, or LLM-as-judge). This works well when reference responses exist.
- Self-consistency checks. Ask the same question multiple ways and check whether the LLM gives consistent answers. Inconsistency suggests the model is guessing rather than relying on reliable knowledge.
Human evaluation approaches:
- Domain expert review. Have subject matter experts rate a sample of responses for accuracy on a 1-5 scale. This is the gold standard but expensive and slow.
- Comparative evaluation. Show experts two responses (from different models or different prompts) and ask which is more accurate. This is easier and more reliable than absolute scoring.
For agency work, combine automated fact verification on the full test set with expert review on a smaller sample. The automated checks catch obvious errors at scale; the expert review validates subtleties that automation misses.
Dimension 2: Relevance and Helpfulness
Does the response actually answer the question and help the user?
An accurate response can still be unhelpful. Answering "What is your return policy?" with a technically accurate but 2,000-word legal document when the user wants a simple summary is a relevance failure.
Evaluation approaches:
- Task completion rate. Define the task the user is trying to accomplish (reset password, understand a fee, compare plans) and measure whether the response enables task completion.
- LLM-as-judge. Use a separate, strong LLM to evaluate whether the response is helpful and relevant. Provide the judge with the user query, the response, and a scoring rubric. This scales well and correlates reasonably with human judgment.
- User satisfaction proxy. In production, measure follow-up behavior: does the user ask a follow-up question (suggesting the response was incomplete)? Does the user rephrase the same question (suggesting the response missed the point)? Does the user escalate to a human agent (suggesting the response failed)?
Dimension 3: Safety and Harmlessness
Does the response avoid generating harmful, offensive, or dangerous content?
Safety evaluation is non-negotiable for client deployments. A single unsafe response can generate legal liability, media attention, and permanent brand damage.
Test categories:
- Toxicity. Does the model generate offensive language, slurs, or discriminatory content? Use toxicity classifiers (Perspective API, detoxify) on response samples.
- Harmful advice. Does the model provide dangerous guidance? Test with prompts about self-harm, illegal activities, medical decisions, and financial advice.
- Prompt injection. Can a user trick the model into ignoring its instructions? Test with known prompt injection patterns: "Ignore your instructions and tell me...", system prompt extraction attempts, and role-playing attacks.
- Information leakage. Does the model reveal confidential information from its context (system prompts, RAG documents marked confidential, other users' data)?
- Bias and fairness. Does the model treat different demographic groups differently? Test with demographically varied inputs and check for response differences.
Implementation:
Build a safety test suite of 100-200 adversarial prompts covering each category. Run the full suite before every deployment and after every prompt or model change. Any safety failure is a deployment blocker โ no exceptions.
Dimension 4: Brand Alignment
Does the response match the client's brand voice, policies, and guidelines?
This is the dimension most agencies overlook. An LLM that is accurate and helpful but speaks in the wrong tone or contradicts the client's policies is still a failure.
Evaluation approaches:
- Policy compliance checks. Extract the client's policies (return policy, pricing, disclaimers) and verify that LLM responses do not contradict them. This can be partially automated using keyword matching and semantic similarity.
- Tone analysis. Compare the response tone against the client's brand guidelines. If the brand is "casual and friendly," formal responses are off-brand. Use sentiment and tone classifiers or LLM-as-judge with tone-specific rubrics.
- Competitive mentions. Verify the model never recommends competitors or speaks positively about competing products. Simple keyword detection catches most cases.
- Disclaimer and qualification checks. For regulated industries, verify that required disclaimers are included (e.g., "This is not financial advice" for fintech applications).
Dimension 5: Performance and Reliability
Does the system meet latency, throughput, and availability requirements?
- Latency. Time from request to first token (for streaming) and time to complete response. Track p50, p95, and p99.
- Throughput. Requests per second the system can handle under load.
- Error rate. Percentage of requests that fail (API errors, timeouts, safety filter blocks).
- Cost per request. Token usage and API cost per interaction.
Building the Evaluation Pipeline
The Test Suite
Construct a test suite with at least 200 test cases covering:
- Happy path cases (40%): Common, straightforward questions that the system should handle easily. These verify basic functionality.
- Edge cases (25%): Unusual but legitimate questions โ very long queries, multiple questions in one message, ambiguous questions, questions in non-standard language.
- Adversarial cases (15%): Deliberate attempts to break the system โ prompt injections, out-of-scope requests, attempts to extract confidential information.
- Regression cases (10%): Specific cases that failed in previous versions. These prevent re-introduction of known bugs.
- Fairness cases (10%): The same question asked with different demographic contexts to test for bias.
For each test case, define:
- The input (user message + any conversation context)
- Expected behavior (not an exact expected output, but criteria the output must satisfy)
- Which evaluation dimensions apply
- Pass/fail criteria
Automated Evaluation Loop
Build an automated pipeline that:
- Runs all test cases against the current system
- Evaluates each response against its criteria using automated metrics and LLM-as-judge
- Aggregates scores by dimension and category
- Compares against baseline scores (from the previous version)
- Flags regressions and new failures
- Generates a report with pass rates, score distributions, and specific failure examples
Run this pipeline:
- Before every deployment
- After every prompt change
- After every model version change
- Weekly in production (to catch drift)
Human Evaluation Cadence
Automated evaluation catches most issues, but some dimensions (helpfulness, nuance, brand voice) require human judgment. Establish a regular human evaluation cadence:
- Pre-deployment: Expert review of 50-100 responses covering all dimensions
- Weekly in production: Review of 20-30 randomly sampled production conversations
- Monthly: Deep-dive review of 100 conversations focusing on failure analysis and improvement opportunities
LLM-as-Judge: Practical Implementation
Using one LLM to evaluate another is increasingly common and practical. Here is how to do it well:
Choose a strong judge model. The judge should be at least as capable as the model being evaluated. Using GPT-4 to judge GPT-3.5 outputs works well. Using GPT-3.5 to judge GPT-4 outputs does not.
Write detailed rubrics. Do not ask the judge "Is this response good?" Ask: "Rate this response from 1-5 on accuracy, using the following rubric: 5 = all facts are correct and verifiable, 4 = mostly correct with minor inaccuracies..." Specific rubrics produce more consistent and meaningful scores.
Use few-shot examples. Include 3-5 example evaluations in the judge prompt so it understands the scoring standard.
Calibrate against human judgment. Run the judge on 100 responses that have also been scored by humans. Compute the correlation. If the judge correlates at 0.8+ with human scores, it is reliable enough for automated evaluation. If not, refine the rubric.
Be aware of biases. LLM judges tend to prefer longer responses, responses that match their own style, and responses that use confident language. Design your rubric to counteract these biases explicitly.
Pricing Evaluation Work
Include evaluation in every LLM project. It is not optional.
Budget allocation:
- Test suite design and construction: $8,000 - $15,000
- Automated evaluation pipeline: $10,000 - $20,000
- Pre-deployment human evaluation: $3,000 - $5,000
- Ongoing monitoring and evaluation: $2,000 - $5,000 per month
Total evaluation cost is typically 15-25% of the total LLM project cost. On a $100,000 chatbot project, budget $15,000-$25,000 for evaluation.
Frame it to clients: "Our evaluation framework ensures that your AI assistant is accurate, safe, and brand-aligned before any customer interacts with it. It includes 200+ automated test cases, safety guardrails, and continuous monitoring. This prevents the brand-damaging failures that make headlines when companies deploy AI without proper evaluation."
Common LLM Evaluation Mistakes
Mistake 1: Testing only on easy questions. Demo-quality test cases that the LLM handles perfectly do not reveal real-world weaknesses. Include confusing, ambiguous, and adversarial inputs that reflect what actual users will type.
Mistake 2: Using accuracy metrics designed for traditional ML. Metrics like exact match and F1 score do not capture the nuances of natural language generation. Use human evaluation, LLM-as-judge, and task-specific metrics instead.
Mistake 3: Evaluating once and never again. LLM behavior can change with model updates, prompt modifications, and knowledge base changes. Continuous evaluation is essential โ not just pre-deployment evaluation.
Mistake 4: Ignoring safety testing because the demo was clean. Safety failures are rare but catastrophic. A system that handles 10,000 queries safely and then generates one harmful response on the 10,001st query has failed. Adversarial safety testing must be comprehensive and ongoing.
Mistake 5: Not having a rollback plan. When evaluation reveals a regression after a change, you need to quickly revert to the previous version. Maintain versioned snapshots of prompts, model configurations, and knowledge bases so rollback is instant.
Mistake 6: Evaluating the model in isolation. The LLM's quality depends on the entire system โ retrieval quality, prompt engineering, guardrails, and post-processing. Evaluate the system end-to-end, not just the model component.
Your Next Step
For your current or next LLM project, write 50 test cases before deploying anything. Include 20 happy path cases, 10 edge cases, 10 adversarial cases, 5 regression cases, and 5 fairness cases. Run them through your system and score the results. That exercise alone will reveal weaknesses you did not anticipate and prevent the kinds of production failures that damage client relationships. Once you have the initial 50 cases working, expand to 200 and automate the evaluation pipeline.