94 Percent Accuracy Still Missed the 8 Million Dollar Clause

A legal technology company deployed an AI contract review system that processed 400 contracts per week. The system had been evaluated on a test set of 200 contracts and achieved 94 percent accuracy. Everyone celebrated. Four months later, a client discovered that the system had missed a critical liability clause in a merger agreement — a clause that would have cost them $8 million if the deal had closed without renegotiation. When the legal tech company investigated, they found that their test set had not included a single merger agreement. Their evaluation covered only five of the twelve contract types the system encountered in production. The 94 percent accuracy was real — for the contract types they tested. For the types they did not test, accuracy was closer to 71 percent.

Systematic evaluation is not a nice-to-have. It is the difference between AI that works reliably and AI that works until it does not. For your agency, building evaluation harnesses is both a standalone service offering and a critical component of every AI delivery engagement.

What an Evaluation Harness Is

An evaluation harness is a reusable, automated framework for testing AI systems against comprehensive test suites that cover the full range of expected (and unexpected) inputs, conditions, and failure modes.

It is not a test set. A test set is a static collection of input-output pairs. An evaluation harness is a dynamic system that manages test datasets, runs evaluations, computes metrics, tracks results over time, and supports multiple evaluation strategies.

Components of a complete evaluation harness:

Test data management: Curated, versioned, and categorized test datasets
Evaluation runners: Automated pipelines that execute tests against model endpoints or local model instances
Metric computation: Pluggable metric calculators for different evaluation types (accuracy, semantic similarity, latency, cost)
Result storage and tracking: Historical database of evaluation results for trend analysis and regression detection
Reporting: Dashboards and reports that communicate evaluation results to technical and non-technical stakeholders
Comparison tools: Side-by-side comparison of model versions, prompt variants, or configuration changes
CI/CD integration: Automated evaluation triggered by model changes, data changes, or code changes

Evaluation Strategies

Strategy 1: Benchmark Evaluation

Test the AI system against a curated benchmark dataset with known correct answers.

When to use: When ground truth is available and the task can be objectively scored (classification, extraction, question answering with verifiable answers).

How to build the benchmark:

Comprehensive coverage: The benchmark must cover the full range of inputs the system will see in production. Stratify by input type, difficulty level, edge cases, and demographic groups.
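Representative distribution: The distribution of cases in the benchmark should match production distribution. If 60 percent of production inputs are type A and 40 percent are type B, the benchmark should reflect that. Where a rare input type carries high stakes, oversample it and report accuracy per type, so that a representative mix does not hide a weak category.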
Adversarial examples: Include inputs specifically designed to trip up the model — ambiguous cases, edge cases, inputs with conflicting signals, and inputs that have historically caused errors.
Living dataset: The benchmark should be updated regularly to include new failure modes discovered in production, new input types, and evolving adversarial attacks.

Size guidelines (rules of thumb, not statistical guarantees): At least 500 examples for binary classification. At least 100 examples per class for multi-class classification. At least 1,000 examples for regression. For generative tasks, 200 to 500 examples with human-evaluated reference outputs.
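The coverage gap from the opening example can be checked mechanically. The following is a minimal sketch, assuming each benchmark record carries a hypothetical `category` field and that you maintain a list of the input types seen in production. The function names and sample data are illustrative only.

```python
from collections import defaultdict

def per_category_accuracy(records):
    """Return {category: accuracy} for records with 'category', 'expected', 'predicted'."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        if r["predicted"] == r["expected"]:
            correct[r["category"]] += 1
    return {c: correct[c] / totals[c] for c in totals}

def coverage_gaps(records, production_categories, min_examples=1):
    """Return production categories with fewer than min_examples benchmark records."""
    counts = defaultdict(int)
    for r in records:
        counts[r["category"]] += 1
    return sorted(c for c in production_categories if counts[c] < min_examples)

# Hypothetical data for illustration only.
benchmark = [
    {"category": "nda", "expected": "ok", "predicted": "ok"},
    {"category": "nda", "expected": "flag", "predicted": "ok"},
    {"category": "lease", "expected": "flag", "predicted": "flag"},
]
production_types = ["nda", "lease", "merger"]

print(per_category_accuracy(benchmark))
print(coverage_gaps(benchmark, production_types))
```

Reporting accuracy per category next to the list of uncovered categories makes it harder for a single headline number to hide an untested input type.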

Strategy 2: LLM-as-Judge Evaluation

Use a large language model to evaluate the outputs of another AI system. This is essential for evaluating generative AI where there is no single correct answer.

When to use: When outputs are text, summaries, code, or creative content where quality is subjective and cannot be scored by exact match.

How to implement:

Define evaluation criteria (relevance, accuracy, completeness, coherence, safety, helpfulness)
Create a detailed rubric for each criterion with specific scoring guidelines
Write evaluation prompts that present the input, the system output, and the rubric to the judge model
Use a more capable model as the judge (evaluate GPT-3.5 outputs with GPT-4 class models)
Calibrate the judge against human evaluations on a subset of 50 to 100 examples
Track judge-human agreement rate over time

Common pitfalls:

Judge models have their own biases (verbose outputs are often rated higher than concise ones)
Judge models can be inconsistent across runs (run each evaluation 3 to 5 times and average)
Judge models struggle with domain-specific quality criteria (calibrate with domain experts)
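Two of the steps above, averaging repeated judge runs and tracking judge-human agreement, can be sketched as follows. This assumes the judge is any callable that returns a numeric score. `FakeJudge` is a stand-in that replaces a real model call, and the 0.5-point tolerance is an arbitrary illustrative choice.

```python
from statistics import mean

def averaged_judge_score(judge, item, runs=3):
    """Call the judge several times on the same item and average the scores."""
    return mean(judge(item) for _ in range(runs))

def agreement_rate(judge_scores, human_scores, tolerance=0.5):
    """Fraction of examples where judge and human scores differ by at most tolerance."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)

class FakeJudge:
    """Stand-in judge: cycles through fixed scores to mimic run-to-run variation."""
    def __init__(self, scores):
        self.scores = scores
        self.calls = 0

    def __call__(self, item):
        score = self.scores[self.calls % len(self.scores)]
        self.calls += 1
        return score

judge = FakeJudge([4, 5, 3])
avg = averaged_judge_score(judge, {"input": "q", "output": "a"}, runs=3)

# Hypothetical calibration sample: averaged judge scores versus human scores.
rate = agreement_rate([4.0, 2.0, 5.0, 3.0], [4, 3, 5, 1])
```

In practice the agreement rate would be recomputed whenever the judge prompt, judge model, or rubric changes.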

Strategy 3: Human Evaluation

Use human evaluators to assess AI system outputs. Human evaluation is the gold standard for quality assessment, but it is expensive and slow.

When to use: For high-stakes applications, for calibrating automated evaluations, and for evaluating qualities that AI judges cannot assess reliably (cultural sensitivity, legal accuracy, medical safety).

How to implement:

Define the evaluation task with clear instructions, examples, and a scoring rubric
Recruit evaluators with appropriate domain expertise
Use multiple evaluators per example (minimum three) and measure inter-annotator agreement
Implement quality control (attention checks, calibration examples, evaluator consistency tracking)
Build a human evaluation interface that is efficient and consistent
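The text does not name an agreement statistic. Fleiss' kappa is one standard choice when three or more evaluators label every example. The sketch below is a plain-Python implementation under that assumption, where each inner list holds the labels one example received.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: list of per-item label lists, same number of raters for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)
    # Per-item agreement: share of rater pairs that chose the same label.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall label proportions.
    p_e = sum(
        (sum(cnt[cat] for cnt in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    if p_e == 1:
        return 1.0  # only one label was ever used; kappa is undefined, report 1.0
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical labels from three evaluators.
perfect = [["a", "a", "a"], ["b", "b", "b"]]
mixed = [["a", "a", "b"], ["a", "b", "b"]]
```

Values near 1 indicate strong agreement. Values near or below 0 suggest the rubric or the evaluator instructions need work before the scores are used for calibration.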

Cost management:

Human evaluation is expensive ($15 to $50 per hour for general evaluators, $50 to $200 per hour for domain experts)
Use human evaluation strategically — for calibrating automated evaluations, for evaluating high-stakes outputs, and for periodic quality audits rather than continuous testing
Sample strategically — evaluate a representative sample rather than every output

Strategy 4: Behavioral Testing

Test the AI system's behavior against specific behavioral expectations, similar to unit testing in software engineering.

When to use: For verifying specific capabilities, safety properties, and invariance properties.

Test categories:

Capability tests: Can the model do what it is supposed to do? Test each expected capability independently.
Safety tests: Does the model refuse harmful requests? Does it avoid generating toxic content? Does it protect private information?
Invariance tests: Does the model give consistent outputs for inputs that should produce the same output? For example, a sentiment classifier should assign the same sentiment label to "The food was great" and "The food was really great."
Directional tests: Does the model's output change in the expected direction when the input changes in a specific way? For example, adding a positive adjective should increase sentiment score.
Robustness tests: Does the model handle typos, formatting variations, and language variations gracefully?
Fairness tests: Does the model perform equally across demographic groups? Test for performance disparities by gender, race, age, and other protected attributes.
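The invariance and directional categories above translate directly into unit-test-style checks. This sketch uses a toy word-count scorer as a stand-in for a real model. The lexicon, function names, and example sentences are assumptions for illustration.

```python
POSITIVE = {"great", "good", "excellent", "delicious"}
NEGATIVE = {"bad", "awful", "terrible"}

def toy_sentiment_score(text):
    """Toy stand-in for a model: positive word count minus negative word count."""
    words = text.lower().replace(".", "").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def label(score):
    """Map a numeric score to a coarse sentiment label."""
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def check_invariance(model, text_a, text_b):
    """Both inputs should receive the same label."""
    return label(model(text_a)) == label(model(text_b))

def check_directional(model, base, modified):
    """The modified input should score strictly higher than the base input."""
    return model(modified) > model(base)

invariance_ok = check_invariance(
    toy_sentiment_score, "The food was great", "The food was really great."
)
directional_ok = check_directional(
    toy_sentiment_score, "The food was great", "The food was great and delicious"
)
```

Swapping `toy_sentiment_score` for a function that calls the system under test turns each check into a reusable behavioral test case.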

Strategy 5: Production Evaluation

Evaluate the AI system's performance in production using real user interactions and outcomes.

When to use: Always, as a complement to pre-deployment evaluation. Production evaluation catches issues that no test suite can anticipate.

How to implement:

Log all model inputs, outputs, and metadata in production
Sample production interactions for evaluation (human or LLM-as-judge)
Correlate model outputs with downstream outcomes (did the user accept the recommendation? did the transaction succeed? did the customer call back?)
Track production metrics over time and set alerts for degradation
Feed production evaluation results back into the benchmark to improve pre-deployment testing
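Tracking a production metric and alerting on degradation can be as simple as a rolling window over downstream outcomes. The sketch below assumes a boolean outcome per interaction (for example, whether the user accepted a recommendation). The class name, window size, and threshold are hypothetical.

```python
from collections import deque

class RollingRateMonitor:
    """Track a success rate over the last `window` outcomes and flag degradation."""

    def __init__(self, window, threshold):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, success):
        self.outcomes.append(bool(success))

    def rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def degraded(self):
        # Only alert once the window is full, to avoid noisy early alerts.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.rate() < self.threshold

monitor = RollingRateMonitor(window=5, threshold=0.8)
for accepted in [True, True, True, True, True]:
    monitor.record(accepted)
healthy = monitor.degraded()

for accepted in [False, False]:
    monitor.record(accepted)
alerting = monitor.degraded()
```

A real deployment would read outcomes from the production log and route the alert to an on-call channel. The interactions that triggered the alert are good candidates for new benchmark cases.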

Building the Evaluation Harness

Architecture

Test data store: Version-controlled repository of test datasets with metadata (creation date, source, category, expected difficulty). Use a combination of Git (for small datasets) and cloud storage with version tags (for large datasets).

Evaluation engine: Orchestrates evaluation runs. Manages parallelism (run evaluations across multiple model instances for speed). Handles retries and error recovery. Supports multiple evaluation strategies (benchmark, LLM-as-judge, behavioral).

Metric calculator: Pluggable framework for computing evaluation metrics. Standard metrics (accuracy, F1, BLEU, ROUGE, semantic similarity) are built-in. Custom metrics can be added for domain-specific quality criteria.
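One way to make metrics pluggable is a small registry that maps metric names to functions. This is a sketch under that assumption. The registry, decorator, and metric names are illustrative, not a reference to any particular framework.

```python
METRICS = {}

def metric(name):
    """Decorator that registers a metric function under a name."""
    def register(fn):
        METRICS[name] = fn
        return fn
    return register

@metric("exact_match")
def exact_match(predictions, references):
    """Share of predictions that equal their reference exactly."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

@metric("mean_length")
def mean_length(predictions, references):
    """Average prediction length in characters (a simple custom metric)."""
    return sum(len(p) for p in predictions) / len(predictions)

def compute_metrics(names, predictions, references):
    """Run the requested metrics and return {name: value}."""
    return {name: METRICS[name](predictions, references) for name in names}

scores = compute_metrics(["exact_match", "mean_length"], ["yes", "no"], ["yes", "yes"])
```

A domain-specific metric is added by writing one more decorated function, without touching the evaluation engine.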

Result store: Time-series database storing evaluation results with full context (model version, prompt version, test dataset version, configuration). Enables trend analysis and regression detection.
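As a small-scale illustration of storing results with full context, the sketch below uses Python's built-in `sqlite3` module in place of a dedicated time-series database. The table layout, version strings, timestamps, and the 0.02 regression threshold are all assumptions.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a result store with one row per metric per run."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS eval_results (
               run_at TEXT, model_version TEXT, prompt_version TEXT,
               dataset_version TEXT, metric TEXT, value REAL)"""
    )
    return conn

def record_result(conn, run_at, model_version, prompt_version,
                  dataset_version, metric, value):
    conn.execute(
        "INSERT INTO eval_results VALUES (?, ?, ?, ?, ?, ?)",
        (run_at, model_version, prompt_version, dataset_version, metric, value),
    )
    conn.commit()

def latest_regression(conn, metric, dataset_version, max_drop=0.02):
    """Compare the two most recent runs; return the drop if it exceeds max_drop."""
    rows = conn.execute(
        "SELECT value FROM eval_results WHERE metric = ? AND dataset_version = ? "
        "ORDER BY run_at DESC LIMIT 2",
        (metric, dataset_version),
    ).fetchall()
    if len(rows) < 2:
        return None
    drop = rows[1][0] - rows[0][0]  # previous value minus latest value
    return drop if drop > max_drop else None

conn = open_store()
record_result(conn, "2024-01-01T00:00:00", "m1", "p1", "d1", "accuracy", 0.94)
record_result(conn, "2024-01-02T00:00:00", "m2", "p1", "d1", "accuracy", 0.90)
drop = latest_regression(conn, "accuracy", "d1")
```

Filtering on the dataset version matters: a score change caused by a new test set is not a model regression.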

Comparison engine: Generates side-by-side comparisons of evaluation results across model versions, prompt variants, or configuration changes. Computes statistical significance of performance differences.
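The text does not specify a significance test. When two model versions are scored on the same test cases with pass/fail outcomes, the exact McNemar test is one common option, and the sketch below implements it with the standard library only. The input lists are hypothetical per-example correctness flags.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-example correctness flags."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")
    # Only the discordant pairs carry information about which system is better.
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical results: version A passes all ten cases, version B passes two.
version_a = [True] * 10
version_b = [True] * 2 + [False] * 8
p_value = mcnemar_exact(version_a, version_b)
```

A small p-value suggests the difference between versions is unlikely to be noise from the particular test cases. For continuous scores, a paired bootstrap is a common alternative.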

Reporting layer: Dashboards for engineers (detailed metrics, failure analysis) and for stakeholders (summary metrics, trend charts, quality ratings).

Delivery Timeline

Phase 1: Framework and infrastructure (Weeks 1-4)

Deploy the evaluation engine and result store
Implement standard metric calculators
Build the CI/CD integration for automated evaluation triggers
Create the reporting dashboard

Phase 2: Test suite development (Weeks 5-10)

Work with domain experts to build benchmark datasets
Create behavioral test suites for each AI system
Implement LLM-as-judge evaluations and calibrate against human judgments
Build adversarial test sets

Phase 3: Integration and automation (Weeks 11-14)

Integrate with the model development pipeline (evaluate every model change)
Integrate with production monitoring (continuous evaluation on production data)
Implement alerting for evaluation failures and regressions
Train the client's team on using and extending the evaluation harness

Evaluation for LLM Applications

LLM evaluation is particularly challenging because outputs are open-ended text with no single "correct" answer. Exact-match metrics such as accuracy and F1 rarely apply directly, so the approaches below are usually combined.

LLM-as-judge evaluation. Use a capable LLM (for example, GPT-4 or Claude) as an automated evaluator. Provide the evaluator with the input, the model's output, and evaluation criteria (relevance, accuracy, helpfulness, safety). The evaluator scores the output on each criterion. This approach scales well but requires calibration: compare LLM-as-judge scores against human evaluator scores on a sample, and revise the criteria and rubric until the two agree closely.

Rubric-based human evaluation. For the highest-quality evaluation, use human evaluators with detailed rubrics. Define scoring criteria (1 to 5 scale for relevance, accuracy, completeness, safety, and style), provide examples of each score level, and have multiple evaluators score each output. Human evaluation is expensive and slow but provides the ground truth that automated evaluation is calibrated against.

Reference-based evaluation. For tasks with expected outputs (summarization, translation, question answering), compare model outputs against reference outputs using metrics like ROUGE (for summarization), BLEU (for translation), or semantic similarity (embedding cosine similarity). These metrics are imperfect but provide fast, automated quality signals.
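To show the shape of these metrics, here is a simplified sketch: a ROUGE-1-style unigram F1 on whitespace tokens and a cosine similarity over embedding vectors. It is not the official ROUGE implementation (no stemming or tokenization rules), and the vectors would come from whatever embedding model you use.

```python
from collections import Counter
from math import sqrt

def unigram_f1(candidate, reference):
    """Simplified ROUGE-1-style F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

identical = unigram_f1("the cat sat", "the cat sat")
partial = unigram_f1("the cat", "the cat sat on")
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

For reported results, an established implementation of ROUGE or BLEU is preferable, since tokenization details change the scores.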

Behavioral evaluation. Test specific behaviors rather than overall quality. Does the model refuse harmful requests? Does it stay within its defined scope? Does it cite sources when instructed? Does it handle ambiguous queries gracefully? Each behavior becomes a test case that can be automatically evaluated.

Evaluation Test Set Management

The quality of your evaluation depends on the quality of your test sets. Poorly constructed test sets give misleading results.

Test set construction principles: Include diverse examples that cover the full range of expected inputs — common queries, edge cases, adversarial inputs, and out-of-scope queries. Balance the test set across difficulty levels and input types. Ensure test set examples are independent — do not include multiple variations of the same example that would inflate the apparent sample size.

Test set maintenance: Update test sets as the application evolves and new patterns emerge in production traffic. Add test cases for failure modes discovered in production. Remove test cases that are no longer relevant. Schedule quarterly test set reviews.

Test set contamination prevention: Ensure that test set examples are never included in training data. For fine-tuned models, this means maintaining strict separation between training and evaluation data. For prompt-based systems, this means the test set should not be used as few-shot examples in prompts.
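A basic contamination check compares normalized fingerprints of test inputs against training data or few-shot examples. The sketch below assumes plain-text inputs and catches only exact or near-exact matches after normalization. Paraphrased leaks would need fuzzier matching. The sample strings are hypothetical.

```python
import hashlib
import re

def fingerprint(text):
    """Hash of the text after lowercasing and collapsing non-alphanumeric runs."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_contamination(test_inputs, other_inputs):
    """Return test inputs whose normalized form also appears in other_inputs."""
    seen = {fingerprint(t) for t in other_inputs}
    return [t for t in test_inputs if fingerprint(t) in seen]

test_set = ["What is the notice period?", "Who bears liability for delays?"]
few_shot = ["what is the notice period", "Summarize the indemnity clause."]
leaks = find_contamination(test_set, few_shot)
```

Running this check whenever training data or prompt examples change keeps the separation from eroding silently.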

Evaluation at Different Development Stages

During development: Run evaluations frequently (daily or after every significant change) on a small, fast test set (50 to 100 examples). The goal is rapid feedback on whether changes improve or degrade quality.

Before deployment: Run comprehensive evaluation on the full test set (500 to 2,000 examples) including benchmark, behavioral, adversarial, and fairness evaluations. This is the quality gate that determines whether the model is ready for production.

In production: Run continuous evaluation on sampled production inputs. This catches quality degradation that the pre-deployment test set did not predict. Use a combination of automated evaluation and periodic human review.

Evaluation Infrastructure Scaling

As the number of AI systems and the frequency of evaluation grow, the evaluation infrastructure itself must scale.

Parallel evaluation execution. Run evaluations across multiple compute instances simultaneously. A benchmark with 2,000 test cases can take hours on a single instance but minutes when distributed across 20 instances. The evaluation harness should support automatic parallelization with result aggregation.
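Within a single machine, the same idea can be sketched with a thread pool, which suits test cases that spend most of their time waiting on a model endpoint. Distributing across instances would need a job queue or similar. `fake_evaluate` is a stand-in for a real model call, and the case fields are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(evaluate_case, cases, max_workers=8):
    """Evaluate cases concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_case, cases))

def fake_evaluate(case):
    # Stand-in for a network call to a model endpoint.
    return {"id": case["id"], "passed": case["expected"] == case["input"].upper()}

cases = [
    {"id": i, "input": word, "expected": word.upper()}
    for i, word in enumerate(["a", "b", "c"])
]
results = run_parallel(fake_evaluate, cases, max_workers=2)

# Aggregate the per-case results into a single pass rate.
pass_rate = sum(r["passed"] for r in results) / len(results)
```

Because `pool.map` preserves input order, aggregation can line results up with test cases without extra bookkeeping.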

Evaluation caching. Cache model responses keyed by the model version and the exact prompt text sent for each test case. When a change alters the prompt for only some test cases (for example, one template among several), re-run only those cases and reuse the cached responses for the rest. This dramatically reduces evaluation time for prompt optimization workflows where many small changes are tested iteratively.
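A minimal sketch of the caching idea above, assuming responses are keyed by model version plus the fully rendered prompt, so a test case counts as unchanged when the text sent to the model is identical. It also assumes deterministic decoding. With sampling enabled, cached responses are only one possible output. The templates and `fake_model` are hypothetical.

```python
import hashlib
import json

class ResponseCache:
    """In-memory cache keyed by model version and rendered prompt."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    @staticmethod
    def key(model_version, rendered_prompt):
        payload = json.dumps([model_version, rendered_prompt])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model_version, rendered_prompt, call_model):
        k = self.key(model_version, rendered_prompt)
        if k not in self._store:
            self.misses += 1  # a miss means the model is actually called
            self._store[k] = call_model(rendered_prompt)
        return self._store[k]

def fake_model(prompt):
    # Stand-in for a real model call.
    return prompt[::-1]

cache = ResponseCache()
templates_v1 = {"nda": "Review NDA: {text}", "lease": "Review lease: {text}"}
templates_v2 = {**templates_v1, "lease": "Check lease carefully: {text}"}
cases = [("nda", "clause 1"), ("lease", "clause 2")]

# Second pass changes only the lease template, so only the lease case is re-run.
for templates in (templates_v1, templates_v2):
    for category, text in cases:
        cache.get_or_call("m1", templates[category].format(text=text), fake_model)
```

A production cache would persist to disk or a shared store so results survive across runs and team members.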

Scheduled evaluation runs. Configure regular evaluation runs (nightly, weekly) that benchmark all production models against current test suites. Scheduled runs catch gradual degradation that might not be noticed during ad-hoc testing. Results are stored in the evaluation history and compared against previous runs automatically.

Evaluation cost management. For LLM evaluations where each test case requires an API call, evaluation costs can add up quickly. A 1,000-case benchmark evaluated by GPT-4 as a judge can cost on the order of $50 to $100 per run, depending on prompt length, output length, and current API pricing. Manage costs by using cheaper models for development-stage evaluations and reserving expensive evaluators for pre-production gates. Implement evaluation budgets per team and track spending against those budgets.
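The per-run cost is simple arithmetic over token counts and prices. In the sketch below, every price and token count is hypothetical and chosen only to make the calculation concrete. Substitute your provider's current rates.

```python
def judge_run_cost(n_cases, input_tokens, output_tokens,
                   usd_per_1k_input, usd_per_1k_output, runs_per_case=1):
    """Estimated USD cost of one judge pass over a benchmark."""
    per_call = ((input_tokens / 1000) * usd_per_1k_input
                + (output_tokens / 1000) * usd_per_1k_output)
    return n_cases * runs_per_case * per_call

# Hypothetical inputs: 1,000 cases, 2,000 input and 500 output tokens per judge call.
cost = judge_run_cost(
    n_cases=1000, input_tokens=2000, output_tokens=500,
    usd_per_1k_input=0.03, usd_per_1k_output=0.06,
)
print(round(cost, 2))
```

Under these assumed numbers one pass costs about $90, and repeating each judgment three times to average out inconsistency triples it. Repeated runs are therefore worth budgeting for explicitly.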

Building Evaluation Into Team Workflows

The evaluation harness provides the most value when it is integrated into the daily workflow of every team that builds or maintains AI systems.

Developer-friendly interfaces. Provide a simple CLI or API that lets developers run evaluations from their development environment with a single command. If running an evaluation requires navigating a complex web interface or writing configuration files, developers will skip it. Make evaluation as easy as running a unit test.

Pull request integration. Automatically run evaluations on every pull request that changes model code, prompts, or configuration. Display evaluation results directly in the pull request review interface so reviewers can see the quality impact of the proposed change. Block merging when evaluation results show significant regression.
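The merge-blocking step can be sketched as a gate that compares candidate metrics to a baseline and returns a nonzero exit code on regression. This assumes all metrics are higher-is-better. The metric names, values, and 0.02 threshold are hypothetical.

```python
def regression_gate(baseline, candidate, max_drop=0.02):
    """Return the metrics whose value fell by more than max_drop versus baseline."""
    failures = {}
    for metric, base_value in baseline.items():
        new_value = candidate.get(metric)
        # A metric missing from the candidate run is treated as a failure.
        if new_value is None or base_value - new_value > max_drop:
            failures[metric] = (base_value, new_value)
    return failures

def exit_code(failures):
    """A nonzero exit code lets a CI job fail and block the merge."""
    return 1 if failures else 0

baseline = {"accuracy": 0.94, "safety_pass_rate": 1.00}
candidate = {"accuracy": 0.95, "safety_pass_rate": 0.90}
failures = regression_gate(baseline, candidate)

for metric, (old, new) in failures.items():
    print(f"REGRESSION {metric}: {old} -> {new}")
```

In a CI job, the script would end with `sys.exit(exit_code(failures))`, and the printed lines would appear in the pull request's check output.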

Evaluation leaderboards. Maintain internal leaderboards showing model performance across all benchmarks. This creates visibility into which models are improving and which are degrading. It also creates healthy competition between teams and a shared understanding of the organization's AI quality standards.

Failure analysis tooling. When a model fails on specific test cases, the evaluation harness should make it easy to analyze why. Provide tools for inspecting model inputs, outputs, and expected outputs side by side. Group failures by category (similar error patterns, specific input types) so developers can identify systematic issues rather than fixing failures one at a time.

Pricing Evaluation Harness Engagements

Evaluation strategy design: $10,000 to $25,000
Harness build (single AI system): $40,000 to $100,000
Enterprise evaluation platform (multi-system): $100,000 to $250,000
Ongoing evaluation operations and test suite maintenance: $5,000 to $15,000 per month

Your Next Step

This week: Review every AI system your agency has built or is building. What evaluation strategy is in place? If the answer is "we tested it on some examples and it seemed to work," you have an immediate opportunity to add systematic evaluation.

This month: Build a reusable evaluation harness template that your team can deploy for any new AI engagement. Include benchmark evaluation, LLM-as-judge evaluation, and basic behavioral testing out of the box.

This quarter: Deliver your first standalone evaluation harness engagement. Target a client who has AI systems in production without systematic evaluation and position it as risk mitigation and quality assurance.

Agency Script Editorial
