94 Percent Accuracy Still Missed the 8 Million Dollar Clause

Agency Script Editorial

Editorial Team

March 21, 2026 · 13 min read
Tags: ai evaluation, ai testing, model evaluation, ai quality assurance

A legal technology company deployed an AI contract review system that processed 400 contracts per week. The system had been evaluated on a test set of 200 contracts and achieved 94 percent accuracy. Everyone celebrated. Four months later, a client discovered that the system had missed a critical liability clause in a merger agreement, a clause that would have cost them $8 million if the deal had closed without renegotiation. When the legal tech company investigated, they found that their test set had not included a single merger agreement. Their evaluation covered only five of the twelve contract types the system encountered in production. The 94 percent accuracy was real, but only for the contract types they tested. For the types they did not test, accuracy was closer to 71 percent.

Systematic evaluation is not a nice-to-have. It is the difference between AI that works reliably and AI that works until it does not. For your agency, building evaluation harnesses is both a standalone service offering and a critical component of every AI delivery engagement.

What an Evaluation Harness Is

An evaluation harness is a reusable, automated framework for testing AI systems against comprehensive test suites that cover the full range of expected (and unexpected) inputs, conditions, and failure modes.

It is not a test set. A test set is a static collection of input-output pairs. An evaluation harness is a dynamic system that manages test datasets, runs evaluations, computes metrics, tracks results over time, and supports multiple evaluation strategies.

Components of a complete evaluation harness:

  • Test data management: Curated, versioned, and categorized test datasets
  • Evaluation runners: Automated pipelines that execute tests against model endpoints or local model instances
  • Metric computation: Pluggable metric calculators for different evaluation types (accuracy, semantic similarity, latency, cost)
  • Result storage and tracking: Historical database of evaluation results for trend analysis and regression detection
  • Reporting: Dashboards and reports that communicate evaluation results to technical and non-technical stakeholders
  • Comparison tools: Side-by-side comparison of model versions, prompt variants, or configuration changes
  • CI/CD integration: Automated evaluation triggered by model changes, data changes, or code changes
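To make the distinction between a test set and a harness concrete, here is a minimal sketch of how the components above could fit together. Every name in it (`EvalCase`, `RunResult`, `run_evaluation`, `exact_match`, the in-memory `history` list) is a hypothetical stand-in chosen for illustration, not part of any particular library, and the toy model simply answers "yes" to everything.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    case_id: str
    category: str      # e.g. contract type, useful for per-category reporting
    input_text: str
    expected: str

@dataclass
class RunResult:
    model_version: str
    dataset_version: str
    metrics: dict

def exact_match(predictions: list, cases: list) -> float:
    """Fraction of predictions that equal the expected answer exactly."""
    return sum(p == c.expected for p, c in zip(predictions, cases)) / len(cases)

def run_evaluation(model: Callable[[str], str], cases: list, metrics: dict,
                   model_version: str, dataset_version: str) -> RunResult:
    """Evaluation runner: call the model on every case, then apply each pluggable metric."""
    predictions = [model(c.input_text) for c in cases]
    scores = {name: fn(predictions, cases) for name, fn in metrics.items()}
    return RunResult(model_version, dataset_version, scores)

# Illustrative data and a toy model that always answers "yes".
cases = [
    EvalCase("c1", "nda", "Is there a non-compete clause?", "yes"),
    EvalCase("c2", "merger", "Is liability capped?", "no"),
]
history = []  # stand-in for the result store
result = run_evaluation(lambda text: "yes", cases, {"exact_match": exact_match},
                        model_version="v1", dataset_version="2026-03")
history.append(result)
print(result.metrics)  # {'exact_match': 0.5}
```

Storing the model and dataset versions alongside every score is what later makes trend analysis and regression detection possible.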

Evaluation Strategies

Strategy 1: Benchmark Evaluation

Test the AI system against a curated benchmark dataset with known correct answers.

When to use: When ground truth is available and the task can be objectively scored (classification, extraction, question answering with verifiable answers).

How to build the benchmark:

  • Comprehensive coverage: The benchmark must cover the full range of inputs the system will see in production. Stratify by input type, difficulty level, edge cases, and demographic groups.
  • Representative distribution: The distribution of cases in the benchmark should match production distribution. If 60 percent of production inputs are type A and 40 percent are type B, the benchmark should reflect that.
  • Adversarial examples: Include inputs specifically designed to trip up the model: ambiguous cases, edge cases, inputs with conflicting signals, and inputs that have historically caused errors.
  • Living dataset: The benchmark should be updated regularly to include new failure modes discovered in production, new input types, and evolving adversarial attacks.

Size guidelines (rough starting points, to be adjusted for class balance and the precision you need): Minimum 500 examples for binary classification. Minimum 100 examples per class for multi-class classification. Minimum 1,000 examples for regression. For generative tasks, 200 to 500 examples with human-evaluated reference outputs.
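The contract review story is a coverage failure, and coverage is easy to check mechanically. The sketch below reports accuracy per category rather than one aggregate number, and lists production categories that the benchmark never tests. The function names and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """records: iterable of (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}

def coverage_gaps(production_categories, benchmark_categories):
    """Categories seen in production that have no benchmark examples at all."""
    return sorted(set(production_categories) - set(benchmark_categories))

# Illustrative results: 10 NDA cases (9 correct) and 4 lease cases (3 correct).
records = ([("nda", True)] * 9 + [("nda", False)]
           + [("lease", True)] * 3 + [("lease", False)])
per_category = accuracy_by_category(records)
gaps = coverage_gaps(["nda", "lease", "merger"], per_category.keys())

print(per_category)  # {'nda': 0.9, 'lease': 0.75}
print(gaps)          # ['merger']
```

A non-empty gap list is a reason to hold the release, no matter how good the aggregate score looks.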

Strategy 2: LLM-as-Judge Evaluation

Use a large language model to evaluate the outputs of another AI system. This is essential for evaluating generative AI where there is no single correct answer.

When to use: When outputs are text, summaries, code, or creative content where quality is subjective and cannot be scored by exact match.

How to implement:

  • Define evaluation criteria (relevance, accuracy, completeness, coherence, safety, helpfulness)
  • Create a detailed rubric for each criterion with specific scoring guidelines
  • Write evaluation prompts that present the input, the system output, and the rubric to the judge model
  • Use a more capable model as the judge (for example, evaluate GPT-3.5 outputs with a GPT-4 class model)
  • Calibrate the judge against human evaluations on a subset of 50 to 100 examples
  • Track judge-human agreement rate over time

Common pitfalls:

  • Judge models have their own biases (verbose outputs are often rated higher than concise ones)
  • Judge models can be inconsistent across runs (run each evaluation 3 to 5 times and average)
  • Judge models struggle with domain-specific quality criteria (calibrate with domain experts)
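The steps and pitfalls above can be sketched in a few lines. In this sketch the judge is any callable that takes a prompt string and returns a score as text; in practice that would wrap whichever model API you use, which is deliberately left out here. `RUBRIC`, `build_judge_prompt`, `judge_score`, and `agreement_rate` are hypothetical names, and the one-point tolerance for judge-human agreement is an assumption, not a standard.

```python
from itertools import cycle
from statistics import mean

RUBRIC = "Score accuracy from 1 to 5: 5 = fully correct, 1 = mostly wrong."

def build_judge_prompt(task_input, system_output, rubric=RUBRIC):
    """Present the rubric, the original input, and the output to grade."""
    return (f"Rubric:\n{rubric}\n\nInput:\n{task_input}\n\n"
            f"Output to grade:\n{system_output}\n\nReply with a single integer score.")

def judge_score(judge, task_input, system_output, runs=3):
    """Average several judge runs to smooth out run-to-run inconsistency."""
    prompt = build_judge_prompt(task_input, system_output)
    return mean(int(judge(prompt)) for _ in range(runs))

def agreement_rate(judge_scores, human_scores, tolerance=1):
    """Share of examples where judge and human scores differ by at most `tolerance`."""
    pairs = list(zip(judge_scores, human_scores))
    return sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)

# Stub judge that returns 4, 5, 3 in turn; a real judge would call a model API.
replies = cycle(["4", "5", "3"])
fake_judge = lambda prompt: next(replies)

averaged = judge_score(fake_judge, "Summarize clause 7.", "Clause 7 caps liability.")
agreement = agreement_rate([4, 2, 5], [5, 4, 5])
print(averaged)             # 4
print(round(agreement, 2))  # 0.67
```

Tracking `agreement_rate` on the calibration subset over time is one way to notice when the judge drifts away from your human evaluators.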

Strategy 3: Human Evaluation

Use human evaluators to assess AI system outputs. This is the gold standard for quality evaluation, but it is expensive and slow.

When to use: For high-stakes applications, for calibrating automated evaluations, and for evaluating qualities that AI judges cannot assess reliably (cultural sensitivity, legal accuracy, medical safety).

How to implement:

  • Define the evaluation task with clear instructions, examples, and a scoring rubric
  • Recruit evaluators with appropriate domain expertise
  • Use multiple evaluators per example (minimum three) and measure inter-annotator agreement
  • Implement quality control (attention checks, calibration examples, evaluator consistency tracking)
  • Build a human evaluation interface that is efficient and consistent
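With three or more evaluators per example, one common way to measure inter-annotator agreement is Fleiss' kappa, which corrects raw agreement for the agreement expected by chance. The article does not name a specific statistic, so treat this choice, the function name, and the sample ratings as assumptions. The sketch requires the same number of raters for every item.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: list of per-item lists, one label per rater, same rater count per item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    label_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Proportion of rater pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    expected = sum((total / (n_items * n_raters)) ** 2
                   for total in label_totals.values())
    if expected == 1.0:
        return 1.0  # every rating is the same label, so agreement is trivially perfect
    return (observed - expected) / (1 - expected)

unanimous = fleiss_kappa([["good"] * 3, ["bad"] * 3])
split = fleiss_kappa([["good", "good", "bad"], ["bad", "bad", "good"]])
print(unanimous)        # 1.0
print(round(split, 3))  # -0.333
```

A value near 1 means the rubric is being applied consistently. A value near or below 0 usually means the instructions or the rubric need work before the scores can be trusted.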

Cost management:

  • Human evaluation is expensive ($15 to $50 per hour for general evaluators, $50 to $200 per hour for domain experts)
  • Use human evaluation strategically: for calibrating automated evaluations, for evaluating high-stakes outputs, and for periodic quality audits rather than continuous testing
  • Sample strategically: evaluate a representative sample rather than every output

Strategy 4: Behavioral Testing

Test the AI system's behavior against specific behavioral expectations, similar to unit testing in software engineering.

When to use: For verifying specific capabilities, safety properties, and invariance properties.

Test categories:

  • Capability tests: Can the model do what it is supposed to do? Test each expected capability independently.
  • Safety tests: Does the model refuse harmful requests? Does it avoid generating toxic content? Does it protect private information?
  • Invariance tests: Does the model give consistent outputs for inputs that should produce the same output? For example, a sentiment classifier should give the same sentiment for "The food was great" and "The food was really great."
  • Directional tests: Does the model's output change in the expected direction when the input changes in a specific way? For example, adding a positive adjective should increase sentiment score.
  • Robustness tests: Does the model handle typos, formatting variations, and language variations gracefully?
  • Fairness tests: Does the model perform equally across demographic groups? Test for performance disparities by gender, race, age, and other protected attributes.
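Behavioral tests read much like unit tests. The sketch below runs an invariance test, a directional test, and a robustness test against a deliberately naive word-list sentiment scorer. The scorer, its word lists, and the helper names are all invented for illustration. The toy model is meant to fail the robustness check, which shows the kind of gap these tests surface.

```python
POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"bad", "awful", "terrible"}

def toy_score(text):
    """Naive sentiment score: positive word count minus negative word count."""
    words = text.lower().replace(".", "").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def toy_label(text):
    score = toy_score(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def invariance_test(classify, base, variant):
    """Passes when a meaning-preserving change leaves the label unchanged."""
    return classify(base) == classify(variant)

def directional_test(score, base, variant):
    """Passes when the variant scores strictly higher than the base."""
    return score(variant) > score(base)

results = {
    "invariance": invariance_test(
        toy_label, "The food was great", "The food was really great"),
    "directional": directional_test(
        toy_score, "The food was great", "The food was great and excellent"),
    # The toy scorer does not strip "!!", so "great!!" is not recognized.
    "robustness": invariance_test(
        toy_label, "The food was great", "The food was GREAT!!"),
}
print(results)  # {'invariance': True, 'directional': True, 'robustness': False}
```

Each entry in `results` corresponds to one behavioral expectation, so a failure points at a specific capability rather than a drop in an aggregate score.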

Strategy 5: Production Evaluation

Evaluate the AI system's performance in production using real user interactions and outcomes.

When to use: Always, as a complement to pre-deployment evaluation. Production evaluation catches issues that no test suite can anticipate.

How to implement:

  • Log all model inputs, outputs, and metadata in production
  • Sample production interactions for evaluation (human or LLM-as-judge)
  • Correlate model outputs with downstream outcomes (did the user accept the recommendation? did the transaction succeed? did the customer call back?)
  • Track production metrics over time and set alerts for degradation
  • Feed production evaluation results back into the benchmark to improve pre-deployment testing
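Two of the steps above, sampling logged interactions and alerting on degradation, can be sketched as follows. The 5-point drop threshold, the field names, and the function names are assumptions for illustration. Outcomes are encoded as 1 when the user accepted the recommendation and 0 otherwise.

```python
import random

def sample_for_review(logged_interactions, rate, seed=0):
    """Keep each logged interaction with probability `rate` for human or LLM-as-judge review."""
    rng = random.Random(seed)
    return [row for row in logged_interactions if rng.random() < rate]

def degradation_alert(outcomes, baseline_rate, max_drop=0.05):
    """Return (alert, current_rate); alert is True when the rate falls more than max_drop below baseline."""
    current_rate = sum(outcomes) / len(outcomes)
    return current_rate < baseline_rate - max_drop, current_rate

logs = [{"id": i, "input": f"contract {i}"} for i in range(1000)]
review_queue = sample_for_review(logs, rate=0.02)

# Ten recent outcomes: 4 accepted, 6 rejected, against a 60 percent baseline.
alert, current_rate = degradation_alert([1, 1, 0, 0, 1, 0, 0, 0, 1, 0], baseline_rate=0.6)
print(alert, current_rate)  # True 0.4
```

Cases pulled into `review_queue` that turn out to be failures are natural candidates for the benchmark, which closes the feedback loop described in the last step.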

Building the Evaluation Harness

Architecture

Test data store: Version-controlled repository of test datasets with metadata (creation date, source, category, expected difficulty). Use a combination of Git (for small datasets) and cloud storage with version tags (for large datasets).

Evaluation engine: Orchestrates evaluation runs. Manages parallelism (run evaluations across multiple model instances for speed). Handles retries and error recovery. Supports multiple evaluation strategies (benchmark, LLM-as-judge, behavioral).

Metric calculator: Pluggable framework for computing evaluation metrics. Standard metrics (accuracy, F1, BLEU, ROUGE, semantic similarity) are built-in. Custom metrics can be added for domain-specific quality criteria.

Result store: Time-series database storing evaluation results with full context (model version, prompt version, test dataset version, configuration). Enables trend analysis and regression detection.

Comparison engine: Generates side-by-side comparisons of evaluation results across model versions, prompt variants, or configuration changes. Computes statistical significance of performance differences.
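For the significance check, one reasonable option when two model versions are scored on the same test cases is an exact McNemar test, which looks only at the cases where the two versions disagree. The article does not specify a test, so this choice and the function name are assumptions.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired per-case correctness lists."""
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the versions never disagree, so there is no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 50 shared test cases: version A alone is right on 1, version B alone on 9, both on 40.
correct_a = [True] * 1 + [False] * 9 + [True] * 40
correct_b = [False] * 1 + [True] * 9 + [True] * 40
p_value = mcnemar_exact(correct_a, correct_b)
print(round(p_value, 4))  # 0.0215
```

A small p-value suggests the difference is unlikely to be noise from the particular test cases. With only a handful of disagreements, even a large accuracy gap may not be distinguishable from chance.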

Reporting layer: Dashboards for engineers (detailed metrics, failure analysis) and for stakeholders (summary metrics, trend charts, quality ratings).

Delivery Timeline

Phase 1: Framework and infrastructure (Weeks 1-4)

  • Deploy the evaluation engine and result store
  • Implement standard metric calculators
  • Build the CI/CD integration for automated evaluation triggers
  • Create the reporting dashboard

Phase 2: Test suite development (Weeks 5-10)

  • Work with domain experts to build benchmark datasets
  • Create behavioral test suites for each AI system
  • Implement LLM-as-judge evaluations and calibrate against human judgments
  • Build adversarial test sets

Phase 3: Integration and automation (Weeks 11-14)

  • Integrate with the model development pipeline (evaluate every model change)
  • Integrate with production monitoring (continuous evaluation on production data)
  • Implement alerting for evaluation failures and regressions
  • Train the client's team on using and extending the evaluation harness

Evaluation for LLM Applications

LLM evaluation is particularly challenging because outputs are open-ended text with no single "correct" answer. Traditional classification metrics (accuracy, F1) rarely apply directly.

LLM-as-judge evaluation. Use a capable LLM (GPT-4, Claude) as an automated evaluator. Provide the evaluator with the input, the model's output, and evaluation criteria (relevance, accuracy, helpfulness, safety). The evaluator scores the output on each criterion. This approach scales well but requires calibration: compare LLM-as-judge scores against human evaluator scores on a sample and adjust criteria until correlation is high.

Rubric-based human evaluation. For the highest-quality evaluation, use human evaluators with detailed rubrics. Define scoring criteria (1 to 5 scale for relevance, accuracy, completeness, safety, and style), provide examples of each score level, and have multiple evaluators score each output. Human evaluation is expensive and slow but provides the ground truth that automated evaluation is calibrated against.

Reference-based evaluation. For tasks with expected outputs (summarization, translation, question answering), compare model outputs against reference outputs using metrics like ROUGE (for summarization), BLEU (for translation), or semantic similarity (embedding cosine similarity). These metrics are imperfect but provide fast, automated quality signals.
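As a rough illustration of reference-based scoring, the sketch below computes a unigram-overlap F1 (the idea behind ROUGE-1, without stemming or the other details of the official implementation) and a plain cosine similarity between two vectors. The embedding vectors themselves would come from whatever embedding model you use, which is not shown here. Function names and sample strings are assumptions.

```python
from collections import Counter
from math import sqrt

def unigram_f1(candidate, reference):
    """Simplified ROUGE-1-style F1 over whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

score = unigram_f1("the contract caps liability",
                   "the contract caps liability at one million")
print(round(score, 3))                            # 0.727
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

The example also shows why these metrics are imperfect: the candidate drops the dollar amount, arguably the most important fact, and still scores 0.727.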

Behavioral evaluation. Test specific behaviors rather than overall quality. Does the model refuse harmful requests? Does it stay within its defined scope? Does it cite sources when instructed? Does it handle ambiguous queries gracefully? Each behavior becomes a test case that can be automatically evaluated.

Evaluation Test Set Management

The quality of your evaluation depends on the quality of your test sets. Poorly constructed test sets give misleading results.

Test set construction principles: Include diverse examples that cover the full range of expected inputs โ€” common queries, edge cases, adversarial inputs, and out-of-scope queries. Balance the test set across difficulty levels and input types. Ensure test set examples are independent โ€” do not include multiple variations of the same example that would inflate the apparent sample size.

Test set maintenance: Update test sets as the application evolves and new patterns emerge in production traffic. Add test cases for failure modes discovered in production. Remove test cases that are no longer relevant. Schedule quarterly test set reviews.

Test set contamination prevention: Ensure that test set examples are never included in training data. For fine-tuned models, this means maintaining strict separation between training and evaluation data. For prompt-based systems, this means the test set should not be used as few-shot examples in prompts.
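A basic contamination check can be automated by fingerprinting normalized text and looking for test examples that also appear in the training data or few-shot prompts. This sketch only catches duplicates that match after lowercasing and whitespace normalization. Paraphrases need fuzzier matching, which is out of scope here. The function names are assumptions.

```python
import hashlib
import re

def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_contamination(test_examples, training_examples):
    """Return test examples whose fingerprint also appears in the training data."""
    training_fingerprints = {fingerprint(t) for t in training_examples}
    return [t for t in test_examples if fingerprint(t) in training_fingerprints]

leaked = find_contamination(
    ["Is liability capped?", "Who signs the agreement?"],
    ["is  liability   capped?", "What is the governing law?"],
)
print(leaked)  # ['Is liability capped?']
```

Running this check whenever either dataset changes keeps the separation from eroding silently.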

Evaluation at Different Development Stages

During development: Run evaluations frequently (daily or after every significant change) on a small, fast test set (50 to 100 examples). The goal is rapid feedback on whether changes improve or degrade quality.

Before deployment: Run comprehensive evaluation on the full test set (500 to 2,000 examples) including benchmark, behavioral, adversarial, and fairness evaluations. This is the quality gate that determines whether the model is ready for production.

In production: Run continuous evaluation on sampled production inputs. This catches quality degradation that the pre-deployment test set did not predict. Use a combination of automated evaluation and periodic human review.

Evaluation Infrastructure Scaling

As the number of AI systems and the frequency of evaluation grow, the evaluation infrastructure itself must scale.

Parallel evaluation execution. Run evaluations across multiple compute instances simultaneously. A benchmark with 2,000 test cases can take hours on a single instance but minutes when distributed across 20 instances. The evaluation harness should support automatic parallelization with result aggregation.
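On a single machine, the same idea can be sketched with a thread pool, which suits evaluations that spend most of their time waiting on model API calls. Distributing across separate compute instances needs a job queue or similar, which this sketch does not cover. The function names and the toy model are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_case(model, case):
    """Return True when the model output matches the expected output for one case."""
    input_text, expected = case
    return model(input_text) == expected

def run_parallel(model, cases, max_workers=8):
    """Evaluate cases concurrently and aggregate the results into one accuracy figure."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(lambda case: evaluate_case(model, case), cases))
    return sum(outcomes) / len(outcomes)

# Toy model: upper-cases its input. Three of the four expectations match.
cases = [("a", "A"), ("b", "B"), ("c", "x"), ("d", "D")]
accuracy = run_parallel(str.upper, cases)
print(accuracy)  # 0.75
```

`pool.map` returns results in input order, so per-case outcomes can still be joined back to their test cases for failure analysis.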

Evaluation caching. When only the prompt changes but the model stays the same, cache the model's responses for unchanged test cases and only re-evaluate cases affected by the prompt change. This dramatically reduces evaluation time for prompt optimization workflows where many small changes are tested iteratively.
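One way to sketch this: key the cache on the model version plus the fully rendered prompt, so a test case is only re-run when its rendered prompt actually changes. The class, the template dictionaries, and the toy model are hypothetical. A real cache would also need persistence and a policy for non-deterministic outputs.

```python
import hashlib
import json

class ResponseCache:
    def __init__(self):
        self._store = {}
        self.misses = 0  # number of real model calls made

    def get_or_call(self, model, model_version, rendered_prompt):
        key = hashlib.sha256(
            json.dumps([model_version, rendered_prompt]).encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = model(rendered_prompt)
        return self._store[key]

def run_suite(cache, model, model_version, templates, cases):
    """Render each case with the template for its category and fetch the response."""
    return [cache.get_or_call(model, model_version,
                              templates[c["category"]].format(text=c["text"]))
            for c in cases]

cases = [{"category": "nda", "text": "NDA one"},
         {"category": "nda", "text": "NDA two"},
         {"category": "merger", "text": "Merger one"}]
templates_v1 = {"nda": "Review this NDA: {text}",
                "merger": "Review this merger agreement: {text}"}
# Only the merger template changes in v2.
templates_v2 = dict(templates_v1,
                    merger="List every liability clause in this merger agreement: {text}")

cache = ResponseCache()
toy_model = lambda prompt: prompt.upper()
run_suite(cache, toy_model, "v1", templates_v1, cases)
misses_first_run = cache.misses
run_suite(cache, toy_model, "v1", templates_v2, cases)
misses_total = cache.misses
print(misses_first_run, misses_total)  # 3 4
```

The second run makes one model call instead of three, because the two NDA prompts are unchanged.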

Scheduled evaluation runs. Configure regular evaluation runs (nightly, weekly) that benchmark all production models against current test suites. Scheduled runs catch gradual degradation that might not be noticed during ad-hoc testing. Results are stored in the evaluation history and compared against previous runs automatically.

Evaluation cost management. For LLM evaluations where each test case requires an API call, evaluation costs can add up quickly. A 1,000-case benchmark evaluated by GPT-4 as a judge costs approximately $50 to $100 per run. Manage costs by using cheaper models for development-stage evaluations and reserving expensive evaluators for pre-production gates. Implement evaluation budgets per team and track spending against those budgets.
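The arithmetic is worth writing down, because repeated judge runs multiply the bill. The $50 to $100 figure for 1,000 cases implies roughly $0.05 to $0.10 per judged case. The sketch below uses the low end, and the budget figure and function names are illustrative assumptions.

```python
def estimate_run_cost(num_cases, cost_per_judge_call, runs_per_case=1):
    """Estimated cost of one evaluation run, in dollars."""
    return num_cases * runs_per_case * cost_per_judge_call

def remaining_budget(monthly_budget, run_costs):
    """Budget left after the runs recorded so far."""
    return monthly_budget - sum(run_costs)

single_pass = estimate_run_cost(1000, 0.05)                  # about $50
averaged = estimate_run_cost(1000, 0.05, runs_per_case=3)    # about $150
left = remaining_budget(1000.0, [single_pass, averaged])     # about $800
print(round(single_pass, 2), round(averaged, 2), round(left, 2))
```

Averaging each judgment over three runs triples the cost of a gate, which is one reason to reserve that setting for pre-production evaluations.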

Building Evaluation Into Team Workflows

The evaluation harness provides the most value when it is integrated into the daily workflow of every team that builds or maintains AI systems.

Developer-friendly interfaces. Provide a simple CLI or API that lets developers run evaluations from their development environment with a single command. If running an evaluation requires navigating a complex web interface or writing configuration files, developers will skip it. Make evaluation as easy as running a unit test.

Pull request integration. Automatically run evaluations on every pull request that changes model code, prompts, or configuration. Display evaluation results directly in the pull request review interface so reviewers can see the quality impact of the proposed change. Block merging when evaluation results show significant regression.
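A merge gate can be as simple as comparing the candidate's metrics with the stored baseline and returning a non-zero exit code when any metric drops by more than an agreed margin. The 2-point margin, the metric names, and the function names are assumptions. A CI job would pass the return value of `gate_exit_code` to `sys.exit`.

```python
def find_regressions(baseline, candidate, max_drop=0.02):
    """Return {metric: (baseline_value, candidate_value)} for metrics that dropped too far."""
    regressions = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name, 0.0)  # a missing metric counts as a regression
        if new_value < base_value - max_drop:
            regressions[name] = (base_value, new_value)
    return regressions

def gate_exit_code(baseline, candidate, max_drop=0.02):
    """0 lets the merge proceed; 1 blocks it."""
    return 1 if find_regressions(baseline, candidate, max_drop) else 0

baseline = {"overall_accuracy": 0.94, "merger_accuracy": 0.90}
candidate = {"overall_accuracy": 0.95, "merger_accuracy": 0.71}
regressions = find_regressions(baseline, candidate)
print(regressions)                          # {'merger_accuracy': (0.9, 0.71)}
print(gate_exit_code(baseline, candidate))  # 1
```

Gating on per-category metrics as well as the overall score is what stops an improved headline number from hiding a collapse in one input type.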

Evaluation leaderboards. Maintain internal leaderboards showing model performance across all benchmarks. This creates visibility into which models are improving and which are degrading. It also creates healthy competition between teams and a shared understanding of the organization's AI quality standards.

Failure analysis tooling. When a model fails on specific test cases, the evaluation harness should make it easy to analyze why. Provide tools for inspecting model inputs, outputs, and expected outputs side by side. Group failures by category (similar error patterns, specific input types) so developers can identify systematic issues rather than fixing failures one at a time.
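Grouping failures is mostly bookkeeping. A minimal sketch, assuming each result row carries a category, the input, the prediction, and the expected output (the field names and sample rows are hypothetical):

```python
from collections import defaultdict

def group_failures(results):
    """Group failed cases by category, largest group first."""
    groups = defaultdict(list)
    for row in results:
        if row["predicted"] != row["expected"]:
            groups[row["category"]].append(row)
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)

results = [
    {"category": "merger", "input": "m1", "predicted": "no clause", "expected": "liability clause"},
    {"category": "merger", "input": "m2", "predicted": "no clause", "expected": "liability clause"},
    {"category": "nda", "input": "n1", "predicted": "non-compete", "expected": "non-compete"},
    {"category": "lease", "input": "l1", "predicted": "12 months", "expected": "24 months"},
]
for category, rows in group_failures(results):
    print(category, len(rows))
# merger 2
# lease 1
```

Sorting by group size puts the most systematic problem at the top of the list, which is usually the right place to start.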

Pricing Evaluation Harness Engagements

  • Evaluation strategy design: $10,000 to $25,000
  • Harness build (single AI system): $40,000 to $100,000
  • Enterprise evaluation platform (multi-system): $100,000 to $250,000
  • Ongoing evaluation operations and test suite maintenance: $5,000 to $15,000 per month

Your Next Step

This week: Review every AI system your agency has built or is building. What evaluation strategy is in place? If the answer is "we tested it on some examples and it seemed to work," you have an immediate opportunity to add systematic evaluation.

This month: Build a reusable evaluation harness template that your team can deploy for any new AI engagement. Include benchmark evaluation, LLM-as-judge evaluation, and basic behavioral testing out of the box.

This quarter: Deliver your first standalone evaluation harness engagement. Target a client who has AI systems in production without systematic evaluation and position it as risk mitigation and quality assurance.
