AGENCYSCRIPT

Operations

Quality Assurance Playbook for AI Delivery: Building Quality Into Every Project

Agency Script Editorial

Editorial Team

March 21, 2026 · 14 min read
quality assurance, QA process, delivery quality, testing standards

A 30-person AI agency in San Francisco delivered a demand forecasting model to a retail client. The model passed all internal tests and performed well on the test dataset. Two weeks after deployment, the client reported that the model's predictions were wildly inaccurate for 15% of their product catalog. The investigation revealed a systematic gap in the training data: seasonal products were underrepresented because the training window did not cover a full seasonal cycle. A thorough data quality review during development would have caught the issue, but the team had rushed past that step to meet a tight deadline. Fixing the model, retraining on expanded data, and repairing the client relationship cost $78,000, more than the budget overrun the team had been trying to avoid by cutting corners.

Quality assurance in AI delivery is fundamentally different from traditional software QA. AI systems are probabilistic rather than deterministic: their outputs are statistical estimates, and some systems do not produce the same output for the same input. They depend on data quality, which is variable and often outside your control. And they can fail silently, producing plausible but incorrect results that go undetected until they cause real-world damage. Building quality into AI delivery requires a comprehensive approach that addresses data quality, model quality, system quality, and process quality.

The Quality Assurance Framework

Layer 1: Data Quality

Data quality is the foundation of AI quality. Every model is only as good as the data it was trained on.

Data quality dimensions:

  • Completeness: Are there missing values, gaps in time periods, or absent categories?
  • Accuracy: Does the data correctly represent reality? Are labels correct?
  • Consistency: Is the data internally consistent? Do related fields agree?
  • Timeliness: Is the data current enough for the intended use?
  • Representativeness: Does the data reflect the full distribution of real-world inputs, including edge cases and minority categories?
  • Bias: Does the data contain systematic biases that could lead to unfair or inaccurate model behavior?

Data quality checks:

  • Profiling: Before starting model development, profile the data: distributions, missing values, outliers, correlations. Document findings and share with the client.
  • Validation rules: Define and implement automated checks for data quality dimensions (null checks, range checks, consistency checks, format checks).
  • Sample review: Manually review a random sample of records for accuracy and labeling quality.
  • Drift monitoring: Compare training data distributions to production data distributions to detect when the real-world data diverges from what the model was trained on.
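
The validation rules above can be automated with very little code. The following is a minimal sketch, not a complete data quality suite: the field names (`sku`, `units`, `sold_on`), the allowed range, and the sample rows are all hypothetical. It runs null and range checks, then checks whether every calendar month appears in the data, which is the kind of coverage gap that caused the seasonal-product failure described earlier.

```python
from datetime import date

def check_records(records, required, ranges):
    """Return (row_index, problem) tuples for null and range violations."""
    problems = []
    for i, row in enumerate(records):
        for field in required:
            if row.get(field) is None:
                problems.append((i, f"{field} is missing"))
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value is not None and not (low <= value <= high):
                problems.append((i, f"{field}={value} outside [{low}, {high}]"))
    return problems

def missing_months(dates):
    """Return the calendar months (1-12) that never appear in the data."""
    seen = {d.month for d in dates}
    return sorted(set(range(1, 13)) - seen)

# Hypothetical sample: sales rows that cover only January through March.
rows = [
    {"sku": "A1", "units": 12, "sold_on": date(2025, 1, 15)},
    {"sku": "B2", "units": None, "sold_on": date(2025, 2, 3)},
    {"sku": "C3", "units": -4, "sold_on": date(2025, 3, 9)},
]

issues = check_records(rows, required=["sku", "units"], ranges={"units": (0, 10_000)})
gaps = missing_months(r["sold_on"] for r in rows)

print(issues)  # one missing value and one out-of-range value
print(gaps)    # months with no data at all
```

On this sample, `issues` flags the missing `units` in row 1 and the negative `units` in row 2, and `gaps` lists April through December. A non-empty `gaps` list is a prompt to ask the client for a longer training window before modeling begins.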

Layer 2: Model Quality

Model quality encompasses performance, fairness, robustness, and explainability.

Model performance testing:

  • Holdout test set: Evaluate model performance on a dataset that was not used during training. This is the minimum quality gate for any model.
  • Cross-validation: For smaller datasets, use k-fold cross-validation to get a more reliable performance estimate.
  • Performance across segments: Evaluate model performance across different data segments (e.g., customer types, geographic regions, product categories). A model that performs well on average but poorly for specific segments may not be acceptable.
  • Edge case testing: Test model behavior on unusual or extreme inputs. What happens with missing features, outlier values, or combinations that are rare in the training data?
  • Performance thresholds: Define minimum acceptable performance metrics (accuracy, precision, recall, F1, AUC) before development begins. The model must meet these thresholds before advancing.
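
To make the segment and threshold checks concrete, here is a small sketch that computes accuracy per segment and lists the segments that fall below an agreed threshold. The segment names, the holdout results, and the 0.80 threshold are illustrative assumptions; in practice the metric and threshold come from what was agreed before development began.

```python
from collections import defaultdict

def accuracy_by_segment(examples):
    """examples: iterable of (segment, y_true, y_pred). Returns {segment: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, y_true, y_pred in examples:
        total[segment] += 1
        correct[segment] += int(y_true == y_pred)
    return {seg: correct[seg] / total[seg] for seg in total}

def failing_segments(examples, threshold):
    """Return the segments whose accuracy is below the agreed threshold."""
    scores = accuracy_by_segment(examples)
    return sorted(seg for seg, acc in scores.items() if acc < threshold)

# Hypothetical holdout results: 20 staple products, 5 seasonal products.
holdout = (
    [("staple", 1, 1)] * 18 + [("staple", 1, 0)] * 2
    + [("seasonal", 1, 1)] * 3 + [("seasonal", 0, 1)] * 2
)

overall = sum(t == p for _, t, p in holdout) / len(holdout)
by_segment = accuracy_by_segment(holdout)
below = failing_segments(holdout, threshold=0.80)

print(overall)     # overall accuracy clears the threshold
print(by_segment)  # but one segment does not
print(below)
```

Here the overall accuracy is 0.84, which clears the threshold, while the seasonal segment scores 0.60 and is reported in `below`. A gate that checks only the average would have let this model through.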

Fairness and bias testing:

  • Disparate impact analysis: Does the model produce different outcomes for different protected groups (race, gender, age, etc.)?
  • Equalized odds: Does the model have similar false positive and false negative rates across groups?
  • Bias metrics: Calculate relevant fairness metrics for the application domain.
  • Mitigation: If bias is detected, implement mitigation strategies (resampling, reweighting, fairness constraints).
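
The first two checks above reduce to simple ratios. The sketch below computes a disparate impact ratio (each group's positive-prediction rate divided by a reference group's rate) and the per-group false positive and false negative rates used in an equalized odds comparison. The group names and data are made up, and the 0.8 cutoff is the commonly cited four-fifths rule of thumb rather than a threshold the article prescribes. The code assumes binary labels and that each group contains both positive and negative examples.

```python
def selection_rate(preds):
    """Share of positive (1) predictions."""
    return sum(preds) / len(preds)

def disparate_impact(preds_by_group, reference):
    """Ratio of each group's selection rate to the reference group's rate."""
    ref_rate = selection_rate(preds_by_group[reference])
    return {g: selection_rate(p) / ref_rate for g, p in preds_by_group.items()}

def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    return fp / negatives, fn / positives

# Hypothetical data: both groups share the same true labels.
labels = [1, 1, 0, 0, 1, 0, 1, 0]
preds = {
    "group_a": [1, 1, 0, 1, 1, 0, 1, 0],
    "group_b": [1, 0, 0, 0, 0, 0, 1, 0],
}

ratios = disparate_impact(preds, reference="group_a")
fpr_a, fnr_a = error_rates(labels, preds["group_a"])
fpr_b, fnr_b = error_rates(labels, preds["group_b"])
flagged = [g for g, r in ratios.items() if r < 0.8]  # four-fifths rule of thumb

print(ratios, flagged)
print((fpr_a, fnr_a), (fpr_b, fnr_b))
```

In this sample, group_b receives positive predictions at 0.4 times the rate of group_a and has a false negative rate of 0.5 against 0.0 for group_a, so both checks point at the same group. Libraries such as IBM AI Fairness 360 provide these metrics and many more; the point of the sketch is only to show what is being measured.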

Robustness testing:

  • Adversarial testing: Test model behavior when inputs are intentionally perturbed or manipulated.
  • Noise resilience: How does model performance degrade when input data has noise or errors?
  • Distribution shift: How does the model perform when the input distribution changes from what it was trained on?
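
Noise resilience can be measured with a simple sweep: add increasing amounts of random noise to the inputs and record how accuracy changes. The sketch below assumes numeric features and Gaussian noise, and uses a hypothetical `toy_model` as a stand-in for a real model's predict function. The noise levels are arbitrary examples.

```python
import random

def accuracy(model, inputs, labels):
    """Fraction of inputs for which the model's prediction matches the label."""
    return sum(model(x) == y for x, y in zip(inputs, labels)) / len(labels)

def noise_sweep(model, inputs, labels, noise_levels, seed=0):
    """Accuracy at each noise level (standard deviation added to every feature)."""
    rng = random.Random(seed)  # fixed seed so the sweep is repeatable
    results = {}
    for sigma in noise_levels:
        noisy = [[v + rng.gauss(0.0, sigma) for v in x] for x in inputs]
        results[sigma] = accuracy(model, noisy, labels)
    return results

def toy_model(x):
    """Hypothetical stand-in for a trained classifier."""
    return int(x[0] + x[1] > 1.0)

inputs = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.3], [0.7, 0.9], [0.2, 0.1], [0.8, 0.6]]
labels = [toy_model(x) for x in inputs]  # clean predictions serve as the reference

sweep = noise_sweep(toy_model, inputs, labels, noise_levels=[0.0, 0.1, 0.5])
for sigma, acc in sweep.items():
    print(f"noise sigma={sigma}: accuracy={acc:.2f}")
```

With zero noise the accuracy is 1.0 by construction, so the interesting part is how quickly it falls as sigma grows. Agreeing on an acceptable drop at a realistic noise level turns this from an experiment into a pass/fail robustness check.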

Explainability review:

  • Feature importance: Which features drive the model's predictions? Are they reasonable and defensible?
  • Individual predictions: For high-stakes applications, can individual predictions be explained to stakeholders?
  • Consistency with domain knowledge: Do the model's patterns align with what domain experts expect?

Layer 3: System Quality

The model is just one component of the overall system. System quality ensures everything works together.

Integration testing:

  • Does the model integrate correctly with upstream data systems?
  • Does the model's output integrate correctly with downstream systems?
  • Are API contracts correct and complete?
  • Does error handling work as expected?

Performance testing:

  • Latency: Does the system respond within acceptable time limits?
  • Throughput: Can the system handle expected data volumes?
  • Scalability: Does performance degrade gracefully under load?
  • Resource utilization: Are compute, memory, and storage usage within expected bounds?
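
For the latency check, averages hide the slow requests that users actually notice, so it is common to test a high percentile against the budget instead. The sketch below uses a nearest-rank percentile and a hypothetical 50 ms p95 budget; the recorded latencies are invented, and `measure_latencies_ms` is an illustrative helper for timing any callable, not part of a specific load-testing tool.

```python
import math
import time

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def measure_latencies_ms(fn, payloads):
    """Call fn once per payload and return each call's latency in milliseconds."""
    latencies = []
    for payload in payloads:
        start = time.perf_counter()
        fn(payload)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

def within_budget(latencies_ms, p95_budget_ms):
    """True if the 95th percentile latency is within the agreed budget."""
    return percentile(latencies_ms, 95) <= p95_budget_ms

# Hypothetical latencies (ms) from a test run: mostly fast, with two slow calls.
recorded = [12, 15, 11, 14, 13, 16, 12, 18, 90, 13,
            14, 15, 12, 11, 17, 13, 14, 16, 12, 250]

print(percentile(recorded, 50))                  # median looks healthy
print(percentile(recorded, 95))                  # tail tells a different story
print(within_budget(recorded, p95_budget_ms=50))
```

The median of this sample is 14 ms, but the 95th percentile is 90 ms, so the run fails a 50 ms p95 budget even though a typical request is fast.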

Security testing:

  • Are API endpoints secured with proper authentication and authorization?
  • Is data encrypted in transit and at rest?
  • Are there input validation protections against injection attacks?
  • Is access logging implemented?

Monitoring and alerting:

  • Are monitoring dashboards set up for key system metrics?
  • Are alerts configured for performance degradation, errors, and anomalies?
  • Is model performance monitoring in place to detect drift?
  • Are there runbooks for common alert scenarios?
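
One widely used way to detect drift is the Population Stability Index (PSI), which compares how a feature is distributed across bins in the training data and in production. The sketch below is a minimal version under stated assumptions: the bin edges and sample values are invented, the 0.2 alert threshold is a rule-of-thumb assumption rather than a standard, and values outside the bin edges are ignored. Tools such as Evidently AI or Arize handle this at scale; the code only shows the calculation.

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins."""
    def proportions(values):
        counts = [0] * (len(bin_edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                in_bin = bin_edges[i] <= v < bin_edges[i + 1]
                if in_bin or (is_last and v == bin_edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / total, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values: training is evenly spread, production skews high.
edges = [0, 25, 50, 75, 100]
training = [10] * 5 + [30] * 5 + [60] * 5 + [90] * 5
production = [10] * 2 + [30] * 4 + [60] * 6 + [90] * 8

score = psi(training, production, edges)
ALERT_THRESHOLD = 0.2  # assumed rule of thumb; tune per feature and use case
print(f"PSI = {score:.3f}, alert = {score > ALERT_THRESHOLD}")
```

Identical distributions give a PSI of zero, and the score grows as production moves away from training. Running a check like this on a schedule, per feature, gives the drift alert something concrete to fire on.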

Layer 4: Process Quality

Process quality ensures that the development and delivery process itself produces consistent results.

Code review: Every code change should be reviewed by at least one other engineer before merging. The reviewer checks:

  • Code quality: Is it clean, readable, and maintainable?
  • Logic correctness: Does it do what it is supposed to do?
  • Edge cases: Are edge cases handled?
  • Test coverage: Are there appropriate tests?
  • Security: Are there any security concerns?
  • Documentation: Is the code adequately documented?

Documentation review: All client-facing documentation should be reviewed for:

  • Technical accuracy
  • Completeness
  • Clarity for the intended audience
  • Consistency with prior deliverables
  • Professional formatting and presentation

Deliverable review: Before any deliverable goes to the client, it should be reviewed by someone who did not create it. The reviewer asks:

  • Does it meet the acceptance criteria defined in the project plan?
  • Is it complete?
  • Is the quality consistent with agency standards?
  • Is it presented professionally?

The QA Process

Quality Gates

Quality gates are defined checkpoints in the project lifecycle where work must meet specific criteria before advancing.

Gate 1: Data readiness

  • Data has been profiled and quality assessed
  • Data quality issues are documented with remediation plan
  • Training, validation, and test datasets are created
  • Data documentation is complete

Gate 2: Baseline model

  • Baseline model is trained and evaluated
  • Performance metrics meet minimum thresholds
  • Initial fairness assessment is complete
  • Approach is validated by senior technical reviewer

Gate 3: Model ready for integration

  • Final model meets all performance thresholds
  • Fairness and bias testing is complete
  • Robustness testing is complete
  • Model documentation is complete
  • Senior technical review is passed

Gate 4: System ready for UAT

  • All integration points are tested
  • Performance testing is complete
  • Security review is passed
  • Monitoring and alerting are configured
  • System documentation is complete

Gate 5: Production ready

  • UAT is complete and client has signed off
  • All defects are resolved or accepted
  • Deployment runbook is complete and reviewed
  • Rollback procedure is tested
  • Production monitoring is validated
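
Gates are easier to enforce when the criteria live in a checklist that a script can evaluate. The sketch below is one possible representation, not a prescribed tool: the `Gate` class and `first_blocking_gate` function are hypothetical names, and the criteria are copied from the first two gates above with made-up completion states.

```python
from dataclasses import dataclass, field

@dataclass
class Gate:
    name: str
    criteria: dict = field(default_factory=dict)  # criterion text -> True if met

    def unmet(self):
        """Return the criteria that are not yet satisfied."""
        return [text for text, done in self.criteria.items() if not done]

def first_blocking_gate(gates):
    """Return (gate_name, unmet_criteria) for the first gate not fully met, else None."""
    for gate in gates:
        missing = gate.unmet()
        if missing:
            return gate.name, missing
    return None

# Hypothetical project status.
gates = [
    Gate("Gate 1: Data readiness", {
        "Data has been profiled and quality assessed": True,
        "Data quality issues are documented with remediation plan": True,
        "Training, validation, and test datasets are created": True,
        "Data documentation is complete": True,
    }),
    Gate("Gate 2: Baseline model", {
        "Baseline model is trained and evaluated": True,
        "Performance metrics meet minimum thresholds": True,
        "Initial fairness assessment is complete": False,
        "Approach is validated by senior technical reviewer": True,
    }),
]

blocker = first_blocking_gate(gates)
print(blocker)
```

Because gates are evaluated in order, the project is reported as blocked at Gate 2 on the fairness assessment, and later gates are not considered until that item is closed.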

QA Roles and Responsibilities

  • Every engineer: Responsible for code quality, unit testing, and self-review
  • Peer reviewer: Responsible for code review and spot-checking test coverage
  • Technical lead: Responsible for architecture review, quality gate approval, and overall technical quality
  • Project manager: Responsible for process quality, ensuring QA steps are followed, and managing the quality gate schedule
  • QA specialist (if you have one): Responsible for test strategy, integration testing, and UAT coordination

QA Tooling

Automated testing:

  • Unit testing frameworks (pytest, jest)
  • Integration testing frameworks
  • CI/CD pipelines that run tests automatically on every code change
  • Model evaluation pipelines that calculate performance metrics automatically

Model quality tools:

  • Experiment tracking (MLflow, Weights & Biases)
  • Fairness assessment (IBM AI Fairness 360, Google What-If Tool)
  • Model monitoring (Evidently AI, Arize)

Code quality tools:

  • Linting (flake8, pylint, ESLint)
  • Formatting (black, prettier)
  • Static analysis (SonarQube, CodeClimate)

Measuring Quality

Quality metrics to track:

  • Defect rate: Defects found per deliverable or per sprint. Track trends over time.
  • Rework rate: Percentage of total hours spent on rework. Target: under 15%.
  • Test coverage: Percentage of code covered by automated tests. Target: 70%+.
  • Quality gate pass rate: Percentage of quality gates passed on first attempt. Target: 80%+.
  • Client-reported defects: Defects found by the client after delivery. Target: fewer than 2 per project.
  • Post-deployment incidents: Incidents in production caused by quality issues. Target: fewer than 1 per project.
  • Client satisfaction with quality: Survey score specific to deliverable quality. Target: 8+/10.
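
Several of these metrics are simple ratios, so a scorecard can be computed directly from project records. The sketch below checks one hypothetical project against the targets listed above; the dictionary keys and the project numbers are invented for illustration, and defect rate is left out because it is tracked as a trend rather than against a fixed target.

```python
def rework_rate(rework_hours, total_hours):
    """Share of total hours spent on rework."""
    return rework_hours / total_hours

def first_pass_rate(gate_attempts):
    """gate_attempts: attempts needed per gate; 1 means passed on the first try."""
    return sum(1 for n in gate_attempts if n == 1) / len(gate_attempts)

def scorecard(project):
    """Return {metric: True if the target from the list above is met}."""
    return {
        "rework_rate": rework_rate(project["rework_hours"], project["total_hours"]) < 0.15,
        "test_coverage": project["test_coverage"] >= 0.70,
        "gate_first_pass": first_pass_rate(project["gate_attempts"]) >= 0.80,
        "client_defects": project["client_defects"] < 2,
        "incidents": project["incidents"] < 1,
        "satisfaction": project["satisfaction"] >= 8,
    }

# Hypothetical project record.
project = {
    "rework_hours": 90,
    "total_hours": 500,
    "test_coverage": 0.74,
    "gate_attempts": [1, 1, 2, 1, 1],  # Gate 3 needed a second attempt
    "client_defects": 1,
    "incidents": 0,
    "satisfaction": 8.5,
}

results = scorecard(project)
missed = [metric for metric, ok in results.items() if not ok]
print(missed)
```

This project meets every target except rework rate, which is 18% against a target of under 15%. Feeding the same function from every completed project is a straightforward starting point for a quality dashboard.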

Your Next Step

This week:

  • Review your last three completed projects. How many quality issues were found after delivery? What was the cost of fixing them?
  • Check whether your current projects have defined quality gates. If not, add them.
  • Verify that code review is happening on all projects, consistently.

This month:

  • Define quality gates for your standard project lifecycle and document the criteria for each.
  • Implement automated testing in your CI/CD pipeline if you do not already have it.
  • Create a data quality checklist and use it at the start of every project.

This quarter:

  • Build a quality metrics dashboard and review it monthly.
  • Implement fairness and bias testing as a standard step for all models.
  • Conduct a quality retrospective across recent projects to identify patterns.
  • Invest in QA tooling where manual processes are creating bottlenecks or inconsistencies.

Quality is not a phase; it is a practice woven through everything you do. The cheapest defect to fix is the one you prevent. The most expensive is the one the client finds in production. Build quality into your process from the start, and you will spend less time fixing problems and more time creating value.
