Quality Assurance Playbook for AI Delivery — Building Quality Into Every Project

A 30-person AI agency in San Francisco delivered a demand forecasting model to a retail client. The model passed all internal tests and performed well on the test dataset. Two weeks after deployment, the client reported that the model's predictions were wildly inaccurate for 15% of their product catalog. The investigation revealed that the training data had a systematic gap — seasonal products were underrepresented because the training window did not cover a full seasonal cycle. The issue would have been caught by a thorough data quality review during development, but the team had rushed past that step to meet a tight deadline. The cost of fixing the model, retraining on expanded data, and managing the client relationship damage was $78,000 — more than the original budget overrun they were trying to avoid by cutting corners.

Quality assurance in AI delivery is fundamentally different from traditional software QA. AI systems are probabilistic rather than deterministic: their outputs are statistical estimates, and some systems do not return the same output for the same input. They depend on data quality, which is variable and often outside your control. And they can fail silently, producing plausible but incorrect results that go undetected until they cause real-world damage. Building quality into AI delivery requires an approach that addresses data quality, model quality, system quality, and process quality.

The Quality Assurance Framework

Layer 1: Data Quality

Data quality is the foundation of AI quality. Every model is only as good as the data it was trained on.

Data quality dimensions:

Completeness: Are there missing values, gaps in time periods, or absent categories?
Accuracy: Does the data correctly represent reality? Are labels correct?
Consistency: Is the data internally consistent? Do related fields agree?
Timeliness: Is the data current enough for the intended use?
Representativeness: Does the data reflect the full distribution of real-world inputs, including edge cases and minority categories?
Bias: Does the data contain systematic biases that could lead to unfair or inaccurate model behavior?

Data quality checks:

Profiling: Before starting model development, profile the data — distributions, missing values, outliers, correlations. Document findings and share with the client.
Validation rules: Define and implement automated checks for data quality dimensions (null checks, range checks, consistency checks, format checks).
Sample review: Manually review a random sample of records for accuracy and labeling quality.
Drift monitoring: Compare training data distributions to production data distributions to detect when the real-world data diverges from what the model was trained on.
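
The validation rules and the completeness dimension above can be expressed as small automated checks. The following is a minimal sketch, assuming records arrive as a list of dictionaries. The field names (sku, units, sold_on) and the helper names are hypothetical and not taken from any specific library. The month-coverage check is one simple way to flag a training window that does not span a full seasonal cycle.

```python
from datetime import date


def check_records(records, required, ranges):
    """Return a list of human-readable data quality issues."""
    issues = []
    for i, row in enumerate(records):
        # Null check: every required field must be present and not None.
        for field in required:
            if row.get(field) is None:
                issues.append(f"row {i}: missing {field}")
        # Range check: numeric fields must fall inside [lo, hi].
        for field, (lo, hi) in ranges.items():
            value = row.get(field)
            if value is not None and not (lo <= value <= hi):
                issues.append(f"row {i}: {field}={value} outside [{lo}, {hi}]")
    return issues


def missing_months(records, date_field):
    """Return calendar months (1-12) with no records at all."""
    seen = {row[date_field].month for row in records
            if row.get(date_field) is not None}
    return sorted(set(range(1, 13)) - seen)


# Hypothetical sample data, for illustration only.
sample = [
    {"sku": "A1", "units": 12, "sold_on": date(2023, 1, 15)},
    {"sku": "A2", "units": -3, "sold_on": date(2023, 2, 10)},
    {"sku": None, "units": 7, "sold_on": date(2023, 3, 5)},
]

issues = check_records(sample, required=["sku", "units"],
                       ranges={"units": (0, 10_000)})
gaps = missing_months(sample, "sold_on")
print(issues)
print("months with no data:", gaps)
```

Checks like these are cheap to run on every data delivery, so they fit naturally into a CI job or a scheduled pipeline step.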

Layer 2: Model Quality

Model quality encompasses performance, fairness, robustness, and explainability.

Model performance testing:

Holdout test set: Evaluate model performance on a dataset that was not used during training. This is the minimum quality gate for any model.
Cross-validation: For smaller datasets, use k-fold cross-validation to get a more reliable performance estimate.
Performance across segments: Evaluate model performance across different data segments (e.g., customer types, geographic regions, product categories). A model that performs well on average but poorly for specific segments may not be acceptable.
Edge case testing: Test model behavior on unusual or extreme inputs. What happens with missing features, outlier values, or combinations that are rare in the training data?
Performance thresholds: Define minimum acceptable performance metrics (accuracy, precision, recall, F1, AUC) before development begins. The model must meet these thresholds before advancing.
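
To illustrate why segment-level evaluation matters, here is a minimal sketch that computes accuracy per segment and compares each segment against a threshold. The labels, predictions, segment names, and the 0.7 threshold are invented for the example, and the function names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_segment(y_true, y_pred, segments):
    """Return {segment: accuracy} for parallel lists of labels and predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, seg in zip(y_true, y_pred, segments):
        total[seg] += 1
        correct[seg] += int(truth == pred)
    return {seg: correct[seg] / total[seg] for seg in total}


def failing_segments(scores, threshold):
    """Return the segments whose score is below the agreed threshold."""
    return sorted(seg for seg, score in scores.items() if score < threshold)


# Illustrative data: the model looks fine overall but fails one segment.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
segments = ["core"] * 5 + ["seasonal"] * 3

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = accuracy_by_segment(y_true, y_pred, segments)

print("overall:", overall)
print("by segment:", scores)
print("below threshold:", failing_segments(scores, threshold=0.7))
```

In this toy data the overall accuracy clears the threshold while the seasonal segment does not, which is the pattern an average-only report would hide.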

Fairness and bias testing:

Disparate impact analysis: Does the model produce different outcomes for different protected groups (race, gender, age, etc.)?
Equalized odds: Does the model have similar false positive and false negative rates across groups?
Bias metrics: Calculate relevant fairness metrics for the application domain.
Mitigation: If bias is detected, implement mitigation strategies (resampling, reweighting, fairness constraints).
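
The first two checks above can be computed directly from predictions. The sketch below is one possible formulation, assuming binary labels and predictions (0 or 1) and two groups named "a" and "b" for illustration. It computes a disparate impact ratio (the ratio of positive-prediction rates between groups) and the gaps in false positive and false negative rates. A ratio below 0.8 is a commonly used rule of thumb for flagging disparate impact, not a universal standard.

```python
def selection_rate(y_pred):
    """Share of positive predictions."""
    return sum(y_pred) / len(y_pred)


def subset(values, groups, group):
    """Values belonging to one group."""
    return [v for v, g in zip(values, groups) if g == group]


def disparate_impact_ratio(y_pred, groups, protected, reference):
    """Selection rate of the protected group divided by the reference group's."""
    return (selection_rate(subset(y_pred, groups, protected))
            / selection_rate(subset(y_pred, groups, reference)))


def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate).

    Assumes each group contains at least one positive and one negative label.
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    return fp / negatives, fn / positives


def equalized_odds_gaps(y_true, y_pred, groups, a, b):
    """Absolute differences in FPR and FNR between two groups."""
    fpr_a, fnr_a = error_rates(subset(y_true, groups, a), subset(y_pred, groups, a))
    fpr_b, fnr_b = error_rates(subset(y_true, groups, b), subset(y_pred, groups, b))
    return abs(fpr_a - fpr_b), abs(fnr_a - fnr_b)


# Illustrative data only.
groups = ["a"] * 4 + ["b"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

ratio = disparate_impact_ratio(y_pred, groups, protected="b", reference="a")
gaps = equalized_odds_gaps(y_true, y_pred, groups, "a", "b")
print("disparate impact ratio:", ratio)
print("FPR gap, FNR gap:", gaps)
```

Which metric matters most depends on the application domain, so treat these two as a starting point rather than a complete fairness assessment.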

Robustness testing:

Adversarial testing: Test model behavior when inputs are intentionally perturbed or manipulated.
Noise resilience: How does model performance degrade when input data has noise or errors?
Distribution shift: How does the model perform when the input distribution changes from what it was trained on?
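
A noise resilience test can be as simple as re-scoring the model on perturbed inputs at increasing noise levels. The sketch below assumes a single numeric feature and uses a toy threshold rule as a stand-in for a trained model. The function names, noise levels, and data are hypothetical.

```python
import random


def toy_model(x):
    """Hypothetical stand-in for a trained classifier: predicts 1 when x > 0.5."""
    return 1 if x > 0.5 else 0


def accuracy_under_noise(model, inputs, labels, noise_std, seed=0):
    """Accuracy after adding Gaussian noise with the given standard deviation."""
    rng = random.Random(seed)  # fixed seed keeps the test reproducible
    correct = 0
    for x, y in zip(inputs, labels):
        noisy = x + rng.gauss(0.0, noise_std)
        correct += int(model(noisy) == y)
    return correct / len(inputs)


# Clean evaluation set, labelled by the toy rule itself.
inputs = [i / 100 for i in range(100)]
labels = [toy_model(x) for x in inputs]

# Degradation curve: accuracy at each noise level.
curve = {std: accuracy_under_noise(toy_model, inputs, labels, std)
         for std in (0.0, 0.05, 0.2)}
for std, acc in curve.items():
    print(f"noise_std={std}: accuracy={acc:.2f}")
```

Plotting or tabulating this curve makes it easier to agree with the client on how much degradation is acceptable at realistic noise levels.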

Explainability review:

Feature importance: Which features drive the model's predictions? Are they reasonable and defensible?
Individual predictions: For high-stakes applications, can individual predictions be explained to stakeholders?
Consistency with domain knowledge: Do the model's patterns align with what domain experts expect?

Layer 3: System Quality

The model is just one component of the overall system. System quality ensures everything works together.

Integration testing:

Does the model integrate correctly with upstream data systems?
Does the model's output integrate correctly with downstream systems?
Are API contracts correct and complete?
Does error handling work as expected?

Performance testing:

Latency: Does the system respond within acceptable time limits?
Throughput: Can the system handle expected data volumes?
Scalability: Does performance degrade gracefully under load?
Resource utilization: Is compute, memory, and storage usage within expected bounds?
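
For the latency check, teams often compare a high percentile rather than the average against the agreed limit. The following is a rough sketch that times a prediction callable and computes a nearest-rank percentile. The fake_predict function, the payloads, and the 50 ms budget are placeholders, and a real load test would exercise the deployed endpoint rather than an in-process function.

```python
import time


def measure_latencies_ms(predict, payloads):
    """Call predict once per payload and return wall-clock latencies in ms."""
    latencies = []
    for payload in payloads:
        start = time.perf_counter()
        predict(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def within_budget(latencies_ms, pct, budget_ms):
    """True when the chosen percentile is at or under the latency budget."""
    return percentile(latencies_ms, pct) <= budget_ms


def fake_predict(payload):
    """Placeholder for a real model call."""
    return sum(payload)


latencies = measure_latencies_ms(fake_predict, [[1, 2, 3]] * 200)
print("p95 (ms):", percentile(latencies, 95))
print("within 50 ms budget:", within_budget(latencies, 95, budget_ms=50.0))
```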

Security testing:

Are API endpoints secured with proper authentication and authorization?
Is data encrypted in transit and at rest?
Are there input validation protections against injection attacks?
Is access logging implemented?

Monitoring and alerting:

Are monitoring dashboards set up for key system metrics?
Are alerts configured for performance degradation, errors, and anomalies?
Is model performance monitoring in place to detect drift?
Are there runbooks for common alert scenarios?
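
One common way to put a number on drift is the Population Stability Index (PSI), which compares how a feature is distributed across bins in training data versus production data. The sketch below is a simplified version, assuming a single numeric feature and hand-picked bin edges. The data, edges, and function name are illustrative. An often-quoted rule of thumb treats values above roughly 0.25 as significant shift, but the alert threshold should be set per project.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    edges are the interior bin boundaries; eps avoids log(0) for empty bins.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v > e for e in edges)  # index of the bin v falls into
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    exp_shares = shares(expected)
    act_shares = shares(actual)
    return sum((a - e) * math.log(a / e)
               for e, a in zip(exp_shares, act_shares))


# Illustrative data: production values have shifted upward.
train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
prod = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
edges = [3.5, 7.5]

no_drift = psi(train, train, edges)
drift = psi(train, prod, edges)
print("PSI train vs train:", no_drift)
print("PSI train vs prod:", drift)
```

A scheduled job that computes this per feature and raises an alert above the chosen threshold covers the drift question in the checklist above.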

Layer 4: Process Quality

Process quality ensures that the development and delivery process itself produces consistent results.

Code review: Every code change should be reviewed by at least one other engineer before merging. Reviewers should check:

Code quality: Is it clean, readable, and maintainable?
Logic correctness: Does it do what it is supposed to do?
Edge cases: Are edge cases handled?
Test coverage: Are there appropriate tests?
Security: Are there any security concerns?
Documentation: Is the code adequately documented?

Documentation review: All client-facing documentation should be reviewed for:

Technical accuracy
Completeness
Clarity for the intended audience
Consistency with prior deliverables
Professional formatting and presentation

Deliverable review: Before any deliverable goes to the client, it should be reviewed by someone who did not create it. The reviewer should ask:

Does it meet the acceptance criteria defined in the project plan?
Is it complete?
Is the quality consistent with agency standards?
Is it presented professionally?

The QA Process

Quality Gates

Quality gates are defined checkpoints in the project lifecycle where work must meet specific criteria before advancing.

Gate 1: Data readiness

Data has been profiled and quality assessed
Data quality issues are documented with remediation plan
Training, validation, and test datasets are created
Data documentation is complete

Gate 2: Baseline model

Baseline model is trained and evaluated
Performance metrics meet minimum thresholds
Initial fairness assessment is complete
Approach is validated by senior technical reviewer

Gate 3: Model ready for integration

Final model meets all performance thresholds
Fairness and bias testing is complete
Robustness testing is complete
Model documentation is complete
Senior technical review is passed

Gate 4: System ready for UAT

All integration points are tested
Performance testing is complete
Security review is passed
Monitoring and alerting are configured
System documentation is complete

Gate 5: Production ready

UAT is complete and client has signed off
All defects are resolved or accepted
Deployment runbook is complete and reviewed
Rollback procedure is tested
Production monitoring is validated
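
Quality gates are easier to enforce when each gate is recorded as an explicit checklist. Here is a minimal sketch of one way to model that, assuming each criterion is tracked as a simple true or false value. The class and function names are hypothetical, and the criteria shown are copied from the first two gates above.

```python
from dataclasses import dataclass


@dataclass
class QualityGate:
    name: str
    criteria: dict  # criterion text -> True when satisfied

    def unmet(self):
        """Criteria that are not yet satisfied."""
        return [text for text, ok in self.criteria.items() if not ok]

    def passed(self):
        return not self.unmet()


def first_blocking_gate(gates):
    """Return the earliest gate that has not passed, or None if all passed."""
    for gate in gates:
        if not gate.passed():
            return gate
    return None


gates = [
    QualityGate("Gate 1: Data readiness", {
        "Data has been profiled and quality assessed": True,
        "Data quality issues are documented with remediation plan": True,
        "Training, validation, and test datasets are created": True,
        "Data documentation is complete": True,
    }),
    QualityGate("Gate 2: Baseline model", {
        "Baseline model is trained and evaluated": True,
        "Performance metrics meet minimum thresholds": True,
        "Initial fairness assessment is complete": False,
        "Approach is validated by senior technical reviewer": False,
    }),
]

blocking = first_blocking_gate(gates)
if blocking is not None:
    print(blocking.name, "is blocked by:", blocking.unmet())
```

Keeping the checklist in version control alongside the project gives the technical lead an auditable record of when each gate was approved.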

QA Roles and Responsibilities

Every engineer: Responsible for code quality, unit testing, and self-review
Peer reviewer: Responsible for code review, spot-checking test coverage
Technical lead: Responsible for architecture review, quality gate approval, and overall technical quality
Project manager: Responsible for process quality, ensuring QA steps are followed, managing quality gate schedule
QA specialist (if you have one): Responsible for test strategy, integration testing, and UAT coordination

QA Tooling

Automated testing:

Unit testing frameworks (pytest, jest)
Integration testing frameworks
CI/CD pipelines that run tests automatically on every code change
Model evaluation pipelines that calculate performance metrics automatically

Model quality tools:

Experiment tracking (MLflow, Weights & Biases)
Fairness assessment (IBM AI Fairness 360, Google What-If Tool)
Model monitoring (Evidently AI, Arize)

Code quality tools:

Linting (flake8, pylint, ESLint)
Formatting (black, prettier)
Static analysis (SonarQube, CodeClimate)

Measuring Quality

Quality metrics to track:

Defect rate: Defects found per deliverable or per sprint. Track trends over time.
Rework rate: Percentage of total hours spent on rework. Target: under 15%.
Test coverage: Percentage of code covered by automated tests. Target: 70%+.
Quality gate pass rate: Percentage of quality gates passed on first attempt. Target: 80%+.
Client-reported defects: Defects found by the client after delivery. Target: fewer than 2 per project.
Post-deployment incidents: Incidents in production caused by quality issues. Target: fewer than 1 per project.
Client satisfaction with quality: Survey score specific to deliverable quality. Target: 8+/10.
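
The targets above can be checked mechanically once the raw numbers are collected. The sketch below assumes metrics are gathered per project into a dictionary. The key names and the sample figures are invented for illustration, while the thresholds mirror the targets listed above.

```python
def rework_rate(rework_hours, total_hours):
    """Share of total hours spent on rework."""
    return rework_hours / total_hours


def gate_pass_rate(first_attempt_results):
    """Share of quality gates passed on the first attempt."""
    return sum(first_attempt_results) / len(first_attempt_results)


def missed_targets(metrics):
    """Return the names of metrics that miss the targets listed above."""
    checks = {
        "rework_rate": metrics["rework_rate"] < 0.15,
        "test_coverage": metrics["test_coverage"] >= 0.70,
        "gate_pass_rate": metrics["gate_pass_rate"] >= 0.80,
        "client_reported_defects": metrics["client_reported_defects"] < 2,
        "post_deployment_incidents": metrics["post_deployment_incidents"] < 1,
        "client_satisfaction": metrics["client_satisfaction"] >= 8,
    }
    return sorted(name for name, ok in checks.items() if not ok)


# Hypothetical figures for a single project.
project = {
    "rework_rate": rework_rate(rework_hours=42, total_hours=400),
    "test_coverage": 0.64,
    "gate_pass_rate": gate_pass_rate([True, True, False, True, True]),
    "client_reported_defects": 1,
    "post_deployment_incidents": 0,
    "client_satisfaction": 8.5,
}

misses = missed_targets(project)
print("targets missed:", misses)
```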

Your Next Step

This week:

Review your last three completed projects. How many quality issues were found after delivery? What was the cost of fixing them?
Check whether your current projects have defined quality gates. If not, add them.
Verify that code review is happening on all projects, consistently.

This month:

Define quality gates for your standard project lifecycle and document the criteria for each.
Implement automated testing in your CI/CD pipeline if you do not already have it.
Create a data quality checklist and use it at the start of every project.

This quarter:

Build a quality metrics dashboard and review it monthly.
Implement fairness and bias testing as a standard step for all models.
Conduct a quality retrospective across recent projects to identify patterns.
Invest in QA tooling where manual processes are creating bottlenecks or inconsistencies.

Quality is not a phase — it is a practice woven through everything you do. The cheapest defect to fix is the one you prevent. The most expensive is the one the client finds in production. Build quality into your process from the start, and you will spend less time fixing problems and more time creating value.

Agency Script Editorial
