An MLOps-focused agency in Portland managed a fraud detection system for an online marketplace. The system processed 4.2 million transactions daily with 91% precision at a 2% false positive rate. One Monday morning, the client called to report a surge in fraudulent transactions slipping through. The agency investigated and found that a pipeline failure three weeks earlier had caused a key feature, the buyer's 90-day transaction velocity, to stop updating. The feature values were frozen at their three-week-old values, but the pipeline did not error out because the stale values were still valid numbers. The model continued making predictions using stale data, and its real-world precision had degraded from 91% to 74% without any monitoring alert triggering. The root cause was a silent failure in a join operation that started returning empty results after a schema change in an upstream data source. The agency had unit tests for the model and integration tests for the API, but no tests for the data pipeline that connected them. That gap cost the client an estimated $840,000 in undetected fraud.
Testing AI data pipelines is the practice of systematically verifying that the data flowing into your ML models is correct, complete, fresh, and consistent. For AI agencies, pipeline testing is arguably more important than model testing: a perfect model fed bad data produces bad predictions, and pipeline failures are far more common than model failures in production systems.
Why Pipeline Testing Is Different
Traditional Software vs. Data Pipeline Testing
Traditional software tests verify that code produces the correct output for a given input. Data pipeline tests verify that data meets expected properties, and the "expected properties" change as the real world changes.
Key differences:
- Non-determinism: Data pipeline outputs depend on external data sources that change continuously. On live data you usually cannot write a test that expects a specific output value; you test properties of the output instead.
- Schema evolution: Upstream data sources change their schemas without warning. A column renamed, a data type changed, or a new null pattern introduced can break a pipeline silently.
- Volume sensitivity: A pipeline that works on 1,000 rows may fail on 10 million rows due to memory limits, timeout thresholds, or join explosions.
- Temporal correctness: Features must be computed as of the correct point in time. A feature that accidentally looks into the future (data leakage) will produce excellent test metrics and terrible production performance.
- Silent failures: The most dangerous pipeline bugs do not cause errors; they produce plausible but incorrect data. A join that returns zero rows where it should return thousands still produces a valid (empty) result.
The Testing Pyramid for AI Pipelines
Level 1 - Schema tests: Verify that data conforms to expected schemas (correct columns, correct data types, no unexpected nulls).
Level 2 - Value tests: Verify that data values fall within expected ranges and distributions (no negative ages, no future dates, no out-of-range feature values).
Level 3 - Freshness tests: Verify that data is sufficiently recent (features were computed from today's data, not stale data from a previous run).
Level 4 - Statistical tests: Verify that data distributions have not shifted dramatically (the mean, variance, and percentiles of features are within expected bounds).
Level 5 - Cross-pipeline tests: Verify consistency across related pipelines (training features match serving features, feature values in the feature store match source-of-truth values in the data warehouse).
Level 6 - End-to-end tests: Verify that the complete pipeline, from raw data ingestion through feature computation to model prediction, produces correct outputs on known test cases.
Schema and Structure Tests
Column-Level Validation
For every dataset produced by the pipeline, validate the schema against expectations.
Schema tests:
- Column presence: Verify that all expected columns exist. A missing column indicates an upstream change or a pipeline bug.
- Column types: Verify that each column has the expected data type. A column that silently changes from integer to string will produce model inference errors.
- Column order (if order-dependent): Some systems depend on column order rather than names. Verify order has not changed.
- No unexpected columns: New columns may indicate upstream changes that warrant investigation.
Null value tests:
- Required columns: Verify that columns that should never be null have zero nulls.
- Null rate thresholds: For columns that can contain nulls, verify that the null rate is within expected bounds. A column that is normally 2% null becoming 80% null indicates a problem.
- Null patterns: Some null patterns are informative. Verify that null patterns match expectations (nulls in column A always co-occur with nulls in column B).
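The column-level checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production validator: the column names, the expected types, and the null-rate limits in `EXPECTED_SCHEMA` and `MAX_NULL_RATE` are hypothetical, and rows are assumed to arrive as dictionaries.

```python
from typing import Any

# Hypothetical expectations for an illustrative transactions dataset.
EXPECTED_SCHEMA = {"transaction_id": str, "buyer_id": str, "amount": float}
MAX_NULL_RATE = {"transaction_id": 0.0, "buyer_id": 0.0, "amount": 0.05}


def validate_schema(rows: list[dict[str, Any]]) -> list[str]:
    """Return human-readable schema and null-rate violations (empty if clean)."""
    if not rows:
        return ["dataset is empty"]

    seen: set[str] = set()
    for row in rows:
        seen.update(row)
    expected = set(EXPECTED_SCHEMA)

    problems = [f"missing column: {c}" for c in sorted(expected - seen)]
    problems += [f"unexpected column: {c}" for c in sorted(seen - expected)]

    for col in sorted(expected & seen):
        # A key absent from a row is treated the same as an explicit null.
        values = [row.get(col) for row in rows]
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > MAX_NULL_RATE[col]:
            problems.append(f"{col}: null rate {null_rate:.1%} above limit")
        # Strict type check: an int in a float column is reported too.
        expected_type = EXPECTED_SCHEMA[col]
        if any(v is not None and not isinstance(v, expected_type) for v in values):
            problems.append(f"{col}: values not of type {expected_type.__name__}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a single run report every violation, which tends to be more useful when an upstream schema change breaks several columns at once.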
Row-Level Validation
Row count tests:
- Verify that the row count is within expected bounds. A table that normally has 1 million rows suddenly having 10,000 rows indicates data loss.
- Verify row count against the source system count (the pipeline should not drop or duplicate rows).
Uniqueness tests:
- Verify that primary key columns contain unique values. Duplicate keys indicate a join or aggregation error.
- Verify that unique constraints from the business logic are maintained.
Referential integrity tests:
- Verify that foreign key relationships are valid: every customer_id in the transactions table exists in the customers table.
- Invalid references indicate stale data, mismatched update schedules, or join errors.
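As a rough sketch of the row-level checks above, the helpers below cover row count bounds, key uniqueness, and referential integrity. The function names and the 20% default tolerance are illustrative assumptions, and rows are again assumed to be dictionaries.

```python
from collections import Counter


def row_count_ok(actual: int, expected: int, tolerance: float = 0.2) -> bool:
    """True if the actual row count is within +/- tolerance of the expected count."""
    return abs(actual - expected) <= tolerance * expected


def duplicate_keys(rows: list[dict], key: str) -> list:
    """Return key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def orphaned_keys(child_rows: list[dict], parent_rows: list[dict], key: str) -> list:
    """Return key values in the child table with no match in the parent table."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)
```

A row count check placed after each join is one inexpensive way to catch the kind of silently empty join described in the opening example.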
Value and Range Tests
Statistical Bounds
For numerical features, verify that values fall within expected statistical bounds.
Static bounds:
- Age must be between 0 and 120
- Probability scores must be between 0 and 1
- Temperature must be within physically plausible ranges for the domain
- Revenue must be non-negative (or negative only for refunds)
Dynamic bounds:
- The mean of a feature should be within N standard deviations of the historical mean (accounting for expected trends)
- The maximum value should not exceed M times the historical maximum (detect outlier injection)
- The variance should be within a factor of K of the historical variance (detect both increased noise and suspiciously decreased variance)
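One possible way to express the static and dynamic bounds above is sketched below. The `History` record and the default values for N, M, and K are assumptions for illustration; in particular, the standard deviation is taken to be that of historical batch means, which is one of several reasonable readings of the first dynamic bound.

```python
import statistics
from dataclasses import dataclass


@dataclass
class History:
    """Hypothetical summary of past batches for one numerical feature."""
    mean_of_batch_means: float
    std_of_batch_means: float
    max_value: float
    variance: float


def out_of_static_bounds(values: list[float], low: float, high: float) -> list[float]:
    """Return the values that fall outside the fixed range [low, high]."""
    return [v for v in values if not (low <= v <= high)]


def dynamic_bound_violations(
    batch: list[float], hist: History, n: float = 3.0, m: float = 2.0, k: float = 4.0
) -> list[str]:
    """Return which of the mean, max, and variance checks the batch fails."""
    problems = []
    if abs(statistics.fmean(batch) - hist.mean_of_batch_means) > n * hist.std_of_batch_means:
        problems.append("mean")
    if max(batch) > m * hist.max_value:
        problems.append("max")
    batch_var = statistics.pvariance(batch)
    # Two-sided check: too much variance and suspiciously little both fail.
    if not (hist.variance / k <= batch_var <= hist.variance * k):
        problems.append("variance")
    return problems
```

The two-sided variance check is worth keeping: a batch in which every value is identical passes the mean and max checks but fails on variance.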
Distribution Tests
Statistical distribution tests:
- Kolmogorov-Smirnov test: Compare the current batch's distribution to the reference distribution. Flag if the KS statistic exceeds a threshold.
- Population Stability Index (PSI): Compare binned distributions between current and reference. By a common rule of thumb, PSI above 0.2 indicates significant distribution change.
- Chi-squared test: For categorical features, compare category frequencies against expected frequencies.
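To make the PSI concrete: with reference bin proportions r_i and current bin proportions c_i, PSI is the sum over bins of (c_i - r_i) * ln(c_i / r_i). The sketch below is a minimal pure-Python version under two assumptions that are not from the text above: equal-width bins derived from the reference data, and a small `eps` floor so that empty bins do not produce a division by zero or log of zero.

```python
import math


def psi(reference: list[float], current: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a current sample."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

Quantile-based bins are a common alternative to equal-width bins and are less sensitive to outliers in the reference sample; the choice affects the PSI value, so thresholds should be calibrated against whichever binning is used.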
When to use distribution tests:
- After each pipeline run to detect data drift or pipeline errors
- When ingesting data from a new source for the first time
- After any upstream schema or logic change
- As part of the model monitoring pipeline (distribution tests on model inputs and outputs)
Freshness and Timeliness Tests
Data Freshness Validation
Timestamp-based freshness:
- Verify that the most recent timestamp in the data is within expected bounds (within the last hour for hourly pipelines, within the last day for daily pipelines)
- Verify that there are no unexpected gaps in the timestamp sequence
- Verify that the data covers the expected time range (today's feature pipeline should process yesterday's events, not events from two weeks ago)
Processing timestamp tracking:
- Record when each pipeline stage completes
- Verify that processing completed within the expected time window
- Alert if processing is running late, because late features mean stale predictions
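The timestamp-based checks above might look like the following sketch. The function names are illustrative, and the caller is assumed to pass timezone-aware timestamps along with the allowed lag and expected interval for the pipeline in question.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(timestamps: list[datetime], now: datetime, max_lag: timedelta) -> bool:
    """True if the newest timestamp is no older than max_lag."""
    return bool(timestamps) and now - max(timestamps) <= max_lag


def find_gaps(
    timestamps: list[datetime], expected_interval: timedelta
) -> list[tuple[datetime, datetime]]:
    """Return consecutive timestamp pairs that are further apart than expected."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > expected_interval]


# Example: an hourly pipeline with a three-hour hole in its data.
now = datetime(2024, 1, 10, 12, tzinfo=timezone.utc)
hourly = [now - timedelta(hours=h) for h in (1, 2, 3, 6)]
fresh = is_fresh(hourly, now, max_lag=timedelta(hours=1))
gaps = find_gaps(hourly, expected_interval=timedelta(hours=1))
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check deterministic and easy to test.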
Feature Freshness for Serving
For real-time serving pipelines, verify that the features served to the model are fresh.
Feature freshness tests:
- Compare the timestamp of the served feature value to the current time
- Alert if any feature is more than N minutes old (where N depends on the feature's update frequency)
- Track the distribution of feature ages across serving requests; increasing ages indicate pipeline latency or failure
The stale feature problem:
Many feature stores return the most recent available value when a fresh value is not available. This means stale features do not cause errors; they silently degrade prediction quality. Explicit freshness tests are the only way to catch this.
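A serving-side freshness check could be sketched as follows, assuming the feature store returns each value together with the time it was computed. The feature names and the per-feature age budgets in `MAX_AGE` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical staleness budgets, chosen per feature from its update frequency.
MAX_AGE = {
    "txn_velocity_90d": timedelta(hours=2),
    "account_age_days": timedelta(days=2),
}


def stale_features(
    served: dict[str, tuple[float, datetime]], now: datetime
) -> dict[str, timedelta]:
    """Map each stale feature name to its age; served maps name -> (value, computed_at)."""
    stale = {}
    for name, (_value, computed_at) in served.items():
        age = now - computed_at
        if age > MAX_AGE[name]:
            stale[name] = age
    return stale
```

Logging the returned ages, not just a pass/fail flag, makes it possible to track the distribution of feature ages over time.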
Cross-Pipeline Consistency Tests
Training-Serving Skew
The most insidious AI pipeline bug is a difference between how features are computed during training and how they are computed during serving. This is called training-serving skew.
Common causes of training-serving skew:
- Code duplication: Feature computation logic is implemented separately for training (in Python/Spark) and serving (in a different language or system). Even subtle differences (rounding, null handling, timezone handling) cause skew.
- Data source differences: Training features are computed from the data warehouse (which may have different data than the real-time data sources used for serving).
- Temporal leakage: Training features accidentally include future information that is not available at serving time.
- Preprocessing differences: Text normalization, encoding, or scaling is applied differently during training and serving.
Training-serving skew detection:
- Log the features computed during serving alongside the model predictions
- Periodically recompute those same features using the training pipeline logic on the same input data
- Compare the two sets of features; they should be identical (or very close) for the same entity and timestamp
- Any discrepancy indicates training-serving skew that needs investigation
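The comparison step above can be sketched as a tolerance-based diff between logged serving features and offline recomputed features. The keying by `(entity_id, feature_name)` and the tolerance defaults are assumptions for illustration.

```python
import math
from typing import Optional

FeatureKey = tuple[str, str]  # (entity_id, feature_name)


def find_skew(
    served: dict[FeatureKey, Optional[float]],
    recomputed: dict[FeatureKey, Optional[float]],
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-9,
) -> list[tuple[FeatureKey, Optional[float], Optional[float]]]:
    """Return (key, served_value, recomputed_value) for every mismatch."""
    mismatches = []
    for key, online in served.items():
        # A key missing from the recomputed set is treated as a null value.
        offline = recomputed.get(key)
        if online is None or offline is None:
            if online is not offline:
                mismatches.append((key, online, offline))
        elif not math.isclose(online, offline, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((key, online, offline))
    return mismatches
```

Treating a null on one side and a number on the other as a mismatch is deliberate, since differences in null handling are one of the causes of skew listed above.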
Feature Store Consistency
If you use a feature store, verify consistency between the feature store and the source systems.
Consistency tests:
- For a sample of entities, compute features directly from source data and compare to feature store values
- Verify that feature store timestamps match the expected update cadence
- Check that the online store (serving) and offline store (training) contain consistent values for the same entity and timestamp
End-to-End Pipeline Tests
Known-Input Tests
Create a set of known test inputs with expected outputs and run them through the complete pipeline.
Test input design:
- Include typical inputs that should produce normal outputs
- Include edge cases: null values, extreme values, unusual combinations
- Include adversarial inputs: inputs designed to trigger known failure modes
- Update test inputs as the pipeline evolves
Expected output specification:
- For deterministic pipelines: specify exact expected outputs
- For non-deterministic pipelines: specify expected value ranges and statistical properties
- For ML pipelines: specify expected prediction ranges and confidence levels
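As an illustration of known-input tests for a deterministic stage, the sketch below defines a hypothetical `velocity_90d` feature (a count of transactions in the 90 days before an as-of date) and a small table of cases with exact expected outputs. The feature definition is an assumption made for the example, not the implementation of any real pipeline.

```python
from datetime import date, timedelta


def velocity_90d(txn_dates: list[date], as_of: date) -> int:
    """Hypothetical feature: number of transactions in the 90 days before as_of."""
    start = as_of - timedelta(days=90)
    return sum(start <= d < as_of for d in txn_dates)


AS_OF = date(2024, 6, 1)

# (case name, input transaction dates, expected feature value)
CASES = [
    ("typical buyer", [date(2024, 5, 1), date(2024, 5, 20)], 2),
    ("no history", [], 0),
    ("only old transactions", [date(2023, 1, 1)], 0),
    ("future event must be ignored", [date(2024, 6, 2)], 0),
]


def failed_cases() -> list[str]:
    """Return the names of the cases whose output differs from the expectation."""
    return [name for name, txns, expected in CASES if velocity_90d(txns, AS_OF) != expected]
```

The last case doubles as a temporal correctness check: a feature that counted the future transaction would be leaking information unavailable at serving time.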
Canary Pipeline Runs
Before processing the full production data, run the pipeline on a small sample and validate the outputs.
Canary pipeline process:
- Sample 1-5% of the input data
- Run the pipeline on the sample
- Validate outputs against all test levels (schema, value, distribution, freshness)
- If validation passes, run the pipeline on the full dataset
- If validation fails, halt the pipeline and alert the team
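The canary process above can be sketched as a thin wrapper around the pipeline. Here `pipeline` and `validate` are placeholders for whatever transformation and validation functions a given project uses; `validate` is assumed to return a list of problems, empty when the output is clean.

```python
import random
from typing import Callable


def run_with_canary(
    rows: list,
    pipeline: Callable[[list], list],
    validate: Callable[[list], list[str]],
    sample_fraction: float = 0.05,
    seed: int = 0,
) -> list:
    """Run the pipeline on a sample first; process everything only if it validates."""
    if not rows:
        raise ValueError("no input rows to process")
    sample_size = max(1, int(len(rows) * sample_fraction))
    sample = random.Random(seed).sample(rows, sample_size)

    problems = validate(pipeline(sample))
    if problems:
        # Halt before touching the full dataset; the caller is expected to alert.
        raise RuntimeError(f"canary run failed: {problems}")
    return pipeline(rows)
```

A random sample may miss rare bad records, so a canary run reduces risk rather than replacing validation of the full output.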
Implementation Tools
Great Expectations
Great Expectations is one of the most widely used open-source data testing frameworks. Define "expectations" (tests) for your data in Python, run them against your pipeline outputs, and generate validation reports.
Key features:
- Rich library of built-in expectations (column types, null rates, value ranges, uniqueness, distributions)
- Custom expectations for domain-specific tests
- Data documentation generation (automatic profiling of datasets)
- Integration with Airflow, Dagster, Prefect, and other orchestrators
dbt Tests
For SQL-based pipelines, dbt (data build tool) includes a testing framework that runs SQL-based assertions against pipeline outputs.
Built-in tests:
- unique, not_null, accepted_values, relationships
- Custom SQL tests for any assertion expressible in SQL
Soda
A data quality testing tool that supports both SQL and Python-based tests, with a focus on monitoring data quality over time.
Key features:
- Declarative test configuration in YAML
- Automated anomaly detection on data quality metrics
- Integration with common data platforms (BigQuery, Snowflake, Redshift, PostgreSQL)
Custom Testing Frameworks
For complex AI pipelines, a custom testing framework may be necessary.
Custom framework components:
- Test definitions (Python functions or configuration files)
- Test runner (executes tests and collects results)
- Result storage (database for historical test results)
- Alerting (notifications for failed tests)
- Dashboard (visualization of test results over time)
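A very small version of the first four components might look like the sketch below. Everything here is an illustrative assumption: checks are plain functions that raise `AssertionError` on failure, `store` is a list standing in for a results database, and `alert` is any callable that delivers a notification.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str
    ran_at: datetime


def run_checks(
    checks: dict[str, Callable[[], None]],
    store: list[CheckResult],
    alert: Callable[[CheckResult], None],
) -> list[CheckResult]:
    """Run every check, record each result in the store, and alert on failures."""
    results = []
    for name, check in checks.items():
        try:
            check()
            result = CheckResult(name, True, "", datetime.now(timezone.utc))
        except AssertionError as exc:
            result = CheckResult(name, False, str(exc), datetime.now(timezone.utc))
            alert(result)
        store.append(result)
        results.append(result)
    return results
```

Keeping every result, including passes, is what makes a dashboard of test outcomes over time possible later.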
Pipeline Testing in CI/CD
Automated Testing Integration
Integrate pipeline tests into the CI/CD pipeline:
- Run schema and value tests after every pipeline code change (in the development environment)
- Run distribution tests on a sample of production-like data before deploying pipeline changes
- Run end-to-end tests in a staging environment before promoting to production
- Run all tests on a schedule (daily or after each pipeline run) in production
Testing Environments
Development environment: Test pipeline code changes against a small, representative dataset. Focus on schema tests, value tests, and known-input tests.
Staging environment: Test pipeline changes against a larger, production-like dataset. Focus on distribution tests, performance tests, and end-to-end tests.
Production environment: Monitor pipeline outputs after every run. Focus on freshness tests, distribution tests, and consistency tests.
Your Next Step
Audit one production AI pipeline your agency operates. For each stage of the pipeline (ingestion, transformation, feature computation, model serving), answer: what tests exist today? What failures would go undetected? Start with the simplest, highest-impact test: add a freshness check that verifies the most recent feature values are no older than 2x the expected update interval. Then add null rate checks on the 10 most important features. These two tests take an afternoon to implement and would have caught the most common silent pipeline failures we see across agency deployments. Build from there: add distribution tests, consistency tests, and end-to-end tests incrementally. Pipeline testing is not a one-time effort; it is a practice that grows alongside your pipeline.