
Fourteen Pipeline Steps That Worked in the Demo, Not in Prod

Agency Script Editorial, Editorial Team
March 20, 2026 · 13 min read

Tags: ML pipelines, workflow orchestration, Airflow, Kubeflow

Orchestrating Complex ML Pipelines: Airflow, Kubeflow, and Beyond for AI Agencies

An AI agency in San Francisco delivered a demand forecasting system to a grocery chain. The system worked beautifully, but only during the demo. In production, it was a nightmare. The pipeline had 14 steps: data extraction from three source systems, data cleaning, feature engineering, model training, validation, model registration, deployment, monitoring setup, and notification. These steps were wired together with cron jobs and bash scripts. When step 7 failed at 3 AM because the feature store was temporarily unavailable, steps 8 through 14 ran anyway, on stale data. The forecasting system produced wildly inaccurate predictions for 200 stores. By the time anyone noticed at 8 AM, the stores had already placed incorrect orders with suppliers.

The agency spent three weeks rebuilding the pipeline with Apache Airflow. Now, when step 7 fails, the pipeline pauses, retries three times, and sends an alert. Downstream steps do not run until upstream steps succeed. The same pipeline that was a liability became a reliable, observable, self-healing system.
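The control flow described here (retry a failed step, alert, and hold back everything downstream) can be illustrated with a minimal plain-Python sketch. This is not Airflow's API: `run_pipeline`, the step names, and the print-based alert are hypothetical stand-ins for what an orchestrator provides out of the box.

```python
from typing import Callable, List, Tuple


def run_pipeline(steps: List[Tuple[str, Callable[[], None]]],
                 max_retries: int = 3) -> List[str]:
    """Run steps in order and stop at the first step that exhausts its retries.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, step in steps:
        # One initial attempt plus `max_retries` retries.
        for attempt in range(1 + max_retries):
            try:
                step()
                completed.append(name)
                break
            except Exception as exc:
                print(f"{name} failed (attempt {attempt + 1}): {exc}")
        else:
            # The inner loop never hit `break`, so every attempt failed.
            print(f"ALERT: {name} exhausted retries; downstream steps will not run")
            break
    return completed
```

With cron jobs, each step starts on the clock regardless of what happened upstream. Here a step only starts once every earlier step has succeeded.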

Pipeline orchestration is the unglamorous backbone of production ML. It is not what agencies pitch in sales meetings. But it is what determines whether your models actually work reliably in the real world, and whether your client calls you at 8 AM with a crisis or at 8 AM with a renewal.

Why ML Pipelines Are Harder Than Software Pipelines

Software CI/CD pipelines are relatively straightforward: build, test, deploy. ML pipelines are fundamentally more complex because they involve data dependencies, non-deterministic outputs, and long-running compute jobs.

Data dependencies are unpredictable. A software build depends on source code that changes when developers commit. An ML pipeline depends on data that changes continuously, often without warning. Source schemas evolve, data volumes fluctuate, quality degrades, and new edge cases appear constantly.

Steps have heterogeneous compute requirements. Data extraction might need network I/O. Feature engineering might need a Spark cluster. Model training might need GPUs. Validation might need a CPU. Each step has different resource requirements and different failure modes.

Execution times are variable. A software build takes roughly the same time every run. Model training might take 20 minutes or 4 hours depending on the data volume and convergence behavior. Orchestration needs to handle this variability gracefully.

Output validation is non-trivial. In software, a test either passes or fails. In ML, a trained model needs evaluation against multiple metrics, comparison against the current production model, and assessment of fairness and bias, all before deciding whether to promote it.

Reproducibility matters. When a production model degrades, you need to rerun the exact pipeline that produced the previous good model: same data version, same code version, same hyperparameters. Your orchestration system needs to track all of these.
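One way to make "same data version, same code version, same hyperparameters" checkable is to record a fingerprint with every run. The sketch below is an illustration only. The function name and the manifest fields are assumptions, and ML-aware orchestrators typically store richer metadata than this.

```python
import hashlib
import json


def run_fingerprint(data_version: str, code_version: str, hyperparameters: dict) -> str:
    """Return a stable hash of the inputs that determine a training run."""
    manifest = {
        "data_version": data_version,
        "code_version": code_version,
        "hyperparameters": hyperparameters,
    }
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(manifest, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs with the same fingerprint used the same recorded inputs. A differing fingerprint tells you where to start looking when a retrained model behaves differently.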

The Orchestration Tool Landscape

Apache Airflow

What it is: The most widely adopted open-source workflow orchestration platform. Pipelines are defined as Python code (DAGs, or Directed Acyclic Graphs), executed on a scheduler, and monitored through a web UI.

Strengths for agency work:

  • Massive community and ecosystem: plugins for most major cloud services, databases, and APIs
  • Python-native: your data scientists can read and modify pipeline definitions
  • Battle-tested at scale: used by Airbnb, Spotify, and thousands of enterprises
  • Rich monitoring and alerting: task-level visibility, retry logic, SLA monitoring
  • Managed offerings available (Amazon MWAA, Google Cloud Composer, Astronomer) that reduce operational burden

Limitations:

  • Not ML-specific: no native concepts for models, experiments, or features
  • DAG definitions can become complex for large pipelines
  • The scheduler can become a bottleneck at very high task volumes
  • Kubernetes executor setup requires significant DevOps expertise

Best for: Agencies that need a general-purpose orchestrator for data-heavy ML pipelines, especially when the client already uses Airflow for ETL.

Kubeflow Pipelines

What it is: An ML-specific orchestration platform that runs on Kubernetes. Pipelines are defined with a Python SDK and compiled into workflows in which each step runs in its own isolated container.

Strengths for agency work:

  • ML-native: built-in concepts for experiments, model artifacts, and metrics
  • Container isolation: each step runs in its own environment, eliminating dependency conflicts
  • Reproducibility: every run records the exact container images, parameters, and artifacts used
  • GPU support: first-class support for GPU workloads (training, inference)
  • Integration with ML tools: native integrations with TensorFlow, PyTorch, and other ML frameworks

Limitations:

  • Requires Kubernetes: significant infrastructure complexity if the client does not already run Kubernetes
  • Steeper learning curve than Airflow for data engineering-focused teams
  • Smaller community than Airflow
  • Can be over-engineered for simple pipelines

Best for: Agencies building ML-heavy pipelines (training, evaluation, deployment) for clients that already run Kubernetes or are willing to invest in it.

Prefect

What it is: A modern workflow orchestration platform designed to be simpler and more Pythonic than Airflow. Pipelines are defined as decorated Python functions.

Strengths for agency work:

  • Very Pythonic: pipelines look like normal Python code with decorators, making them accessible to data scientists
  • Hybrid execution model: orchestration runs in the cloud, execution can happen anywhere (local, cloud, Kubernetes)
  • Better handling of dynamic workflows: DAGs do not need to be fixed at parse time
  • Strong observability: detailed task state tracking and failure diagnostics
  • Managed cloud offering (Prefect Cloud) eliminates infrastructure management

Limitations:

  • Smaller ecosystem than Airflow: fewer pre-built integrations
  • Newer platform: less battle-tested at extreme scale
  • Some advanced features are behind the paid cloud offering

Best for: Agencies that want modern workflow orchestration without the operational complexity of Airflow, especially for teams where data scientists need to work directly with pipeline definitions.

Dagster

What it is: A data pipeline orchestrator focused on the software engineering experience, with strong typing, testing, and development environment support.

Strengths for agency work:

  • Software-defined assets: pipelines are defined in terms of the data assets they produce, not just the tasks they execute
  • Strong testing support: unit test individual pipeline components before deploying
  • Development environment: local development and testing without needing the full production infrastructure
  • Data lineage: automatic tracking of how data assets relate to each other
  • Type system: catches many errors at definition time rather than runtime

Limitations:

  • Smaller community than Airflow
  • Steeper learning curve for teams used to task-based orchestration
  • Fewer pre-built integrations

Best for: Agencies that prioritize code quality and testability in their pipeline definitions, and teams that think in terms of data assets rather than tasks.

Choosing Your Orchestrator

Here is the decision framework:

  • Client already uses Airflow: Use Airflow. Do not introduce a new orchestrator unless there is a compelling reason.
  • ML-heavy pipeline with Kubernetes: Use Kubeflow Pipelines.
  • Data science team needs to own the pipeline: Use Prefect. The Pythonic interface reduces the barrier for data scientists.
  • Software engineering team builds pipelines: Use Dagster. The development experience and testing support align with engineering best practices.
  • Simple pipeline (< 10 steps), no existing orchestrator: Use Prefect with the managed cloud. Fastest time to production.

Designing ML Pipelines for Production

Pipeline Architecture Patterns

Pattern 1: The Training Pipeline

The most common ML pipeline. Takes raw data, produces a trained and validated model.

Steps:

  1. Extract data from source systems
  2. Validate source data quality
  3. Transform and clean data
  4. Engineer features
  5. Split into train/validation/test sets
  6. Train model
  7. Evaluate on validation set
  8. Compare against current production model
  9. If better: register model, run bias checks, deploy to staging
  10. Run integration tests on staging
  11. Promote to production
  12. Update monitoring dashboards

Key design decisions:

  • Steps 1-5 should be idempotent. Running them twice on the same data should produce the same output. This enables safe retries.
  • Step 8 is your quality gate. Never promote a model to production without automated comparison against the incumbent. Define clear criteria: "new model must improve AUC by at least 0.5% without degrading any segment-level metric."
  • Step 9 (bias checks) should be a hard gate. If bias checks fail, the pipeline stops. Period. Do not make this a warning that someone might ignore.
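The quality gate and the hard bias gate can be expressed as a single promotion check. The sketch below is a minimal illustration under stated assumptions: the metric dictionaries, the function name, and reading "0.5%" as an absolute AUC gain of 0.005 are all choices made for the example, not something the criteria above prescribe.

```python
def should_promote(candidate: dict, incumbent: dict, bias_checks_passed: bool,
                   min_auc_gain: float = 0.005) -> bool:
    """Decide whether a candidate model may replace the incumbent.

    Each model is described as {"auc": float, "segments": {name: score}}.
    """
    # Hard gate: a failed bias check stops promotion regardless of metrics.
    if not bias_checks_passed:
        return False
    # Overall improvement must clear the agreed threshold.
    if candidate["auc"] - incumbent["auc"] < min_auc_gain:
        return False
    # No segment may score below the incumbent; a missing segment counts as 0.
    for segment, incumbent_score in incumbent["segments"].items():
        if candidate["segments"].get(segment, 0.0) < incumbent_score:
            return False
    return True
```

Returning a plain boolean keeps the gate easy to wire into a branching step: anything other than `True` means the pipeline stops before registration.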

Pattern 2: The Feature Pipeline

Runs on a schedule (hourly, daily) to keep the feature store current.

Steps:

  1. Extract incremental data from source systems
  2. Validate data freshness and quality
  3. Compute feature updates
  4. Validate computed features (no nulls, within expected ranges)
  5. Write to the feature store (online and offline)
  6. Update feature metadata (last updated timestamp, row counts)
  7. Alert if any features are stale or degraded

Key design decisions:

  • Idempotency is critical. If the pipeline runs twice for the same time window, it should produce the same features. Design with "upsert" semantics, not "append."
  • Freshness SLAs drive scheduling. If the client needs features updated every hour, the pipeline must complete within the hour. Build in buffer time.
  • Feature validation should catch drift. Compare computed feature distributions against historical baselines. Alert on significant shifts.
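The difference between "upsert" and "append" semantics is easiest to see in code. In this sketch an in-memory dict stands in for the feature store, and the key `(entity_id, window_start)` is an assumed schema chosen for illustration.

```python
from typing import Dict, List, Tuple


def upsert_features(store: Dict[Tuple[str, str], dict], rows: List[dict]) -> Dict[Tuple[str, str], dict]:
    """Write feature rows keyed by entity and time window.

    Writing the same window twice overwrites the earlier values instead of
    adding duplicates, so a rerun leaves the store unchanged.
    """
    for row in rows:
        key = (row["entity_id"], row["window_start"])
        store[key] = row["features"]
    return store
```

An append-based writer would double the row count on a rerun of the same window. The keyed write makes the rerun a no-op, which is what allows the orchestrator to retry the step safely.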

Pattern 3: The Monitoring Pipeline

Runs continuously or on a tight schedule to detect production issues.

Steps:

  1. Collect prediction logs from the inference service
  2. Compute performance metrics (if labels are available)
  3. Compute distribution metrics (prediction distributions, feature distributions)
  4. Compare against baselines
  5. Generate alerts for threshold breaches
  6. Produce monitoring dashboards and reports
  7. Trigger retraining pipeline if degradation exceeds threshold

Key design decisions:

  • Label delay handling. In many applications, ground truth labels arrive days or weeks after predictions. The monitoring pipeline needs to join predictions with their eventual labels when they become available.
  • Alerting should be actionable. "Model performance degraded" is not actionable. "Prediction precision for segment X dropped below 80% threshold, likely due to feature Y distribution shift" is actionable.
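Label delay handling comes down to scoring only the predictions whose ground truth has arrived and keeping track of the rest. A minimal sketch, assuming predictions and labels are both keyed by a hypothetical prediction ID and that accuracy is the metric of interest:

```python
from typing import Dict, List, Optional, Tuple


def score_with_delayed_labels(predictions: Dict[str, int],
                              labels: Dict[str, int]) -> Tuple[Optional[float], List[str]]:
    """Return (accuracy over labeled predictions, IDs still awaiting labels)."""
    matched = [pid for pid in predictions if pid in labels]
    pending = [pid for pid in predictions if pid not in labels]
    if not matched:
        # No ground truth yet: report no metric instead of a misleading zero.
        return None, pending
    correct = sum(predictions[pid] == labels[pid] for pid in matched)
    return correct / len(matched), pending
```

The pending list is what the next monitoring run revisits, so a prediction is scored once its label eventually lands.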

Error Handling and Recovery

Retry with exponential backoff. Transient failures (network timeouts, temporary service unavailability) are common. Configure retries with increasing wait times, for example 30 seconds, 2 minutes, 10 minutes, then 1 hour.
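A minimal sketch of the retry pattern follows. The function names and defaults are assumptions for illustration. Note that a strictly geometric schedule with a factor of 4 gives 30 seconds, 2 minutes, 8 minutes, and 32 minutes, which is close to but not identical to the hand-picked waits in the paragraph above.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(base_seconds: float = 30.0, factor: float = 4.0,
                   max_seconds: float = 3600.0, retries: int = 4) -> List[float]:
    """Return geometrically increasing wait times, capped at max_seconds."""
    return [min(base_seconds * factor ** attempt, max_seconds) for attempt in range(retries)]


def call_with_retries(operation: Callable[[], T], delays: List[float],
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, waiting delays[i] after the i-th failure before retrying."""
    for delay in delays:
        try:
            return operation()
        except Exception:
            sleep(delay)
    # Final attempt: let any exception propagate to the caller.
    return operation()
```

The `sleep` parameter is injectable so the behavior can be tested without actually waiting.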

Dead letter queues for data pipelines. Records that fail validation should be routed to a dead letter queue for investigation, not silently dropped or allowed to block the pipeline.
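The routing logic can be sketched as follows. A Python list stands in for the dead letter queue, and the `validate_order` rules are hypothetical. A real pipeline would write rejected records to a durable queue or table.

```python
from typing import Callable, List, Optional, Tuple


def partition_records(records: List[dict],
                      validate: Callable[[dict], Optional[str]]) -> Tuple[List[dict], List[dict]]:
    """Split records into valid ones and dead letters annotated with the error."""
    valid, dead_letters = [], []
    for record in records:
        error = validate(record)
        if error is None:
            valid.append(record)
        else:
            # Keep the original record and the reason so it can be investigated.
            dead_letters.append({"record": record, "error": error})
    return valid, dead_letters


def validate_order(record: dict) -> Optional[str]:
    """Example validator: return an error message, or None if the record is fine."""
    if record.get("quantity") is None:
        return "missing quantity"
    if record["quantity"] < 0:
        return "negative quantity"
    return None
```

Valid records continue through the pipeline, while the dead letters keep enough context to be replayed after the upstream issue is fixed.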

Checkpoint and resume. Long-running steps (model training) should checkpoint progress so that a failure at 90% completion does not require starting from scratch.
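A minimal sketch of checkpoint and resume, assuming progress can be summarized as "number of epochs completed" and stored in a small JSON file. The `fail_at` parameter exists only to simulate a crash. Real training code would also persist model and optimizer state.

```python
import json
import os
from typing import List, Optional


def train(total_epochs: int, checkpoint_path: str, fail_at: Optional[int] = None) -> List[int]:
    """Run the remaining epochs and return the list of epochs executed in this call."""
    state = {"epoch": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume from the last completed epoch

    executed = []
    for epoch in range(state["epoch"], total_epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError(f"simulated failure at epoch {epoch}")
        # ... one epoch of training would run here ...
        executed.append(epoch)
        state["epoch"] = epoch + 1
        # Write to a temp file, then rename, so a crash mid-write cannot
        # leave a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return executed
```

If a 10-epoch run fails at the last epoch, the next call with the same checkpoint path executes only that final epoch.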

Circuit breakers for external dependencies. If a source system is down, the pipeline should not keep hammering it. Implement circuit breakers that pause and alert after a threshold of failures.
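The sketch below shows one simple form of a circuit breaker. The class name, threshold, and cooldown are illustrative assumptions. After the cooldown it allows a single trial call, and a failed trial reopens the circuit immediately.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency until a cooldown period has passed."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: skipping call to failing dependency")
            self.opened_at = None  # cooldown elapsed, allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or reopen) the circuit
            raise
        self.failures = 0  # a success resets the failure count
        return result
```

While the circuit is open, the source system receives no traffic at all. An orchestrator task would catch the "circuit open" error, send the alert, and reschedule.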

Manual intervention hooks. Some failures require human judgment. Design your pipeline to pause and wait for human input at critical decision points, such as when the newly trained model is worse than the current production model but there is a known data issue that might explain it.
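The decision point in that example can be sketched as a small state function. The state names and parameters are hypothetical. In practice the "paused" state would map to whatever approval or sensor mechanism your orchestrator offers.

```python
from typing import Optional


def next_action(candidate_auc: float, production_auc: float,
                known_data_issue: bool, approval: Optional[bool] = None) -> str:
    """Return what the pipeline should do after model comparison.

    approval is None until a human has reviewed the run, then True or False.
    """
    if candidate_auc >= production_auc:
        return "promote"
    if not known_data_issue:
        return "reject"
    # Worse model, but a known data issue may explain it: a human decides.
    if approval is None:
        return "paused_for_review"
    return "promote" if approval else "reject"
```

The pipeline only pauses in the ambiguous case. Clear wins and clear losses proceed without waiting on a person.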

Pricing Orchestration Work

Pipeline orchestration is infrastructure work that agencies often underquote because it seems less exciting than model development. Do not underquote it: it is critical and time-consuming.

Pipeline development:

  • Simple pipeline (5-8 steps, single data source): $15,000 - $30,000
  • Standard pipeline (10-15 steps, multiple data sources, quality gates): $30,000 - $60,000
  • Complex pipeline (20+ steps, multiple environments, advanced error handling): $60,000 - $120,000

Pipeline operations retainer: $3,000 - $8,000 per month for monitoring, troubleshooting, and evolution.

Frame the value to clients: "A reliable pipeline means your models produce accurate predictions every day without human intervention. An unreliable pipeline means your team spends Monday mornings debugging Saturday's failures instead of building new capabilities."

Your Next Step

Audit the orchestration of your current client deployments. For each one, answer these questions: What happens when step N fails? Does the pipeline retry, alert, or silently continue? Can you rerun the pipeline from an arbitrary checkpoint? Do you have visibility into which step is running and how long each step takes? If the answers are "I do not know," that is your cue to implement proper orchestration before the next production incident forces you to.
