From Notebook to Production by Email: An 11-Day Slog

Agency Script Editorial

Editorial Team

March 21, 2026 · 14 min read
Tags: ml ci/cd, mlops automation, continuous delivery ai, ml pipeline automation

A digital marketing company had a data science team of eight building recommendation and targeting models. Their deployment process was manual: a data scientist would train a model in a notebook, export the model artifact, and email it to the ML engineer, who would manually deploy it to a staging server, run some tests, and then manually promote it to production. The average time from trained model to production deployment was 11 days. During that time, three different people were involved, two handoffs could introduce errors, and the lack of automated testing meant quality was inconsistent. When the company grew to 15 models in production with monthly refresh cycles, the deployment process consumed 40 percent of the ML engineer's time.

An AI agency built a CI/CD pipeline for their ML workflows. When a data scientist commits a model training change, the pipeline automatically trains the model, runs evaluation tests, compares performance against the current production model, and, if it passes all gates, deploys to production with a canary rollout. Average deployment time dropped from 11 days to 4 hours. The ML engineer was freed to build infrastructure instead of babysitting deployments. Deployment errors, which had caused three production incidents in the previous quarter, dropped to zero.

Why ML Needs Its Own CI/CD

Traditional software CI/CD is not enough for ML on its own, because ML systems have fundamentally different components that change independently.

In traditional software, the code is the product. CI/CD tests and deploys code changes.

In ML, the system has three products that change independently:

  • Code: The pipeline code, feature engineering logic, model architecture, and serving infrastructure
  • Data: The training data, feature data, and reference data that the model depends on
  • Models: The trained model artifacts that result from code + data

A change to any one of these can break the system. Traditional CI/CD handles code changes. ML CI/CD must handle all three.

This means ML CI/CD needs additional capabilities:

  • Data validation: When training data changes, validate that the new data meets quality requirements before retraining
  • Model evaluation: When a new model is trained, evaluate it against comprehensive test suites before promoting
  • Model comparison: Compare the new model against the current production model to verify improvement
  • Artifact management: Track and version model artifacts with their associated code and data versions
  • Deployment strategies: Support ML-specific deployment strategies (shadow deployment, canary, A/B testing) that are not standard in software CI/CD

The ML CI/CD Pipeline

Stage 1: Code Validation

Triggered by code changes (commits to the ML repository).

Steps:

  1. Linting and formatting: Enforce code style standards
  2. Unit tests: Test individual functions, such as feature engineering logic, data processing functions, and utility code
  3. Integration tests: Test that pipeline components work together across data loading, feature computation, model training, and model serving
  4. Static analysis: Check for common ML code issues (data leakage, non-deterministic operations, hardcoded values)
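
As a rough illustration of the unit-test step, the sketch below tests one feature engineering function using Python's standard `unittest` module. The function `days_since_last_purchase`, its sentinel value, and its future-date check are hypothetical and not taken from any specific project.

```python
import unittest
from datetime import date
from typing import Optional


def days_since_last_purchase(last_purchase: Optional[date], today: date) -> int:
    """Hypothetical feature: recency in days, or -1 when there is no purchase."""
    if last_purchase is None:
        return -1
    if last_purchase > today:
        # A purchase dated after "today" usually points to leakage or bad data.
        raise ValueError("last_purchase is in the future")
    return (today - last_purchase).days


class DaysSinceLastPurchaseTest(unittest.TestCase):
    def test_regular_case(self):
        self.assertEqual(
            days_since_last_purchase(date(2026, 3, 1), date(2026, 3, 21)), 20
        )

    def test_missing_purchase_uses_sentinel(self):
        self.assertEqual(days_since_last_purchase(None, date(2026, 3, 21)), -1)

    def test_future_date_is_rejected(self):
        with self.assertRaises(ValueError):
            days_since_last_purchase(date(2026, 4, 1), date(2026, 3, 21))
```

A CI job could run tests like these with `python -m unittest` on every commit, so a change to feature logic fails fast before any training compute is spent.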

Stage 2: Data Validation

Triggered by data changes (new data arrives, training dataset is updated).

Steps:

  1. Schema validation: Verify that the data schema matches expectations (column names, data types, non-nullable fields)
  2. Statistical validation: Verify that data distributions are within expected ranges. Detect sudden shifts that might indicate data pipeline issues.
  3. Volume validation: Verify that the data volume is within expected ranges. A 50 percent drop in training data volume warrants investigation.
  4. Quality validation: Run data quality checks (completeness, consistency, freshness)
  5. Drift detection: Compare current data distributions against a reference to detect meaningful drift that might require model retraining or investigation
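
The checks above can be sketched in plain Python. This is a minimal example under stated assumptions: the schema, the minimum row count, and the column names are invented for illustration, and the population stability index (PSI) is used here as one common drift score, not as the only option.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical expectations for one training table; names and limits are illustrative.
EXPECTED_SCHEMA = {"user_id": int, "age": int, "spend": float}
MIN_ROWS = 3


def validate_schema(rows: List[Dict[str, object]]) -> List[str]:
    """Return a list of schema problems (missing columns, nulls, wrong types)."""
    problems = []
    for i, row in enumerate(rows):
        for column, expected_type in EXPECTED_SCHEMA.items():
            value = row.get(column)
            if value is None:
                problems.append(f"row {i}: {column} is missing or null")
            elif not isinstance(value, expected_type):
                problems.append(f"row {i}: {column} should be {expected_type.__name__}")
    return problems


def validate_volume(rows: Sequence[object], min_rows: int = MIN_ROWS) -> bool:
    """Volume check: fail when the dataset is smaller than the expected minimum."""
    return len(rows) >= min_rows


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """PSI between two numeric samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all reference values are equal

    def bucket_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bucket_shares(reference), bucket_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A pipeline might fail the stage when `validate_schema` returns any problems, and flag the dataset for investigation when the PSI for a key feature exceeds a chosen threshold. A value around 0.2 is a common rule of thumb, but the right threshold depends on the feature and the model.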

Stage 3: Model Training

Triggered by code changes, data changes, or scheduled retraining.

Steps:

  1. Environment setup: Provision training infrastructure with the correct dependencies, GPU type, and resource allocation
  2. Data preparation: Load and prepare training data using the validated dataset
  3. Model training: Execute the training job with logging of all hyperparameters, metrics, and artifacts
  4. Artifact storage: Save the trained model, training metrics, and configuration to the model registry
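
One way to make the logging and artifact storage steps concrete is a run manifest written next to the model file. The sketch below uses only the Python standard library. The field names, the `manifest.json` filename, and the stand-in model bytes are assumptions for illustration. A real pipeline would more likely record this through its model registry or experiment tracker.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Content hash used to pin the exact artifact a run produced."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_run_manifest(
    run_dir: Path,
    *,
    code_commit: str,
    data_version: str,
    hyperparameters: dict,
    metrics: dict,
    model_path: Path,
) -> Path:
    """Write a JSON manifest next to the model so the run can be traced later."""
    manifest = {
        "code_commit": code_commit,
        "data_version": data_version,
        "hyperparameters": hyperparameters,
        "metrics": metrics,
        "model_file": model_path.name,
        "model_sha256": sha256_of(model_path),
    }
    manifest_path = run_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest_path


# Usage with a stand-in model file; a real job would write the trained model here.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    model_path = run_dir / "model.bin"
    model_path.write_bytes(b"stand-in model weights")
    manifest_path = write_run_manifest(
        run_dir,
        code_commit="abc1234",          # hypothetical Git commit
        data_version="2026-03-01",      # hypothetical dataset snapshot label
        hyperparameters={"learning_rate": 0.05, "max_depth": 6},
        metrics={"f1": 0.81},
        model_path=model_path,
    )
    manifest = json.loads(manifest_path.read_text())
```

Recording the code commit, data version, and artifact hash together is what later lets a deployed model be tied back to the run that produced it.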

Stage 4: Model Evaluation

Triggered by the completion of model training.

Steps:

  1. Benchmark evaluation: Test the model against a curated benchmark dataset. Compute all standard metrics (accuracy, precision, recall, F1, AUC, etc.).
  2. Fairness evaluation: Test for performance disparities across protected groups
  3. Robustness evaluation: Test against edge cases, adversarial inputs, and out-of-distribution data
  4. Performance comparison: Compare the new model against the current production model. The new model must meet a minimum improvement threshold to proceed.
  5. Cost evaluation: Estimate the inference cost of the new model. Flag if significantly more expensive than the current model.

Evaluation gates:

Define clear pass/fail criteria for each evaluation. Example:

  • Benchmark accuracy must be within 1 percentage point of training accuracy
  • F1 must be higher than the current production model
  • Fairness gap must be under 5 percent across all protected groups
  • Inference latency must be under 100ms at p95
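
The example gates above could be encoded as a small function that the pipeline calls after evaluation. This is a sketch, not a definitive implementation: the metric key names are hypothetical, rates are expressed as fractions, and the accuracy and fairness limits are read as absolute gaps (0.01 and 0.05).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    detail: str


def evaluate_gates(
    candidate: Dict[str, float], production: Dict[str, float]
) -> List[GateResult]:
    """Apply the four example gates; metric keys are illustrative."""
    accuracy_gap = abs(candidate["train_accuracy"] - candidate["benchmark_accuracy"])
    return [
        GateResult("accuracy_gap", accuracy_gap <= 0.01, f"gap={accuracy_gap:.3f}"),
        GateResult(
            "f1_vs_production",
            candidate["f1"] > production["f1"],
            f"{candidate['f1']:.3f} vs {production['f1']:.3f}",
        ),
        GateResult(
            "fairness_gap",
            candidate["fairness_gap"] < 0.05,
            f"gap={candidate['fairness_gap']:.3f}",
        ),
        GateResult(
            "latency_p95_ms",
            candidate["latency_p95_ms"] < 100,
            f"p95={candidate['latency_p95_ms']:.0f}ms",
        ),
    ]


def all_gates_pass(results: List[GateResult]) -> bool:
    """The model proceeds to registration only when every gate passes."""
    return all(r.passed for r in results)
```

Returning one result per gate, instead of a single boolean, lets the pipeline report exactly which criterion blocked a model.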

Stage 5: Model Registration

Triggered by passing all evaluation gates.

Steps:

  1. Register the model in the model registry with version, metadata, and lineage information
  2. Promote to staging: Move the model artifact to the staging environment
  3. Update documentation: Auto-generate or update model cards with the latest evaluation results

Stage 6: Staging Validation

Testing the model in a production-like environment before deploying to production.

Steps:

  1. Integration testing: Test the model with the production serving infrastructure, feature pipeline, and monitoring
  2. Load testing: Verify that the model meets latency and throughput requirements under production-like load
  3. Smoke testing: Run a set of representative real-world inputs and verify outputs are reasonable
  4. Compliance checks: Verify all governance and compliance requirements are met

Stage 7: Production Deployment

Steps:

  1. Deploy with canary strategy: Route a small percentage of traffic (5 to 10 percent) to the new model
  2. Monitor canary metrics: Track prediction quality, latency, error rate, and business metrics for the canary population
  3. Compare canary vs. control: Verify that the canary population performs at least as well as the control
  4. Gradual rollout: If canary passes, gradually increase traffic to the new model (25 percent, 50 percent, 100 percent)
  5. Rollback capability: If any metric degrades during rollout, automatically roll back to the previous model version
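
The rollout logic above can be sketched as a loop over traffic percentages. The two callables, `set_traffic` and `canary_is_healthy`, are placeholders for whatever the serving layer and monitoring stack actually provide. A real rollout would also wait between steps to collect enough traffic before judging health.

```python
from typing import Callable, List, Sequence, Tuple

ROLLOUT_STEPS = (5, 25, 50, 100)  # percent of traffic, following the steps above


def run_canary_rollout(
    set_traffic: Callable[[int], None],
    canary_is_healthy: Callable[[int], bool],
    steps: Sequence[int] = ROLLOUT_STEPS,
) -> Tuple[bool, List[int]]:
    """Walk through the traffic steps; roll back to 0 percent on the first failure.

    Returns (fully_deployed, history of traffic percentages that were set).
    """
    history: List[int] = []
    for percent in steps:
        set_traffic(percent)
        history.append(percent)
        if not canary_is_healthy(percent):
            # Automatic rollback: all traffic returns to the previous model.
            set_traffic(0)
            history.append(0)
            return False, history
    return True, history


# Usage with stand-in callables: the canary degrades once it reaches 50 percent.
applied: List[int] = []
deployed, history = run_canary_rollout(
    set_traffic=applied.append,
    canary_is_healthy=lambda percent: percent < 50,
)
```

In this usage example the rollout stops at the 50 percent step and traffic is set back to 0, so `deployed` is `False` and the previous model keeps serving all requests.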

Infrastructure for ML CI/CD

Source Control

Use Git for all ML artifacts: pipeline code, feature engineering code, model architecture code, configuration files, and small test datasets. Use a branching strategy (main/develop/feature branches) and require pull request reviews for all changes.

CI/CD Platform

Use GitHub Actions, GitLab CI, or Jenkins to orchestrate the CI/CD pipeline. These platforms trigger pipeline stages on code changes and can schedule periodic runs.

Considerations for ML:

  • Training and evaluation jobs often need GPU resources that standard hosted CI/CD runners typically do not provide. Configure self-hosted runners with GPU access for these stages.
  • ML jobs are long-running (minutes to hours, not seconds). Configure appropriate timeouts and consider asynchronous pipeline stages with callbacks.

Model Registry

Use MLflow Model Registry, Vertex AI Model Registry, or SageMaker Model Registry for versioning, staging, and promoting model artifacts. The model registry is the bridge between training and deployment.

Artifact Storage

Store model artifacts, evaluation results, and deployment configurations in versioned object storage (S3, GCS) with clear naming conventions and lifecycle policies.
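
A naming convention is easier to enforce when a single helper builds every storage key. The layout below (`models/<name>/v<version>/<commit>/<file>`) is one possible convention chosen for illustration, not a standard, and `artifact_key` is a hypothetical helper.

```python
import re


def artifact_key(model_name: str, version: int, git_commit: str, filename: str) -> str:
    """Build a predictable object-storage key; the layout is one possible convention."""
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]*", model_name):
        raise ValueError("model_name must use lowercase letters, digits, and hyphens")
    if version < 1:
        raise ValueError("version starts at 1")
    # Zero-padding keeps lexicographic listing order equal to version order.
    return f"models/{model_name}/v{version:04d}/{git_commit[:7]}/{filename}"


# Usage with made-up values.
key = artifact_key("churn-model", 12, "9f2c1ab77de", "model.pkl")
```

Here `key` is `models/churn-model/v0012/9f2c1ab/model.pkl`. Including the short commit hash in the path ties each stored artifact to the code that produced it.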

Delivery Process

Phase 1: Assessment and Design (Weeks 1-3)

  • Map the current model development and deployment workflow
  • Identify manual steps, bottlenecks, and error-prone handoffs
  • Define the target CI/CD pipeline stages and gates
  • Select technology components
  • Design the pipeline architecture

Phase 2: Foundation (Weeks 4-8)

  • Set up the CI/CD platform with ML-capable runners
  • Implement the code validation stage (linting, unit tests, integration tests)
  • Set up the model registry
  • Implement basic model training automation
  • Build the deployment pipeline with rollback capability

Phase 3: Evaluation and Gates (Weeks 9-13)

  • Build the comprehensive evaluation pipeline
  • Implement the benchmark test suite
  • Implement fairness and robustness evaluation
  • Define and configure evaluation gates
  • Implement the model comparison pipeline

Phase 4: Advanced Capabilities (Weeks 14-18)

  • Implement data validation and drift detection
  • Build canary deployment automation
  • Implement automated monitoring integration
  • Build dashboards for pipeline health and model quality
  • Train the ML team on using the CI/CD pipeline

Common ML CI/CD Anti-Patterns

The "Data Science is Different" Anti-Pattern. Data scientists argue that ML development is inherently experimental and cannot be subjected to the same discipline as software development. They resist code reviews, automated testing, and standardized pipelines. The result is a collection of one-off notebooks that cannot be reproduced, tested, or deployed automatically. The fix: acknowledge that experimentation needs freedom, but draw a clear line between experimentation (which happens in notebooks with loose rules) and productionization (which goes through a rigorous CI/CD pipeline). The transition from experiment to production is where discipline is enforced.

The "Test in Production" Anti-Pattern. The team skips staging validation because "the real test is production." This works until it does not. A model that passes offline evaluation but fails on production traffic causes user-facing incidents that could have been caught in staging. The fix: staging validation must be mandatory for every deployment. Invest in making the staging environment as close to production as possible.

The "Retrain Everything Weekly" Anti-Pattern. The team schedules weekly retraining for all models regardless of whether the data has changed or the model has drifted. This wastes compute resources and increases the risk of introducing regressions. The fix: trigger retraining based on data drift detection or performance degradation, not on a fixed schedule. A model that is performing well on stable data does not need weekly retraining.
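
A drift-based or degradation-based trigger can be as simple as the check below. This is a sketch under assumptions: the `ModelHealth` fields, the drift score scale, and both default thresholds are illustrative and would need tuning per model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelHealth:
    drift_score: float      # for example, a PSI-style score from data validation
    baseline_metric: float  # metric at deployment time (higher is better)
    live_metric: float      # same metric measured on recent traffic


def should_retrain(
    health: ModelHealth,
    drift_threshold: float = 0.2,
    max_relative_drop: float = 0.05,
) -> bool:
    """Retrain only when drift or degradation crosses a threshold."""
    drifted = health.drift_score >= drift_threshold
    relative_drop = (health.baseline_metric - health.live_metric) / health.baseline_metric
    degraded = relative_drop > max_relative_drop
    return drifted or degraded
```

A scheduled job can still run weekly, but it would call a check like this first and skip training when the model is stable.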

The "One Pipeline Fits All" Anti-Pattern. The team builds a single CI/CD pipeline template and forces every model through the same stages, regardless of the model's complexity, risk level, or deployment requirements. A simple logistic regression model goes through the same 7-stage pipeline as a complex ensemble. The fix: define pipeline tiers based on risk and complexity. Low-risk, simple models use a streamlined pipeline. High-risk, complex models use a comprehensive pipeline with additional gates and reviews.
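
Pipeline tiers can be expressed as configuration that a shared pipeline reads. The tier names, the risk and complexity labels, and the contents of each tier below are hypothetical. Staging validation is assumed to stay in every tier, so tiers differ only in evaluation depth, deployment strategy, and approvals.

```python
from typing import Dict

# Illustrative tier definitions; adjust the contents to the client's risk policy.
TIERS: Dict[str, dict] = {
    "streamlined": {
        "evaluations": ["benchmark", "comparison"],
        "deployment": "direct_with_rollback",
        "human_approval": False,
    },
    "standard": {
        "evaluations": ["benchmark", "fairness", "comparison"],
        "deployment": "canary",
        "human_approval": False,
    },
    "comprehensive": {
        "evaluations": ["benchmark", "fairness", "robustness", "comparison", "cost"],
        "deployment": "canary",
        "human_approval": True,
    },
}


def select_tier(risk: str, complexity: str) -> str:
    """Map a model's risk and complexity labels to a pipeline tier."""
    if risk == "high":
        return "comprehensive"
    if risk == "low" and complexity == "simple":
        return "streamlined"
    return "standard"


def pipeline_config(risk: str, complexity: str) -> dict:
    """Return the tier name together with its settings."""
    tier = select_tier(risk, complexity)
    return {"tier": tier, **TIERS[tier]}
```

With this shape, a simple low-risk logistic regression resolves to the streamlined tier, while any high-risk model gets the comprehensive tier regardless of complexity.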

The "No Rollback Plan" Anti-Pattern. The CI/CD pipeline can deploy new models but has no automated rollback capability. When a deployment causes a production issue, the team scrambles to manually deploy the previous version, which takes hours instead of minutes. The fix: rollback capability must be a first-class feature of the CI/CD pipeline. The previous production model should always be available for instant restoration.

ML CI/CD for Different Team Sizes

Small Teams (2-5 ML engineers)

Small teams should prioritize simplicity and automation of the most painful manual steps.

Focus on: Automated model training and evaluation. Basic data validation. Automated deployment with one-click rollback. Use managed services (GitHub Actions, Vertex AI Pipelines, SageMaker Pipelines) rather than building custom infrastructure.

Skip for now: Complex multi-stage canary deployments, automated retraining triggers, and advanced governance gates. These add value but add complexity that small teams cannot maintain.

Medium Teams (5-15 ML engineers)

Medium teams can support more pipeline sophistication and should invest in standardization.

Focus on: Standardized pipeline templates that every model uses. Automated evaluation gates with configurable thresholds. Canary deployment for high-traffic models. Data validation and drift detection. Model comparison dashboards.

Skip for now: Custom evaluation frameworks, automated A/B testing integration, and multi-environment deployment (unless required by regulation).

Large Teams (15+ ML engineers)

Large teams need enterprise-grade ML CI/CD with governance, compliance, and self-service capabilities.

Focus on: Self-service pipeline configuration for individual teams. Governance gates with role-based approvals. Automated compliance documentation generation. Multi-environment deployment (dev, staging, pre-prod, production). Advanced deployment strategies (shadow, canary, blue-green). Comprehensive audit trails.

Measuring CI/CD Effectiveness

Deployment frequency. How often are models deployed to production? Target: at least monthly for actively developed models, with the pipeline supporting same-day deployment when needed.

Lead time for changes. How long from a code commit to production deployment? Target: under 24 hours for model code changes, under 4 hours for configuration changes.

Deployment failure rate. What percentage of deployments require rollback? Target: under 5 percent. A high failure rate indicates insufficient evaluation gates.

Mean time to recovery. When a deployment fails, how long until the system is restored to a good state? Target: under 15 minutes with automated rollback.

Pipeline reliability. What percentage of pipeline runs complete successfully (not blocked by infrastructure issues, not failed due to flaky tests)? Target: over 95 percent. An unreliable pipeline becomes a bottleneck that teams work around rather than through.
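
Several of these metrics can be computed directly from deployment records. The sketch below assumes a hypothetical `Deployment` record with commit, deploy, and recovery timestamps. In practice these would come from the CI/CD platform's run history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    rolled_back: bool = False
    recovered_at: Optional[datetime] = None  # set when a rollback restored service


def deployment_failure_rate(deployments: Sequence[Deployment]) -> float:
    """Fraction of deployments that required rollback."""
    return sum(d.rolled_back for d in deployments) / len(deployments)


def mean_lead_time(deployments: Sequence[Deployment]) -> timedelta:
    """Average time from commit to production deployment."""
    seconds = mean((d.deployed_at - d.committed_at).total_seconds() for d in deployments)
    return timedelta(seconds=seconds)


def mean_time_to_recovery(deployments: Sequence[Deployment]) -> Optional[timedelta]:
    """Average time from a failed deployment to restored service, if any failed."""
    failures = [d for d in deployments if d.rolled_back and d.recovered_at]
    if not failures:
        return None
    seconds = mean((d.recovered_at - d.deployed_at).total_seconds() for d in failures)
    return timedelta(seconds=seconds)


# Usage with two made-up deployments, one of which was rolled back.
t0 = datetime(2026, 3, 1, 9, 0)
records = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(
        t0,
        t0 + timedelta(hours=6),
        rolled_back=True,
        recovered_at=t0 + timedelta(hours=6, minutes=10),
    ),
]
```

For these two records the failure rate is 0.5, the mean lead time is 5 hours, and the mean time to recovery is 10 minutes.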

ML CI/CD and Model Governance

For organizations in regulated industries, the CI/CD pipeline is not just a convenience; it is a compliance tool. The pipeline creates an auditable record of every model change.

Traceability. Every model in production should be traceable back to the exact code version, data version, and configuration that produced it. The CI/CD pipeline creates this traceability automatically by linking every deployment to the Git commit, training run, and evaluation results that led to it.

Approval gates. For high-risk models (credit decisions, healthcare, safety-critical), the CI/CD pipeline should include human approval gates. A model that passes all automated checks is queued for human review by a model risk manager or compliance officer. The deployment only proceeds after explicit human approval.

Evidence packaging. When regulators ask for documentation of a model's development process, the CI/CD pipeline should be able to generate a complete evidence package: the code, the data lineage, the evaluation results, the fairness tests, the approval record, and the deployment history. Automated evidence packaging saves weeks of manual documentation work during regulatory reviews.

Immutable audit trail. Every action in the CI/CD pipeline (every commit, every training run, every evaluation, every approval, every deployment) should be logged in an immutable audit trail. This trail provides evidence that the organization followed its defined processes and did not bypass safety gates.
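
One common way to approximate an immutable trail is a hash chain, where each entry's hash covers the previous entry. The sketch below is a minimal in-memory illustration with hypothetical event fields. It makes tampering detectable, but it does not prevent tampering, so a production system would also need write-once storage or an equivalent control.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder hash for the first entry in the chain


def _entry_hash(event: Dict, prev_hash: str) -> str:
    payload = json.dumps({"event": event, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(log: List[Dict], event: Dict) -> Dict:
    """Append an event whose hash covers the previous entry, forming a chain."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    entry = {
        "event": event,
        "prev_hash": prev_hash,
        "hash": _entry_hash(event, prev_hash),
    }
    log.append(entry)
    return entry


def verify_chain(log: List[Dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(entry["event"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


# Usage with made-up pipeline events.
audit_log: List[Dict] = []
append_event(audit_log, {"action": "training_run", "actor": "pipeline"})
append_event(audit_log, {"action": "approval", "actor": "risk-manager"})
append_event(audit_log, {"action": "deployment", "actor": "pipeline"})
```

Calling `verify_chain(audit_log)` on the untouched log returns `True`. Changing any field of an earlier event causes it to return `False`.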

Pricing ML CI/CD Engagements

  • CI/CD assessment and design: $10,000 to $25,000
  • Basic ML CI/CD pipeline: $40,000 to $100,000
  • Full ML CI/CD with evaluation gates and canary deployment: $80,000 to $200,000
  • Ongoing pipeline operations and maintenance: $5,000 to $15,000 per month

Your Next Step

This week: Map your client's current model deployment process. Count the manual steps, the handoffs, and the hours consumed. This data makes the case for CI/CD automation.

This month: Build a reference ML CI/CD pipeline for a single model on your preferred CI/CD platform. Include training, evaluation, model comparison, and automated deployment.

This quarter: Deliver your first ML CI/CD engagement. Start with the highest-volume model (the one deployed most frequently) and demonstrate the time savings before expanding to other models.
