A 4% AUC Gain That Crashed Loan Approvals by 18%

Agency Script Editorial

Editorial Team

March 20, 2026 · 12 min read
A/B testing, model deployment, production ML, experimentation

A/B Testing ML Models in Production: The Agency Guide to Safe Deployments

A fintech agency in San Francisco deployed an updated credit scoring model for a lending client. The new model showed a 4% improvement in AUC on the holdout test set. Everyone was confident. They deployed it with a full traffic cutover on a Monday morning. By Wednesday, the client noticed that loan approval rates had dropped 18% even though the model was supposedly "better." The issue: the new model was better at ranking risk but had been calibrated on different thresholds, and nobody had validated the approval rate impact before going live. The client lost three days of optimized lending decisions, and the agency had to roll back to the old model while they diagnosed the issue.

After that incident, the agency implemented mandatory A/B testing for every model deployment. The new workflow: deploy the updated model to 10% of traffic. Monitor business metrics for two weeks. If the new model meets or exceeds all guardrail metrics, gradually increase to 100%. If any guardrail is breached, automatically roll back.

The next model update went smoothly. The 10% test group showed a 3.8% AUC improvement with no change in approval rates or default rates. The rollout to 100% was seamless. No panicked phone calls. No emergency rollbacks. No client trust damage.

A/B testing ML models is not optional. It is the difference between professional AI delivery and reckless deployment. And yet most agencies skip it, shipping models directly to production with nothing but offline metrics and hope.

Why Offline Metrics Lie

Every data scientist knows that holdout test set performance does not guarantee production performance. But very few agencies act on that knowledge. Here is why offline metrics can be misleading:

Distribution shift between test data and production data. Your holdout set is a snapshot of historical data. Production data is live, evolving, and subject to trends, seasonality, and external events that your test set does not capture.

Proxy metric mismatch. You optimize for AUC or F1 score. The business cares about revenue, conversion rate, or customer satisfaction. Better AUC does not always translate to better business outcomes.

Feedback loop effects. In many applications, the model's predictions influence the data it sees next. A recommendation model changes what users click on. A pricing model changes demand patterns. These feedback effects are invisible in offline evaluation.

Calibration differences. Two models can have the same AUC but very different calibration: the probability scores they produce do not map to real-world frequencies the same way. This affects every downstream decision that uses the probability as input.

Feature computation differences. The features available at prediction time may differ subtly from the features in the training set. Stale caches, missing data, timing differences: any of these can make production features different from training features.

A/B testing bypasses all of these issues. Instead of predicting how the model will perform in production, you directly measure how it performs in production. The signal is real, not estimated.
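To make the calibration point concrete, here is a minimal sketch using made-up scores and labels (not data from the incident described earlier). It rescales one model's scores with a monotonic transform, which leaves the ranking, and therefore AUC, unchanged while shifting how many cases clear a fixed threshold. The function names and the 0.5 cutoff are assumptions for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def approval_rate(scores, threshold):
    """Share of applicants whose predicted risk falls below the cutoff."""
    return sum(s < threshold for s in scores) / len(scores)


# Illustrative data only: predicted default risk and observed default (1 = defaulted).
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
old_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]

# A hypothetical "new model" with identical ranking but a different score scale.
new_scores = [s ** 0.5 for s in old_scores]

THRESHOLD = 0.5  # assumed cutoff that downstream decisions already use

auc_old, auc_new = auc(old_scores, labels), auc(new_scores, labels)
rate_old = approval_rate(old_scores, THRESHOLD)
rate_new = approval_rate(new_scores, THRESHOLD)
print(f"AUC old={auc_old:.3f} new={auc_new:.3f}")
print(f"approval rate old={rate_old:.0%} new={rate_new:.0%}")
```

In this toy data both score sets have the same AUC, yet the approval rate at the unchanged threshold drops from 60% to 30%. An offline ranking metric cannot see that kind of shift; a business metric measured on live traffic can.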

A/B Testing Architecture for ML Models

Traffic Splitting

The foundation of model A/B testing is deterministic traffic splitting: assigning each user or request to a model variant consistently.

Hash-based splitting: Hash the user ID (or session ID for anonymous users) and use the hash to assign variants. User 12345 always gets Model A; user 67890 always gets Model B. This ensures consistent experience and prevents the same user from seeing different model behaviors across visits.
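A minimal sketch of hash-based assignment might look like the following. The function name, the experiment label, and the 10% default share are assumptions for illustration; the only real dependency is Python's standard `hashlib`.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.10) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    The experiment name is mixed into the hash so the same user can land in
    different buckets across unrelated experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# Hypothetical usage: the same user always gets the same answer.
print(assign_variant("12345", "credit-model-v2"))
print(assign_variant("12345", "credit-model-v2"))
```

A cryptographic hash is used here rather than Python's built-in `hash()`, because the built-in is salted per process for strings and would not give stable assignments across restarts or servers.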

Stratified splitting: Ensure that important user segments are proportionally represented in both variants. If 20% of your users are premium subscribers, both the control and treatment groups should have approximately 20% premium subscribers.

Implementation options:

  • Feature flags (LaunchDarkly, Unleash, Flagsmith): The simplest approach for most agencies. Configure model routing through a feature flag that controls traffic percentages.
  • API gateway routing: Use your API gateway's traffic splitting capabilities to route a percentage of requests to different model endpoints.
  • Custom routing service: Build a lightweight service that receives prediction requests, determines the variant, routes to the appropriate model, and logs the assignment.

Model Serving for A/B Tests

You need both models running simultaneously. Two common patterns:

Separate endpoints: Deploy each model variant as a separate service with its own endpoint. The routing layer directs traffic to the appropriate endpoint.

  • Advantages: Complete isolation between models. Independent scaling. Easy rollback (just change the routing).
  • Disadvantages: Double the infrastructure cost during the test period.

Single endpoint with model switching: A single serving service loads both models and selects the appropriate one based on the variant assignment.

  • Advantages: Single infrastructure, simpler operations.
  • Disadvantages: Less isolation. A bug in one model can affect the service hosting both.

For most agency deployments, separate endpoints are worth the extra cost. The isolation makes debugging easier and rollbacks instantaneous.

Metric Collection

For every prediction, log:

  • Variant assignment: Which model made this prediction?
  • Prediction input: The features used (or a hash for privacy)
  • Prediction output: The model's raw score and the decision
  • Timestamp: When the prediction was made
  • Outcome (when available): Did the predicted event occur? Was the recommendation clicked? Was the transaction fraudulent?

Store this data in a format optimized for analytical queries, such as a data warehouse (BigQuery, Snowflake, Redshift) or an analytics database (ClickHouse).
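The logging fields listed above could be captured in a record like the one sketched below. The class and field names are hypothetical, and `prediction_id` is an added assumption: some join key is needed to attach the outcome once it arrives, and this sketch simply picks one.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PredictionLog:
    prediction_id: str  # assumed join key for attaching the outcome later
    variant: str  # which model made this prediction
    features_hash: str  # hash of the inputs instead of raw features, for privacy
    score: float  # the model's raw score
    decision: str  # the decision taken from the score
    timestamp: str  # when the prediction was made (UTC, ISO 8601)
    outcome: Optional[bool] = None  # filled in once the outcome is known


def hash_features(features: dict) -> str:
    """Hash a feature dict in a key-order-independent way."""
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_log(prediction_id, variant, features, score, decision) -> str:
    """Return one JSON line ready to ship to the warehouse loader."""
    record = PredictionLog(
        prediction_id=prediction_id,
        variant=variant,
        features_hash=hash_features(features),
        score=score,
        decision=decision,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(record))


print(build_log("p-1", "treatment", {"income": 50000, "age": 41}, 0.37, "approve"))
```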

Statistical Analysis

Choose your analysis method before starting the test:

Frequentist hypothesis testing (t-test, chi-squared test): The traditional approach. Calculate whether the observed difference between variants is statistically significant at a predetermined significance level (typically α = 0.05, often described as 95% confidence).

  • Fixed sample size: Calculate the required sample size before starting, based on the minimum detectable effect and the baseline metric variance
  • Run the test until you reach the required sample size; do not peek at intermediate results and stop early
  • Simple to implement and explain to stakeholders
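For a conversion-style metric, the frequentist comparison can be sketched as a two-proportion z-test using only the standard library. For a 2x2 table this is equivalent to the chi-squared test without continuity correction. The function name and the counts in the example are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in rates between control (a) and treatment (b)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative counts: 10.0% vs 11.0% conversion on 10,000 users each.
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"z={z:.2f}, p={p_value:.4f}")
```

The result is only valid if it is evaluated once, at the pre-planned sample size.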

Bayesian analysis: Calculate the probability that one variant is better than the other. There is no fixed sample size requirement: you can check results at any time and make decisions based on the current probability, although the decision threshold should still be agreed before the test starts.

  • More intuitive for stakeholders: "There is a 94% probability that Model B is better" is easier to understand than "p = 0.03"
  • More flexible: you can stop the test when you have enough confidence, whether that takes one week or four
  • Requires choosing prior distributions, which adds a modeling decision
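As a rough sketch of the Bayesian approach for a binary metric, the snippet below estimates the probability that the treatment rate exceeds the control rate. It assumes independent uniform Beta(1, 1) priors, which is one possible prior choice rather than a recommendation, and it uses Monte Carlo sampling from the standard library. The function name and counts are illustrative.

```python
import random


def prob_b_beats_a(success_a, n_a, success_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + success_a, 1 + n_a - success_a)
        rate_b = rng.betavariate(1 + success_b, 1 + n_b - success_b)
        wins += rate_b > rate_a
    return wins / draws


# Illustrative counts: 10.0% vs 11.0% conversion on 10,000 users each.
probability = prob_b_beats_a(1000, 10_000, 1100, 10_000)
print(f"P(B better than A) = {probability:.1%}")
```

The output is the kind of statement stakeholders tend to find easier to act on: a probability that Model B is better, given the data so far and the chosen prior.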

Sequential testing: A middle ground that uses statistical methods designed for continuous monitoring. Alpha-spending functions or always-valid p-values allow you to check results frequently without inflating false positive rates.
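To give a feel for alpha spending, the sketch below evaluates one common choice, the Lan-DeMets O'Brien-Fleming-type spending function, at four equally spaced looks. The function name and the look schedule are assumptions. Note that this only shows the cumulative alpha budget at each look; turning that budget into per-look decision boundaries requires a group-sequential calculation that is not shown here.

```python
from statistics import NormalDist


def obf_alpha_spent(information_fraction: float, alpha: float = 0.05) -> float:
    """Cumulative two-sided alpha spent under the O'Brien-Fleming-type spending function."""
    if not 0 < information_fraction <= 1:
        raise ValueError("information_fraction must be in (0, 1]")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / information_fraction ** 0.5))


# Assumed schedule: look after 25%, 50%, 75%, and 100% of the planned sample.
looks = [0.25, 0.50, 0.75, 1.00]
spent = [obf_alpha_spent(t) for t in looks]
for t, a in zip(looks, spent):
    print(f"information fraction {t:.2f}: cumulative alpha spent {a:.5f}")
```

Very little of the error budget is available at early looks, so only a dramatic early difference justifies stopping, and the full 5% is reached only at the final look.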

For agency work, Bayesian analysis is usually the best choice. It is more flexible, more intuitive, and produces outputs that non-technical stakeholders can understand.

Designing the A/B Test

Choosing Primary and Guardrail Metrics

Primary metric: The business metric you are trying to improve. Revenue per user, conversion rate, fraud detection rate, recommendation click-through rate. Use one primary metric per test; do not try to optimize for everything simultaneously.

Guardrail metrics: Metrics that must not degrade, even if the primary metric improves. These protect against hidden negative effects.

Examples:

  • If the primary metric is conversion rate, guardrails might include average order value (making sure you are not just converting low-value orders), customer complaint rate, and return rate
  • If the primary metric is fraud detection rate, guardrails might include false positive rate, legitimate transaction block rate, and customer friction score
  • If the primary metric is recommendation click-through rate, guardrails might include conversion rate (clicked but did not buy suggests lower-quality recommendations), revenue per session, and category diversity
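A guardrail check along these lines can be sketched as a small function that compares treatment against control and reports which guardrails have degraded beyond an allowed tolerance. The class, the tolerances, and the metric values below are all hypothetical, and the sketch assumes non-zero control values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_relative_degradation: float  # e.g. 0.02 tolerates a 2% relative worsening
    higher_is_better: bool = True


def breached_guardrails(guardrails, control, treatment):
    """Return the names of guardrail metrics that worsened beyond their tolerance."""
    breached = []
    for g in guardrails:
        base, new = control[g.metric], treatment[g.metric]
        change = (new - base) / base  # assumes base != 0
        degradation = -change if g.higher_is_better else change
        if degradation > g.max_relative_degradation:
            breached.append(g.metric)
    return breached


# Illustrative guardrails for a test whose primary metric is conversion rate.
guardrails = [
    Guardrail("average_order_value", 0.02, higher_is_better=True),
    Guardrail("return_rate", 0.10, higher_is_better=False),
]
control = {"average_order_value": 80.0, "return_rate": 0.050}
treatment = {"average_order_value": 79.5, "return_rate": 0.058}
result = breached_guardrails(guardrails, control, treatment)
print(result)
```

This compares point estimates only. In practice you would likely want the comparison to account for sampling noise, for example by checking a confidence or credible interval, so that a guardrail does not trip on random fluctuation.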

Determining Sample Size and Duration

Minimum test duration: 2 weeks. This captures day-of-week effects and provides enough data for stable estimates.

Sample size depends on:

  • The baseline rate of the primary metric
  • The minimum detectable effect (the smallest improvement worth detecting)
  • The desired statistical power (typically 80%)
  • The number of variants (more variants require more total traffic)

Rule of thumb: Required sample size grows quickly as the baseline rate falls. Assuming a two-sided test at 95% confidence and 80% power, detecting a 5% relative improvement in a metric with a 2% baseline rate takes roughly 300,000 observations per variant. Detecting the same relative improvement in a metric with a 20% baseline rate takes roughly 25,000 observations per variant.
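Sample size figures are sensitive to the assumed significance level, power, and baseline, so it is worth recomputing them for your own test rather than relying on a rule of thumb. The sketch below uses the standard normal-approximation formula for comparing two proportions; the function name and defaults (two-sided α = 0.05, 80% power) are assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Observations needed per variant to detect a relative lift in a rate.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n_low_baseline = sample_size_per_variant(baseline=0.02, relative_lift=0.05)
n_high_baseline = sample_size_per_variant(baseline=0.20, relative_lift=0.05)
print(n_low_baseline, n_high_baseline)
```

Dividing the result by your daily eligible traffic per variant gives a first estimate of how long the test needs to run.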

Ramp-up schedule:

  • Day 1: 5% of traffic to the new model (safety check)
  • Day 3: If no issues, increase to 10%
  • Day 7: If no issues, increase to 25%
  • Day 14: If no issues and results are positive, increase to 50%
  • Day 21: If results confirm, increase to 100%

This gradual ramp limits the blast radius if the new model has an unexpected issue.
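The ramp-up schedule above can be expressed as data plus a small lookup, sketched below. The names are hypothetical, and the single `healthy` flag is a simplification: it stands in for both "no issues" and, at the later steps, "results are positive".

```python
# (day of test, percent of traffic to the new model), taken from the schedule above.
RAMP_SCHEDULE = [(1, 5), (3, 10), (7, 25), (14, 50), (21, 100)]


def target_traffic_percent(day: int, healthy: bool, schedule=RAMP_SCHEDULE) -> int:
    """Traffic share for the new model on a given day of the test.

    Returns 0 (full rollback to the control model) as soon as the health check fails.
    """
    if not healthy or day < 1:
        return 0
    percent = 0
    for start_day, share in schedule:
        if day >= start_day:
            percent = share
    return percent


for day in (1, 3, 10, 14, 21):
    print(day, target_traffic_percent(day, healthy=True))
```

Keeping the schedule as data makes it easy to record in the test plan and to adjust per client without touching the routing logic.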

Test Pre-Registration

Before starting the test, document:

  • What you are testing and why
  • The primary metric and guardrail metrics
  • The minimum detectable effect
  • The planned sample size and duration
  • The decision criteria (what result leads to deployment vs. rollback)
  • The ramp-up schedule

This prevents post-hoc rationalization, the temptation to cherry-pick metrics that look good after seeing the results.
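One lightweight way to make a pre-registration hard to quietly revise is to store it as an immutable record checked into version control before the test starts. The sketch below is one possible shape; the class name, field names, and every value in the example are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PreRegistration:
    hypothesis: str  # what you are testing and why
    primary_metric: str
    guardrail_metrics: Tuple[str, ...]
    minimum_detectable_effect: float  # relative, e.g. 0.05 for 5%
    sample_size_per_variant: int
    duration_days: int
    decision_criteria: str  # what result leads to deployment vs. rollback
    ramp_schedule: Tuple[Tuple[int, int], ...]  # (day, percent of traffic)


# Hypothetical example values.
plan = PreRegistration(
    hypothesis="New ranking model lifts conversion without lowering order value",
    primary_metric="conversion_rate",
    guardrail_metrics=("average_order_value", "customer_complaint_rate", "return_rate"),
    minimum_detectable_effect=0.05,
    sample_size_per_variant=25_000,
    duration_days=21,
    decision_criteria="Deploy if primary metric improves and no guardrail is breached",
    ramp_schedule=((1, 5), (3, 10), (7, 25), (14, 50), (21, 100)),
)
print(plan.primary_metric, plan.guardrail_metrics)
```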

Common A/B Testing Mistakes in ML

Mistake 1: Peeking and stopping early. Checking results daily and stopping the test as soon as significance is reached inflates false positive rates dramatically. Use sequential testing methods if you need to monitor continuously, or commit to the pre-planned duration.

Mistake 2: Testing too many changes simultaneously. If the new model has a different architecture, different features, and different post-processing, you cannot determine which change caused the observed effect. Ideally, change one thing at a time.

Mistake 3: Ignoring novelty effects. Users may react differently to a new experience simply because it is new, not because it is better. This is especially common with recommendation systems: users explore novel recommendations initially, inflating click-through rates. Run tests long enough for the novelty to wear off.

Mistake 4: Not accounting for network effects. In some applications, treating one user differently affects other users. In a marketplace, showing different prices to different users affects supply and demand dynamics for everyone. Standard A/B testing assumes independence between units, which does not hold in these cases.

Mistake 5: Only testing in ideal conditions. Run your A/B test through the full range of conditions: weekdays and weekends, peak and off-peak, normal and promotional periods. A model that wins on Tuesday might lose on Black Friday.

Mistake 6: Not testing the rollback. Verify that you can roll back to the old model instantly if problems are detected. Test the rollback mechanism before you need it.
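Mistake 1 is easy to demonstrate with a small A/A simulation, where both variants share the same true rate and any "significant" result is a false positive. The sketch below compares stopping at the first significant look against evaluating once at the end. All parameters (300 simulated tests, 10 looks, 200 users per variant per look, a 10% rate) are arbitrary choices for illustration.

```python
import random
from math import sqrt


def is_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-proportion z-test at roughly the 5% level, with n users in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_b - conv_a) / n / se > z_crit


def simulate_peeking(n_tests=300, looks=10, users_per_look=200, rate=0.10, seed=1):
    """Return (false positive rate with peeking, false positive rate at final look)."""
    rng = random.Random(seed)
    any_look_hits = 0
    final_look_hits = 0
    for _test in range(n_tests):
        conv_a = conv_b = n = 0
        hit_any = False
        hit_final = False
        for _look in range(looks):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            hit_final = is_significant(conv_a, conv_b, n)
            hit_any = hit_any or hit_final
        any_look_hits += hit_any
        final_look_hits += hit_final
    return any_look_hits / n_tests, final_look_hits / n_tests


peek_rate, fixed_rate = simulate_peeking()
print(f"false positives when stopping at first significant look: {peek_rate:.1%}")
print(f"false positives when testing once at the end: {fixed_rate:.1%}")
```

The peeking rate can never be lower than the fixed-look rate, since a test that is significant at the final look is also significant at some look. How much higher it is depends on how often you check.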

Automating A/B Testing for Agencies

If you deliver ML models regularly, build a reusable A/B testing framework:

Components:

  • Traffic splitting service with configurable percentages per experiment
  • Metric collection pipeline that links predictions to outcomes
  • Statistical analysis module that computes significance and win probabilities
  • Dashboard showing real-time experiment results with guardrail monitoring
  • Automated alerts when guardrails are breached
  • Automated ramp-up logic that increases traffic percentage based on predefined criteria
  • Rollback automation that reverts to the control model if guardrails are violated

This framework, once built, reduces the A/B testing overhead for each new model deployment from weeks to hours. The first deployment takes effort to set up. Every subsequent deployment follows the same pattern with minimal configuration.

Pricing A/B Testing as Part of Model Delivery

Include A/B testing in every model deployment project. It is not an optional add-on; it is a professional delivery requirement.

Budget allocation:

  • A/B testing infrastructure setup: $8,000 - $15,000 (one-time, amortized across projects)
  • Per-model test design, execution, and analysis: $5,000 - $10,000
  • Monitoring during the test period: included in the operations retainer

Frame it to clients: "Our deployment process includes a controlled rollout with real-time monitoring. This ensures that every model update improves your business metrics before it goes to 100% of your users. No surprises, no rollbacks, no lost revenue."

Your Next Step

For your next model deployment, set up a simple A/B test. Route 10% of traffic to the new model and 90% to the existing model. Track the primary business metric for both groups for two weeks. Compare the results. Even if you do not have a formal statistical framework yet, this basic comparison will reveal whether the new model actually improves business outcomes, not just offline metrics. Build from there. Once you see the value of A/B testing firsthand, you will never ship a model without it again.
