A Registry Is Infrastructure; Plays Make It Behave Under Pressure

A registry is infrastructure. A playbook is what turns that infrastructure into reliable behavior under pressure. Most teams have some form of version control but no playbook, so every deploy, incident, and audit request becomes an improvised decision made by whoever happens to be around. This article is the operating playbook: the specific plays, what triggers each one, who owns it, and the order in which the steps run. It assumes you have the basic mechanics from A Framework for AI Model Version Control and want to operationalize them.

Read each play as a named, repeatable routine. The value is not the cleverness of any single play; it is that nobody has to think from scratch when the moment arrives.

Play 1: Register a New Version

Trigger: any change to a behavior-affecting component — new fine-tune, prompt edit, config change, or reindex.

Owner: the engineer making the change.

Sequence:

Compute the artifact identity (hash, fine-tune ID, or composite ID for system changes).
Run the standard eval suite against the candidate.
Register the version with mandatory metadata: owner, parent version, change reason, eval score.
Store the artifact or reference under its immutable ID.

This play should be automated or enforced at the deploy gate. If a human can skip it, the practice rots. The whole playbook depends on this one running every time.
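As a rough illustration of the registration steps above, here is a minimal sketch in Python. The `Registry` class, the `composite_id` helper, and the metadata field names are all hypothetical and stand in for whatever registry you actually run; the eval run itself is assumed to have happened already and to arrive as `eval_score`.

```python
import hashlib
import json
from dataclasses import dataclass

# Mandatory metadata named in the play; the field names themselves are assumptions.
REQUIRED_FIELDS = ("owner", "parent_version", "change_reason", "eval_score")


def composite_id(components: dict) -> str:
    """Hash every behavior-affecting component into one stable identifier."""
    canonical = json.dumps(components, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:16]


@dataclass(frozen=True)
class VersionRecord:
    version_id: str
    components: dict
    metadata: dict


class Registry:
    """Hypothetical in-memory registry; a real one would sit on durable storage."""

    def __init__(self) -> None:
        self._versions = {}

    def register(self, components: dict, metadata: dict) -> VersionRecord:
        # The gate: refuse registration when any mandatory field is absent.
        missing = [f for f in REQUIRED_FIELDS if f not in metadata]
        if missing:
            raise ValueError(f"missing mandatory metadata: {missing}")
        version_id = composite_id(components)
        # IDs are immutable: identical components map to the existing record.
        if version_id not in self._versions:
            self._versions[version_id] = VersionRecord(
                version_id, dict(components), dict(metadata)
            )
        return self._versions[version_id]
```

Hashing a canonical, key-sorted serialization means a prompt edit or reindex changes the ID even when the weights do not, which is the point of a composite identity. A root version can pass `parent_version` as `None`, but the key must still be present.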

Play 2: Promote to Production

Trigger: a registered version passes eval thresholds and is selected to go live.

Owner: the model owner, with a second reviewer for high-stakes models.

Sequence:

Confirm the candidate's eval score meets or beats the current production version.
For higher-stakes systems, canary the version to a small traffic slice and compare live metrics against the incumbent.
Repoint the production tag to the new version.
Record the promotion event with timestamp and approver.

The deliberate design choice here is that promotion is a tag move, never a redeploy of an artifact by hand. That single indirection is what makes Play 3 fast.
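One way to picture the tag move is the sketch below. `TagStore` and its `promote` method are hypothetical names, the eval scores are assumed to be available as a plain dictionary, and the canary step is left out for brevity.

```python
from datetime import datetime, timezone


class TagStore:
    """Maps mutable tags such as "production" to immutable version IDs."""

    def __init__(self) -> None:
        self.tags = {}
        self.events = []

    def promote(self, tag: str, candidate_id: str, eval_scores: dict, approver: str) -> dict:
        current_id = self.tags.get(tag)
        # The candidate must meet or beat the incumbent's eval score.
        if current_id is not None and eval_scores[candidate_id] < eval_scores[current_id]:
            raise ValueError(f"{candidate_id} scores below incumbent {current_id}")
        # Promotion is only a pointer update; no artifact is copied or rebuilt.
        self.tags[tag] = candidate_id
        # Record who approved the move and when.
        event = {
            "tag": tag,
            "from": current_id,
            "to": candidate_id,
            "approver": approver,
            "at": datetime.now(timezone.utc).isoformat(),
        }
        self.events.append(event)
        return event
```

Because the serving path resolves the tag rather than a hard-coded artifact, going live and going back are the same cheap operation.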

Play 3: Emergency Rollback

Trigger: a production regression detected via monitoring, eval drift, or incident report.

Owner: whoever is on call — which means everyone must know this play.

Sequence:

Identify the last known-good version from the registry (the one with a passing eval that ran cleanly in production).
Repoint the production tag to that version.
Confirm behavior reverts via the eval suite or live metrics.
Open an incident record linking the bad version, the symptom, and the recovery time.

The metric that matters is time-to-rollback. If it exceeds a few minutes, the indirection from Play 2 is broken; fix it before the next incident. Rehearse this play on a schedule, not just during real incidents. The Hidden Risks of AI Model Version Control (and How to Manage Them) explains why an unrehearsed rollback is a liability.
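The rollback sequence can be sketched as follows. The `known_good` flag on each history entry, the shape of the incident record, and the injectable `clock` are assumptions made for illustration; confirming that behavior actually reverted is left to your eval suite or live metrics.

```python
import time


def last_known_good(history: list, bad_version: str) -> str:
    """Walk promotion history newest-first for a version marked known-good."""
    for entry in reversed(history):
        if entry["version_id"] != bad_version and entry["known_good"]:
            return entry["version_id"]
    raise LookupError("no known-good version in promotion history")


def emergency_rollback(tags, history, symptom, detected_at, clock=time.time):
    bad_version = tags["production"]
    target = last_known_good(history, bad_version)
    tags["production"] = target  # the same tag move that promotion uses
    recovered_at = clock()
    # Incident record: bad version, symptom, and recovery time.
    return {
        "bad_version": bad_version,
        "restored_version": target,
        "symptom": symptom,
        "time_to_rollback_s": recovered_at - detected_at,
    }
```

Passing the clock in as a parameter makes the play easy to rehearse: a drill can feed in fixed timestamps and assert on `time_to_rollback_s` without waiting on real time.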

Play 4: Run an Experiment

Trigger: a hypothesis worth testing — a new fine-tune direction, a prompt variant, a different model.

Owner: the experimenting engineer.

Sequence:

Branch from a registered parent version, recording the lineage.
Register the experimental version with an experiment tag and a short retention window.
Evaluate against the same suite as production for a fair comparison.
If it wins, promote via Play 2; if it loses, let retention garbage-collect it while keeping the metadata.

This play keeps experimentation safe and traceable. Because rollback is trivial, experiments can be bold. The lineage record means a winning result is always explainable.
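A minimal sketch of the branching and lineage steps, assuming versions live in a plain dictionary keyed by ID. The function names, the record fields, and the 14-day default retention window are hypothetical choices, not a recommendation.

```python
from datetime import datetime, timedelta, timezone


def branch_experiment(versions, parent_id, version_id, hypothesis, now=None, retention_days=14):
    """Register an experimental version that records its parent and an expiry."""
    if parent_id not in versions:
        raise KeyError(f"parent {parent_id} is not a registered version")
    now = now or datetime.now(timezone.utc)
    versions[version_id] = {
        "parent": parent_id,
        "tag": "experiment",
        "hypothesis": hypothesis,
        "expires_at": now + timedelta(days=retention_days),
    }
    return versions[version_id]


def lineage(versions, version_id):
    """Follow parent pointers back to the root, newest first."""
    chain = []
    current = version_id
    while current is not None:
        chain.append(current)
        current = versions[current]["parent"]
    return chain
```

Refusing to branch from an unregistered parent is what keeps the lineage chain unbroken, so a winning experiment can always be traced back to the version it started from.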

Play 5: Respond to an Audit Request

Trigger: an internal or external request to prove which version made a specific decision.

Owner: the model owner or compliance contact.

Sequence:

Look up the request in production logs to find the exact version ID that handled it.
Pull that version's full record: model, prompt, config, dataset lineage, eval score.
Provide the reproducible state and the approval trail.

This play only works if Play 2 logged request-to-version attribution. If production responses do not record the version that served them, this play fails — which is the single most common audit gap.
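To make the attribution requirement concrete, the sketch below stamps each request with the version ID that the production tag resolved to at serving time, then uses that log to answer an audit. The `serve` and `audit` functions and the log and record shapes are assumptions for illustration only.

```python
def serve(request_id, tags, request_log):
    """Resolve the production tag and log which version handled the request."""
    version_id = tags["production"]
    request_log.append({"request_id": request_id, "version_id": version_id})
    return version_id


def audit(request_id, request_log, versions, promotions):
    """Assemble the evidence for one request: version record plus approvals."""
    entry = next((e for e in request_log if e["request_id"] == request_id), None)
    if entry is None:
        raise LookupError(f"no attribution logged for request {request_id}")
    version_id = entry["version_id"]
    return {
        "request_id": request_id,
        "version_id": version_id,
        "record": versions[version_id],
        "approvals": [p for p in promotions if p["to"] == version_id],
    }
```

The `LookupError` branch is the audit gap described above: if the log entry was never written, no amount of registry metadata can recover which version answered.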

Play 6: Garbage Collection

Trigger: a scheduled retention cycle.

Owner: the version control owner.

Sequence:

Identify versions past their retention window that never served production.
Delete the heavy artifacts.
Retain all metadata, eval scores, and lineage permanently.

This play keeps storage cost and registry clutter bounded without destroying the audit trail. Decide the retention policy before you need this play, because retrofitting deletion onto a sprawling store is painful.
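A retention pass along these lines might look like the sketch below. The record fields (`expires_at`, `artifact_uri`) and the `served_in_production` set are hypothetical, and the actual deletion from storage is represented only by clearing the reference.

```python
from datetime import datetime, timezone


def collect_garbage(versions, served_in_production, now=None):
    """Drop heavy artifacts for expired, never-served versions; keep metadata."""
    now = now or datetime.now(timezone.utc)
    collected = []
    for version_id, record in versions.items():
        expired = record["expires_at"] is not None and record["expires_at"] <= now
        eligible = expired and version_id not in served_in_production
        if eligible and record["artifact_uri"] is not None:
            # A real implementation would delete the stored object here.
            record["artifact_uri"] = None
            record["artifact_deleted_at"] = now
            collected.append(version_id)
    return collected
```

Only the artifact reference is cleared; eval scores, lineage, and every other metadata field stay in the record, so the audit trail survives the cleanup.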

Sequencing the Plays Into a Rhythm

Individually these plays are simple. The discipline is in how they chain: Play 1 feeds Play 2, Play 2 enables a fast Play 3, Play 4 routes winners back through Play 2, and Play 5 depends on attribution that Play 2 must capture. Run them as a system and the practice becomes boring in the best way: predictable under pressure. For the workflow that strings the day-to-day plays together, see Building a Repeatable Workflow for AI Model Version Control.

Play 7: Investigate a Quality Regression

Trigger: monitoring or user reports suggest output quality has slipped, but the cause is unclear.

Owner: the model owner.

Sequence:

Pull the production version history and identify what changed and when relative to the symptom's onset.
Check the full composite version — not just weights, but prompt, config, and index — since the change is often in a component people forget to inspect.
Reproduce the suspect version against the eval suite to confirm the regression is real and not noise.
If confirmed, run Play 3 to roll back; if not, widen the investigation to data or upstream dependencies.

The reason this play exists separately from emergency rollback is that not every quality complaint is a clear-cut regression. This play is the diagnostic step that decides whether rollback is even the right move. It depends entirely on having a clean version history to correlate against — without it, you are guessing.
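The first two steps of the investigation, correlating promotions with the symptom's onset and inspecting the full composite version, can be sketched as a component diff. The function names and the history format (a list sorted by ascending `promoted_at`, each entry carrying its `components`) are assumptions.

```python
def diff_components(old: dict, new: dict) -> dict:
    """Return {component: (old_value, new_value)} for every component that differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


def changes_before_onset(history: list, onset: float) -> list:
    """List promotions at or before the onset, newest first, with what changed."""
    suspects = []
    for previous, current in zip(history, history[1:]):
        if current["promoted_at"] <= onset:
            suspects.append({
                "version_id": current["version_id"],
                "promoted_at": current["promoted_at"],
                "changed": diff_components(previous["components"], current["components"]),
            })
    return list(reversed(suspects))
```

Diffing every component, not just the weights, surfaces cases such as a reindex that shipped with unchanged weights and prompt. The newest suspect is the natural first candidate to reproduce against the eval suite.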

Adapting the Playbook to Your Stakes

These plays are written for a team running models in production with real consequences. Scale them to your situation rather than adopting all seven wholesale. A low-stakes internal tool might run only Plays 1, 2, and 3 — register, promote, roll back — and skip the audit and investigation plays until they are needed. A regulated, high-traffic system needs all seven plus stricter ownership and review. The discipline is not running every play; it is having decided, in advance, which plays you run and who owns them, so nobody improvises a critical decision at the worst moment. Match the playbook's weight to the cost of getting it wrong.

Frequently Asked Questions

What is the difference between a playbook and a workflow here?

A workflow is the routine day-to-day sequence of versioning and deploying. A playbook is the broader set of named plays including the exceptional moments — emergency rollback, audit response, garbage collection — each with its own trigger and owner. The playbook covers the situations the happy-path workflow does not.

Who should own the emergency rollback play?

Whoever is on call, which means the entire team must know it cold. Emergency rollback is the one play you cannot afford to improvise, so it should be rehearsed on a schedule and documented to a single page.

How do I make sure the audit play actually works?

Ensure the promotion play logs the exact version ID on every production response. The audit play depends entirely on request-to-version attribution; without it you can describe versions that existed but not prove which one handled a specific case.

Should every play be automated?

Registration and garbage collection benefit most from automation; promotion and rollback need human judgment but should be mechanically simple. The goal is that judgment-free steps are automated and judgment steps are fast and well-defined, not that everything runs untouched.

How often should we rehearse the rollback play?

On a regular cadence — at minimum quarterly, and after any change to the deployment path. An unrehearsed rollback decays into a hope, and the cost of discovering it is broken arrives at the worst possible moment.

Key Takeaways

Treat version control as a set of named plays, each with a trigger, an owner, and a fixed sequence.
Registration must be automated or enforced at the deploy gate, because every other play depends on it.
Promotion is a tag move, which is what makes emergency rollback fast — and rollback must be rehearsed.
The audit play depends on logging request-to-version attribution during promotion.
Garbage collection bounds cost by dropping heavy artifacts while retaining all metadata and lineage forever.

Agency Script Editorial
