AGENCYSCRIPT
General

From 11 Brittle Links Down to 4 Reliable Ones

Agency Script Editorial

Editorial Team

March 9, 2024 · 6 min read
Tags: prompt chaining, prompt chaining case study, prompt chaining guide, prompt engineering

A product team set out to automate one of their most tedious workflows: turning recorded customer interviews into structured, actionable insight records. The transcripts were long, the formats varied, and a human analyst was spending hours per interview pulling out themes, quotes, and recommended actions. Automating it with a prompt chain seemed obvious. Their first attempt nearly convinced them chaining did not work.

This is the story of that first attempt, why it failed, the decision to rebuild it differently, and what the rebuilt chain delivered. The numbers in this account are illustrative of the shape of the outcome rather than measurements from a specific deployment, but the arc, over-decomposition followed by disciplined redesign, is one almost every team repeats.

The lesson is not that prompt chaining is hard. It is that the instinct to split a task into as many pieces as possible is exactly backward, and that the fix is usually fewer, better-defined links.

The Situation

The analyst workflow had a clear shape, which is why it looked so chainable.

The Manual Process

For each interview, the analyst would read the transcript, identify recurring themes, pull representative quotes, judge sentiment, map themes to the product roadmap, and write a short recommendation. Six distinct mental steps, each one a candidate for its own link.

The First Design

The team built a chain that mirrored the mental process exactly, and then some. Counting setup and formatting steps, it ran eleven links deep. Each link was reasonable on its own. In isolated testing, every link passed.

The Decision

When the chain went live, the insight records were unreliable. Themes were sometimes missing, quotes occasionally did not match the themes, and the recommendations read as generic.

Diagnosing the Failure

The team did the math that the Prompt Chaining: Best Practices That Actually Work guide recommends. Eleven links, each around 92 percent reliable, multiply to roughly 40 percent end-to-end reliability. The chain was not broken at any one point; it was broken everywhere a little, and the errors compounded. Worse, with no per-link logging, they could not see which link started the cascade.
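
The compounding arithmetic is quick to check. The following is a minimal sketch, assuming every link succeeds independently and with the same probability (a simplification); the function name is illustrative, not something from the guide.

```python
def end_to_end_reliability(per_link: float, links: int) -> float:
    """Probability that every link succeeds, assuming independent failures."""
    if not 0.0 <= per_link <= 1.0:
        raise ValueError("per_link must be a probability between 0 and 1")
    if links < 0:
        raise ValueError("links must be non-negative")
    return per_link ** links


# Eleven links at 92 percent each, as in the first design.
eleven_link_chain = end_to_end_reliability(0.92, 11)
print(f"{eleven_link_chain:.1%}")  # prints 40.0%
```

Under the same assumption, every additional 92 percent link multiplies the total by 0.92, which is why a chain can degrade badly without any single link looking broken.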

Choosing to Rebuild

Rather than patch individual links, they decided to redesign from the task down. The guiding question changed from "what are all the steps?" to "what is the fewest number of links where each is independently reliable?" This reframing came straight from A Framework for Prompt Chaining.

The Execution

The rebuild collapsed eleven links into four, each with a strict contract and validation.

The New Link Structure

  • Link one extracted themes and their supporting quotes together, returning a JSON array with each theme tied to its quotes.
  • Link two validated and scored each theme's relevance to the roadmap.
  • Link three wrote one recommendation per high-relevance theme.
  • Link four assembled the final structured record.
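
The four links above can be sketched as four small functions. This is an illustration under stated assumptions, not the team's code: the field names (`theme`, `quotes`, `relevance`, `recommendation`), the 0.7 relevance threshold, the prompt prefixes, and the `fake_model` stand-in are all hypothetical, and link four is assumed here to be plain code rather than a model call.

```python
import json
from typing import Callable, Dict, List

ModelFn = Callable[[str], str]  # hypothetical: prompt text in, model text out


def extract_themes(transcript: str, model: ModelFn) -> List[Dict]:
    # Link one: themes and their supporting quotes come back together.
    return json.loads(model("EXTRACT\n" + transcript))


def score_themes(themes: List[Dict], model: ModelFn) -> List[Dict]:
    # Link two: attach a roadmap relevance score to each theme.
    return json.loads(model("SCORE\n" + json.dumps(themes)))


def recommend(themes: List[Dict], model: ModelFn, threshold: float = 0.7) -> List[Dict]:
    # Link three: one recommendation per high-relevance theme.
    relevant = [t for t in themes if t["relevance"] >= threshold]
    return [{**t, "recommendation": model("RECOMMEND\n" + t["theme"])} for t in relevant]


def assemble(interview_id: str, insights: List[Dict]) -> Dict:
    # Link four: build the final structured record.
    return {"interview_id": interview_id, "insights": insights}


def fake_model(prompt: str) -> str:
    # Canned responses so the sketch runs without calling any real API.
    task, _, body = prompt.partition("\n")
    if task == "EXTRACT":
        return json.dumps([
            {"theme": "Slow onboarding", "quotes": ["Setup took me a week."]},
            {"theme": "Pricing page", "quotes": ["I could not find the price."]},
        ])
    if task == "SCORE":
        scores = {"Slow onboarding": 0.9, "Pricing page": 0.4}
        return json.dumps([{**t, "relevance": scores[t["theme"]]} for t in json.loads(body)])
    return "Investigate: " + body


def run_chain(interview_id: str, transcript: str, model: ModelFn) -> Dict:
    themes = extract_themes(transcript, model)
    scored = score_themes(themes, model)
    insights = recommend(scored, model)
    return assemble(interview_id, insights)


record = run_chain("int-001", "Customer: Setup took me a week.", fake_model)
print(json.dumps(record, indent=2))
```

Because each quote travels inside the theme object it supports, no later link has to re-match quotes to themes.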

What Changed Beyond the Count

Merging theme and quote extraction into one link solved the mismatch problem, because the quotes were now produced alongside the themes they supported rather than in a separate step that had to re-find them. The team added logging to every link and a validation check after the extraction step, so a malformed result stopped the chain instead of poisoning it. This validation discipline is covered in A Step-by-Step Approach to Prompt Chaining.
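
A validation check of this kind might look like the sketch below. It assumes the extraction link returns a JSON array of objects with hypothetical `theme` and `quotes` fields; the article does not specify the team's actual schema, and `ContractError` is an illustrative name.

```python
import json
from typing import Any, Dict, List


class ContractError(ValueError):
    """Raised when a link's output breaks its contract, halting the chain."""


def validate_extraction(raw: str) -> List[Dict[str, Any]]:
    """Parse and check the extraction link's output before anything runs downstream."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"extraction output is not valid JSON: {exc}") from exc
    if not isinstance(data, list) or not data:
        raise ContractError("expected a non-empty JSON array of themes")
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ContractError(f"item {i} is not an object")
        theme = item.get("theme")
        quotes = item.get("quotes")
        if not isinstance(theme, str) or not theme.strip():
            raise ContractError(f"item {i} has no theme name")
        if not isinstance(quotes, list) or not quotes:
            raise ContractError(f"item {i} has no supporting quotes")
        if not all(isinstance(q, str) and q.strip() for q in quotes):
            raise ContractError(f"item {i} has an empty or non-text quote")
    return data


good = '[{"theme": "Slow onboarding", "quotes": ["Setup took me a week."]}]'
bad = '[{"theme": "Slow onboarding", "quotes": []}]'

validate_extraction(good)
try:
    validate_extraction(bad)
except ContractError as err:
    print("chain stopped:", err)
```

Raising an exception, rather than returning a partial result, is what keeps a malformed extraction from reaching the scoring and recommendation links.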

The Outcome

The four-link chain performed dramatically better than its eleven-link predecessor.

Measurable Improvements

  • End-to-end reliability rose from roughly 40 percent to the high 80s, consistent with four links each at about 97 percent (four links at exactly 95 percent would compound to only about 81 percent).
  • Quote-to-theme mismatches, the most visible defect, effectively disappeared because the two were extracted together.
  • When failures did occur, per-link logging let the team identify the responsible link in minutes instead of hours.
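
The reliability figure can also be read in reverse: given an end-to-end target, how reliable does each link need to be? The sketch below assumes independent, equally reliable links and takes 88 percent as a stand-in for "the high 80s"; both are assumptions for illustration.

```python
def required_per_link(target: float, links: int) -> float:
    """Per-link reliability needed to hit an end-to-end target, assuming independence."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if links < 1:
        raise ValueError("links must be at least 1")
    return target ** (1.0 / links)


for link_count in (4, 11):
    needed = required_per_link(0.88, link_count)
    print(f"{link_count} links need {needed:.1%} per link")
```

With four links, each needs to be right roughly 97 percent of the time; with eleven, each would need to approach 99 percent, a much harder bar for any single prompt to clear.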

The Broader Lesson

The team's takeaway was counterintuitive. The fix for an unreliable chain was not more structure but less, paired with stricter contracts and real observability. The full set of patterns they ended up following is captured in the Prompt Chaining Checklist for 2026.

What the Team Would Do Differently

Asked what they would change if they started over, the team named four things, each one a lesson worth borrowing before you build your own chain.

Start With the Outcome, Not the Process

Their first design copied the analyst's mental steps one for one. In hindsight, the right starting point was the structured record they wanted at the end, working backward to the fewest links that could produce it. Mirroring a human process produces too many links because humans break work into small mental moves that a model can often handle in one pass. The lesson maps directly to the design discipline in A Framework for Prompt Chaining.

Add Observability on Day One

The most painful part of the first attempt was not that it failed but that they could not see why. Logging every link from the start would have turned a multi-day investigation into a quick diagnosis. They now treat per-link logging as the first thing they build, not the last.
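
Per-link logging does not need much machinery. The sketch below uses Python's standard `logging` module and a hypothetical `logged_link` wrapper to record each link's input, output, and timing; the wrapper name and log format are assumptions, not the team's implementation.

```python
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("chain")


def logged_link(name: str, fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Wrap a link so its input, output, and duration are always logged."""

    def wrapper(payload: Any) -> Any:
        start = time.perf_counter()
        logger.info("link=%s input=%r", name, payload)
        try:
            result = fn(payload)
        except Exception:
            # Log the failing link by name, then let the error stop the chain.
            logger.exception("link=%s failed after %.3fs", name, time.perf_counter() - start)
            raise
        logger.info("link=%s output=%r elapsed=%.3fs", name, result, time.perf_counter() - start)
        return result

    return wrapper


logging.basicConfig(level=logging.INFO)
shout = logged_link("uppercase", str.upper)  # str.upper stands in for a real link
shout("hello")
```

Logging full payloads with `%r` is fine for a prototype; for long transcripts you would likely truncate the payload or log a reference to it instead.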

Validate the Foundation Link Hardest

Because the extraction link fed everything downstream, its errors did the most damage. In the rebuild, that link got the strictest contract and the most thorough validation. The team learned to spend their validation budget where the blast radius is largest, on the early, foundational links. A small investment in the first link's reliability paid off many times over because every downstream link inherited its quality.

Treat the Rebuild as the Real Design

In hindsight, the team came to see the eleven-link version not as a failure but as an expensive way to learn the requirements. Had they prototyped on a handful of inputs and measured before scaling, they would have reached the four-link design far sooner. The lesson they pass along is to measure early and often, treating the first build as a draft rather than a commitment.

How to Apply This to Your Own Work

The arc here (over-decompose, measure, rebuild shorter) is so common that it is worth getting ahead of. Before you build, ask whether your link plan is mirroring a human process. If it is, push for the fewest links that produce the outcome. Add logging before you need it, and put your strictest validation on the earliest links. Doing this from the start lets you skip the painful first version entirely. The concrete procedure for building this kind of chain is laid out in A Step-by-Step Approach to Prompt Chaining.

Frequently Asked Questions

Why did the eleven-link chain fail when each link tested fine in isolation?

Because reliability multiplies across links. Eleven links at around 92 percent each compound to roughly 40 percent end to end. The chain failed not at one point but a little everywhere, and the errors accumulated.

Why did merging links improve quote accuracy?

Extracting themes and their quotes together meant each quote was produced alongside the theme it supported. The original design extracted them in separate links, forcing a later step to re-match quotes to themes, which it sometimes got wrong.

How did logging change the team's ability to fix problems?

Per-link logging let them see each intermediate output, so when a failure occurred they could identify the responsible link in minutes. Without it, a wrong final result gave them nowhere to look.

Is fewer links always better?

Fewer is better only up to the point where each link stays independently reliable. The goal is the smallest number of links that reliably do the job, not the smallest number possible. Merging steps that a model handles reliably in one pass is the right move; merging steps into a single link that then becomes unreliable is not.

What was the single most important change in the rebuild?

Reframing the design question from listing every step to finding the fewest reliable links. That shift drove the collapse from eleven links to four and, combined with contracts and logging, produced the reliability gain.

Key Takeaways

  • Mirroring a manual process step for step often produces too many links and poor reliability.
  • Reliability multiplies across links, so many decent links can compound into a poor end-to-end result.
  • Merging steps that belong together, like themes and their quotes, removes whole classes of mismatch errors.
  • Strict contracts plus validation after key links stop malformed output from propagating.
  • Per-link logging turns hours of debugging into minutes by exposing each intermediate result.
  • The fix for an unreliable chain is usually fewer, better-defined links rather than more structure.
