From 11 Brittle Links Down to 4 Reliable Ones

A product team set out to automate one of their most tedious workflows: turning recorded customer interviews into structured, actionable insight records. The transcripts were long, the formats varied, and a human analyst was spending hours per interview pulling out themes, quotes, and recommended actions. Automating it with a prompt chain seemed obvious. Their first attempt nearly convinced them chaining did not work.

This is the story of that first attempt, why it failed, the decision to rebuild it differently, and what the rebuilt chain delivered. The numbers in this account are illustrative of the shape of the outcome rather than measurements from a specific deployment, but the arc (over-decomposition followed by disciplined redesign) is one almost every team repeats.

The lesson is not that prompt chaining is hard. It is that the instinct to split a task into as many pieces as possible is exactly backward, and that the fix is usually fewer, better-defined links.

The Situation

The analyst workflow had a clear shape, which is why it looked so chainable.

The Manual Process

For each interview, the analyst would read the transcript, identify recurring themes, pull representative quotes, judge sentiment, map themes to the product roadmap, and write a short recommendation. Six distinct mental steps, each one a candidate for its own link.

The First Design

The team built a chain that mirrored the mental process exactly, and then some. Counting setup and formatting steps, it ran eleven links deep. Each link was reasonable on its own. In isolated testing, every link passed.

The Decision

When the chain went live, the insight records were unreliable. Themes were sometimes missing, quotes occasionally did not match the themes, and the recommendations read as generic.

Diagnosing the Failure

The team did the math that the Prompt Chaining: Best Practices That Actually Work guide recommends. Eleven links, each around 92 percent reliable, multiply to roughly 40 percent end-to-end reliability. The chain was not broken at any one point; it was broken everywhere a little, and the errors compounded. Worse, with no per-link logging, they could not see which link started the cascade.
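
The arithmetic behind that diagnosis is short enough to check directly. The sketch below assumes every link succeeds independently and with the same probability, which is a simplification of any real chain. The function name chain_reliability is illustrative and does not come from the guide.

```python
def chain_reliability(per_link: float, links: int) -> float:
    """End-to-end success rate, assuming independent links of equal reliability."""
    if not 0.0 <= per_link <= 1.0:
        raise ValueError("per_link must be between 0 and 1")
    if links < 0:
        raise ValueError("links must not be negative")
    # Every link has to succeed, so the probabilities multiply.
    return per_link ** links


if __name__ == "__main__":
    for n in (1, 4, 8, 11):
        print(f"{n:>2} links at 92%: {chain_reliability(0.92, n):.1%}")
```

Running it for a few chain lengths shows how quickly a per-link rate that looks healthy in isolated testing erodes as links are added.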

Choosing to Rebuild

Rather than patch individual links, they decided to redesign from the task down. The guiding question changed from "what are all the steps?" to "what is the fewest number of links where each is independently reliable?" This reframing came straight from A Framework for Prompt Chaining.

The Execution

The rebuild collapsed eleven links into four, each with a strict contract and validation.

The New Link Structure

Link one extracted themes and their supporting quotes together, returning a JSON array with each theme tied to its quotes.
Link two validated and scored each theme's relevance to the roadmap.
Link three wrote one recommendation per high-relevance theme.
Link four assembled the final structured record.
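
As a rough sketch, the four links above could be wired together as plain functions. Everything named here is an assumption made for illustration: ModelFn stands in for whatever model client the team used, the prompts are placeholders, the 0.7 relevance threshold is invented, and link four is shown as ordinary code with no model call, which the account does not specify. Link two's validation is omitted to keep the sketch short.

```python
import json
from typing import Callable

# Hypothetical interface: takes a prompt string, returns the model's text reply.
ModelFn = Callable[[str], str]


def extract_themes(transcript: str, model: ModelFn) -> list[dict]:
    """Link 1: themes and their supporting quotes, produced together."""
    prompt = (
        "Return a JSON array of objects with keys 'theme' and 'quotes' "
        "for this interview transcript:\n" + transcript
    )
    return json.loads(model(prompt))


def score_relevance(themes: list[dict], model: ModelFn) -> list[dict]:
    """Link 2: attach a roadmap relevance score between 0 and 1 to each theme."""
    scored = []
    for item in themes:
        prompt = "Rate roadmap relevance from 0 to 1 for theme: " + item["theme"]
        scored.append({**item, "relevance": float(model(prompt))})
    return scored


def recommend(scored: list[dict], model: ModelFn, threshold: float = 0.7) -> list[dict]:
    """Link 3: one recommendation per theme at or above the threshold."""
    results = []
    for item in scored:
        if item["relevance"] >= threshold:
            prompt = "Write one recommendation for theme: " + item["theme"]
            results.append({**item, "recommendation": model(prompt).strip()})
    return results


def assemble(interview_id: str, items: list[dict]) -> dict:
    """Link 4: build the final structured record (assumed to need no model)."""
    return {"interview_id": interview_id, "insights": items}


def run_chain(interview_id: str, transcript: str, model: ModelFn) -> dict:
    themes = extract_themes(transcript, model)
    scored = score_relevance(themes, model)
    recommended = recommend(scored, model)
    return assemble(interview_id, recommended)
```

Passing the model in as a parameter keeps each link testable with a stub, so the chain can be exercised without calling a real model.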

What Changed Beyond the Count

Merging theme and quote extraction into one link solved the mismatch problem, because the quotes were now produced alongside the themes they supported rather than in a separate step that had to re-find them. The team added logging to every link and a validation check after the extraction step, so a malformed result stopped the chain instead of poisoning it. This validation discipline is covered in A Step-by-Step Approach to Prompt Chaining.
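
A validation check of this kind might look like the sketch below. The contract it enforces is an assumption: a non-empty JSON array of objects, each with a 'theme' string and a non-empty 'quotes' list. The rule that every quote must appear verbatim in the transcript is an extra illustrative check, not something the team is reported to have used. ContractError and validate_extraction are hypothetical names.

```python
import json


class ContractError(ValueError):
    """Raised when a link's output violates its contract; the chain should stop."""


def validate_extraction(raw: str, transcript: str) -> list[dict]:
    """Parse the extraction link's output and check it against the assumed contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"output is not valid JSON: {exc}") from exc

    if not isinstance(data, list) or not data:
        raise ContractError("expected a non-empty JSON array of themes")

    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ContractError(f"item {i} is not an object")
        theme = item.get("theme")
        quotes = item.get("quotes")
        if not isinstance(theme, str) or not theme.strip():
            raise ContractError(f"item {i} has no theme text")
        if not isinstance(quotes, list) or not quotes:
            raise ContractError(f"item {i} has no supporting quotes")
        for quote in quotes:
            # Assumed rule: a quote must be copied verbatim from the transcript,
            # which catches paraphrased or invented quotes.
            if not isinstance(quote, str) or quote not in transcript:
                raise ContractError(f"item {i} quote not found in transcript: {quote!r}")
    return data
```

Because the function raises instead of returning a partial result, a malformed extraction stops the run at the point of failure, and nothing downstream ever sees the bad data.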

The Outcome

The four-link chain performed dramatically better than its eleven-link predecessor.

Measurable Improvements

End-to-end reliability rose from roughly 40 percent to the high 80s, consistent with four links each at roughly 97 percent (four links at exactly 95 percent would compound to only about 81 percent).
Quote-to-theme mismatches, the most visible defect, effectively disappeared because the two were extracted together.
When failures did occur, per-link logging let the team identify the responsible link in minutes instead of hours.
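
The same multiplication can be run in reverse to ask what each link must achieve for a given end-to-end target. This is a minimal sketch under the same simplifying assumption of independent, equally reliable links. The function name required_per_link and the 0.88 target standing in for "the high 80s" are illustrative.

```python
def required_per_link(target: float, links: int) -> float:
    """Per-link reliability needed to reach an end-to-end target.

    Assumes independent links of equal reliability, so target = per_link ** links.
    """
    if links < 1:
        raise ValueError("links must be at least 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return target ** (1.0 / links)


if __name__ == "__main__":
    for n in (4, 11):
        print(f"{n:>2} links need {required_per_link(0.88, n):.1%} each for 88% end to end")
```

With four links the bar is roughly 97 percent per link. With eleven links it approaches 99 percent, which is one way to see why shortening the chain was a more realistic path than polishing every link.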

The Broader Lesson

The team's takeaway was counterintuitive. The fix for an unreliable chain was not more structure but less, paired with stricter contracts and real observability. The full set of patterns they ended up following is captured in the Prompt Chaining Checklist for 2026.

What the Team Would Do Differently

Asked what they would change if they started over, the team named four things, each one a lesson worth borrowing before you build your own chain.

Start With the Outcome, Not the Process

Their first design copied the analyst's mental steps one for one. In hindsight, the right starting point was the structured record they wanted at the end, working backward to the fewest links that could produce it. Mirroring a human process produces too many links because humans break work into small mental moves that a model can often handle in one pass. The lesson maps directly to the design discipline in A Framework for Prompt Chaining.

Add Observability on Day One

The most painful part of the first attempt was not that it failed but that they could not see why. Logging every link from the start would have turned a multi-day investigation into a quick diagnosis. They now treat per-link logging as the first thing they build, not the last.

Validate the Foundation Link Hardest

Because the extraction link fed everything downstream, its errors did the most damage. In the rebuild, that link got the strictest contract and the most thorough validation. The team learned to spend their validation budget where the blast radius is largest, on the early, foundational links. A small investment in the first link's reliability paid off many times over because every downstream link inherited its quality.

Treat the Rebuild as the Real Design

In hindsight, the team came to see the eleven-link version not as a failure but as an expensive way to learn the requirements. Had they prototyped on a handful of inputs and measured before scaling, they would have reached the four-link design far sooner. The lesson they pass along is to measure early and often, treating the first build as a draft rather than a commitment.
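
Measuring on a handful of inputs can be as simple as running each sample through the links and counting where failures first appear. The harness below is a hypothetical sketch. It assumes each link is a function of one argument that raises an exception when its output is unusable, for instance because a validation step rejected it.

```python
from collections import Counter
from typing import Any, Callable, Iterable

# A link is a (name, function) pair; each function takes the previous output.
Link = tuple[str, Callable[[Any], Any]]


def measure(links: list[Link], samples: Iterable[Any]) -> dict:
    """Run every sample through the links in order and record where failures start."""
    first_failure: Counter = Counter()
    total = 0
    passed = 0
    for sample in samples:
        total += 1
        value = sample
        for name, fn in links:
            try:
                value = fn(value)
            except Exception:
                # Only the first failing link is counted for each sample.
                first_failure[name] += 1
                break
        else:
            # The loop finished without a break, so every link succeeded.
            passed += 1
    return {
        "total": total,
        "end_to_end": passed / total if total else 0.0,
        "first_failure_by_link": dict(first_failure),
    }
```

Even a few dozen samples are enough to show whether failures cluster in one link or are spread thinly across many. That spread was the signature of the eleven-link design.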

How to Apply This to Your Own Work

The arc here (over-decompose, measure, rebuild shorter) is common enough that it is worth getting ahead of. Before you build, ask whether your link plan is mirroring a human process. If it is, push for the fewest links that produce the outcome. Add logging before you need it, and put your strictest validation on the earliest links. Doing this from the start lets you skip the painful first version entirely. The concrete procedure for building the rebuilt kind of chain is laid out in A Step-by-Step Approach to Prompt Chaining.

Frequently Asked Questions

Why did the eleven-link chain fail when each link tested fine in isolation?

Because reliability multiplies across links. Eleven links at around 92 percent each compound to roughly 40 percent end to end. The chain failed not at one point but a little everywhere, and the errors accumulated.

Why did merging links improve quote accuracy?

Extracting themes and their quotes together meant each quote was produced alongside the theme it supported. The original design extracted them in separate links, forcing a later step to re-match quotes to themes, which it sometimes got wrong.

How did logging change the team's ability to fix problems?

Per-link logging let them see each intermediate output, so when a failure occurred they could identify the responsible link in minutes. Without it, a wrong final result gave them nowhere to look.

Is fewer links always better?

Fewer is better only up to the point where each link stays independently reliable. The goal is the smallest number of links that reliably do the job, not the smallest number possible. Merging steps that the model handles reliably in a single pass is the right move; merging steps so that the combined link becomes unreliable is not.

What was the single most important change in the rebuild?

Reframing the design question from listing every step to finding the fewest reliable links. That shift drove the collapse from eleven links to four and, combined with contracts and logging, produced the reliability gain.

Key Takeaways

Mirroring a manual process step for step often produces too many links and poor reliability.
Reliability multiplies across links, so many decent links can compound into a poor end-to-end result.
Merging steps that belong together, like themes and their quotes, removes whole classes of mismatch errors.
Strict contracts plus validation after key links stop malformed output from propagating.
Per-link logging turns hours of debugging into minutes by exposing each intermediate result.
The fix for an unreliable chain is usually fewer, better-defined links rather than more structure.

Agency Script Editorial
