This is the story of one prompt and the evaluation that saved it from a quiet failure in production. The numbers are illustrative, chosen to show the shape of the work rather than to report a specific company's results. What is real is the arc: a prompt that looked finished, an evaluation that proved it was not, and the methodical path to a version the team could defend.
The setup is a common one. A small product team needed to generate marketing descriptions for thousands of catalog items. The prompt they had written produced lovely copy for the handful of products they tried, and the instinct was to ship it and move on. One skeptical engineer asked for an evaluation first. That request is where the story begins.
The Situation: A Prompt That Demoed Well
The prompt took a product's structured attributes — name, category, key features, price tier — and produced a two-sentence description. In the demo it was genuinely good: fluent, on-brand, and accurate for the three products tested.
The temptation to declare victory was strong. But the catalog held thousands of products spanning dozens of categories, many with sparse or messy attributes. Three good outputs said nothing about how the prompt would handle that variety, and a bad description on a live product page carries a small but real reputational cost, one that multiplies across thousands of pages.
The Decision: Build a Real Test Set
Instead of shipping, the team assembled a test set of 50 products sampled to mirror the catalog: popular categories, obscure ones, products with rich attributes, and products with almost none. They wrote success criteria first — two sentences, mention the category, name at least one real feature, no invented features, no superlatives the data did not support.
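One way to draw a sample like this is to stratify: group the catalog by category and attribute richness, give every group at least one slot, and fill the rest at random. The sketch below assumes each product is a dict with `category` and `features` keys and treats fewer than two features as sparse. The function names, field names, and threshold are illustrative assumptions, not the team's actual tooling.

```python
import random
from collections import defaultdict


def stratum_of(product):
    # Hypothetical rule: fewer than two listed features counts as "sparse".
    richness = "sparse" if len(product["features"]) < 2 else "rich"
    return (product["category"], richness)


def stratified_sample(products, total, seed=0):
    """Return roughly `total` products with every stratum represented."""
    rng = random.Random(seed)  # fixed seed so the test set is reproducible
    groups = defaultdict(list)
    for product in products:
        groups[stratum_of(product)].append(product)

    sample, leftovers = [], []
    for key in sorted(groups):
        items = list(groups[key])
        rng.shuffle(items)
        sample.append(items[0])      # one guaranteed slot per stratum
        leftovers.extend(items[1:])

    # Fill the remaining slots uniformly at random, so common strata
    # receive proportionally more of them.
    rng.shuffle(leftovers)
    sample.extend(leftovers[: max(0, total - len(sample))])
    return sample
```

If the catalog has more strata than `total`, this version returns one item per stratum and therefore more than `total` items. That trade favors coverage of obscure categories over an exact sample size.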
Writing the no-invented-features rule turned out to be the most consequential decision. It named the failure mode that mattered most: a description that sounded great but claimed something untrue.
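Criteria like these can be turned into mechanical checks, at least approximately. The sketch below is one hypothetical way to do it. `SUPERLATIVES` and `feature_vocabulary` are assumed inputs that a team would have to curate, and the superlative check simply bans everything on the list instead of judging whether the data supports it. The invented-feature check is a crude proxy: it flags any phrase from a catalog-wide feature vocabulary that appears in the description but not in the product's own attributes, so it misses fabrications outside that vocabulary and would still need human or model-based review behind it.

```python
import re

# Hypothetical list; a real one would be longer and maintained over time.
SUPERLATIVES = ("best-in-class", "world-class", "unbeatable", "the best")


def count_sentences(text):
    # Crude splitter: breaks after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def check_description(product, description, feature_vocabulary):
    """Return a dict mapping each criterion name to True (pass) or False."""
    text = description.lower()
    own = {f.lower() for f in product["features"]}
    invented = [
        f for f in feature_vocabulary
        if f.lower() in text and f.lower() not in own
    ]
    return {
        "two_sentences": count_sentences(description) == 2,
        "mentions_category": product["category"].lower() in text,
        "names_real_feature": any(f in text for f in own),
        "no_invented_features": not invented,
        "no_unsupported_superlatives": not any(s in text for s in SUPERLATIVES),
    }
```

An output passes overall only when every value in the returned dict is true. Keeping the per-criterion results makes it possible to see which rule failed instead of just that something did.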
For why criteria-first discipline matters, see Evaluating Prompt Quality: Best Practices That Actually Work.
The Execution: Score, Diagnose, Iterate
The team ran the prompt against all 50 products, scoring each output pass or fail against the criteria, and ran the trickier products three times to check consistency.
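A scoring loop of this kind might look like the sketch below. Here `generate` stands in for whatever calls the model and `passes` for the criteria check; both are placeholders supplied by the caller, and the `id` field is an assumed product key. The rule that a repeated product passes only if every run passes is also an assumption, chosen because an output that fails one time in three is not reliable.

```python
def evaluate(products, generate, passes, tricky_ids=frozenset(), repeats=3):
    """Score each product; run the tricky ones several times."""
    report = {}
    for product in products:
        runs = repeats if product["id"] in tricky_ids else 1
        outcomes = [passes(product, generate(product)) for _ in range(runs)]
        report[product["id"]] = {
            "outcomes": outcomes,
            "passed": all(outcomes),                # every run must pass
            "consistent": len(set(outcomes)) == 1,  # same verdict each run
        }
    return report


def pass_rate(report):
    """Fraction of products whose runs all passed."""
    return sum(entry["passed"] for entry in report.values()) / len(report)
```

A product marked inconsistent deserves a closer look even when most of its runs pass, since the same input is producing different verdicts.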
The First Result
The pass rate came in at 71 percent. The failures clustered cleanly:
- Products with sparse attributes triggered invented features — the prompt filled gaps with plausible fiction.
- A handful of outputs ran to three or four sentences, breaking the format rule.
- Several leaned on unsupported superlatives like best-in-class.
A 71 percent pass rate on a customer-facing task was nowhere near shippable, and the demo had hidden every one of these problems.
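Seeing that failures cluster requires counting them by criterion and by input type, not just reading a single pass rate. A minimal sketch, assuming each scored output is recorded as a `(segment, checks)` pair where `segment` is a label such as "sparse" or "rich" and `checks` maps criterion names to pass or fail (both are hypothetical conventions):

```python
from collections import Counter


def failure_breakdown(records):
    """Count failures per (segment, criterion) pair."""
    counts = Counter()
    for segment, checks in records:
        for criterion, ok in checks.items():
            if not ok:
                counts[(segment, criterion)] += 1
    return counts
```

Calling `most_common()` on the result lists the largest clusters first, which is a reasonable order in which to attack them.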
The Iterations
The team made targeted changes one at a time. First, they added an explicit instruction to describe only the attributes provided and to write a shorter description when data was sparse. Re-running the full set lifted the pass rate and eliminated most invented features. Second, they added a hard two-sentence constraint, which fixed the length failures. Third, they added an instruction banning unsupported superlatives, which cleared the remaining cases.
Crucially, they re-ran all 50 products after every change to catch regressions. The habit paid off: one early fix briefly broke two previously passing cases, which the team caught and corrected.
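Catching a regression means comparing two full runs case by case, because an unchanged or even improved pass rate can hide a case that flipped from pass to fail. A minimal sketch, assuming each run is stored as a dict from product id to a pass or fail boolean (the storage format is an assumption):

```python
def compare_runs(before, after):
    """Diff two full-set runs keyed by product id (True means pass)."""
    if before.keys() != after.keys():
        raise ValueError("both runs must cover the same test set")
    return {
        # Passed before the change, fails after it.
        "regressions": sorted(p for p in before if before[p] and not after[p]),
        # Failed before the change, passes after it.
        "fixes": sorted(p for p in before if not before[p] and after[p]),
        "passed_before": sum(before.values()),
        "passed_after": sum(after.values()),
    }
```

A change that shows two fixes and one regression raises the pass count, yet the regression list still names the case that needs attention.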
The Outcome and Lessons
After three iterations the pass rate reached a level the team judged acceptable for the task, with the dangerous invented-feature failures driven to near zero. They documented the test set, the final pass rate, and the decision, then shipped with a plan to fold real production failures back into the test set over time.
The lessons generalize well beyond product descriptions. A demo measures charm, not reliability. The failure that matters most — here, fabrication — is the one worth naming in your criteria up front. And re-running the whole set after each change is the only way to know you are improving rather than trading one failure for another.
There is a subtler lesson too, about the value of the skeptical engineer who asked for an evaluation in the first place. Without that single request, the prompt would have shipped on the strength of three good demo outputs, and the fabrication problem would have surfaced later as customer complaints about products that did not match their descriptions. The cost of catching it before launch was a few hours of one person's time. The cost of catching it after launch would have been support tickets, eroded trust, and a frantic rollback. Evaluation is cheapest exactly when it feels least necessary, which is precisely when the prompt looks finished.
That asymmetry is worth internalizing. The pressure to skip evaluation is always highest when a prompt demos well, because the work feels redundant. But a demo that looks finished is the most dangerous kind, since it invites confidence the evidence has not earned. The teams that avoid expensive production failures are the ones that treat a polished demo as a reason to evaluate, not a reason to ship.
To run this same arc yourself, follow A Step-by-Step Approach to Evaluating Prompt Quality, and to keep each pass honest, use The Evaluating Prompt Quality Checklist for 2026.
What Happened After Launch
The story did not end at deployment, and the most instructive part came later. The team had committed to folding production failures back into their test set, and within the first few weeks live traffic surfaced a category the original sample had underweighted: bundled products that combined items from two categories. The prompt, tuned on single-category items, produced awkward descriptions for these.
Because the team had a test set and a process rather than a one-time evaluation, handling this was routine rather than a fire drill. They added a batch of bundled products to the set, reproduced the failure offline, made a targeted change, and confirmed the fix without breaking the cases that already worked. The evaluation had become a living asset that absorbed new failure modes as the world revealed them.
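Folding production failures back into the set can be as simple as appending the new cases with a label recording where they came from. The sketch below assumes test cases are dicts with a unique `id`; the `source` tag and the function name are hypothetical.

```python
def fold_in(test_set, new_cases, source="production"):
    """Return a new test set with unseen cases appended and tagged."""
    known = {case["id"] for case in test_set}
    merged = list(test_set)  # leave the original list untouched
    for case in new_cases:
        if case["id"] not in known:
            merged.append({**case, "source": source})
            known.add(case["id"])
    return merged
```

Tagging the origin keeps it possible to report pass rates for the original sample and the production-derived cases separately.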
This is the quiet payoff of treating evaluation as an ongoing practice. The first evaluation prevented an embarrassing launch. The standing process turned every subsequent surprise into a manageable, repeatable correction instead of a scramble. For the distinction between offline and production evaluation that makes this loop work, see What Separates a Reliable Prompt From a Lucky One.
Frequently Asked Questions
Why was the demo so misleading in this case?
The demo used three products with rich, clean attributes, which is the easy case. The catalog's real difficulty lived in sparse and unusual products, none of which the demo touched. A demo naturally showcases favorable inputs, so it measures the prompt's best behavior rather than its reliability across the full range it will actually face.
What made the invented-features failure the priority?
It was both common on sparse inputs and genuinely harmful: a description claiming a feature the product lacks is a false statement on a live page. The team named it explicitly in their success criteria, which meant every evaluation tracked it directly and the iterations could target it rather than chasing cosmetic issues.
Why re-run all 50 products after every change instead of just the failures?
Because fixes can cause regressions. In this case one early change broke two previously passing products, which only surfaced because the team re-ran everything. Comparing full-set results before and after each change, case by case as well as in aggregate, is the only reliable way to confirm net improvement rather than a sideways trade.
Was a perfect pass rate the goal?
No. The team set an acceptable floor for a customer-facing task and aimed to drive the dangerous failures to near zero, rather than chasing 100 percent on every cosmetic criterion. Defining that floor before evaluating kept the decision grounded in the cost of mistakes instead of in a desire for a perfect number.
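A floor of this kind can be written down as an explicit rule before any results come in. In the sketch below the floor is passed in by the caller, since no specific number is given here, and the dangerous criterion is capped separately from the overall pass rate. The parameter names and the default criterion name are assumptions.

```python
def ship_decision(results, floor, critical="no_invented_features", max_critical=0):
    """Apply a pre-agreed floor plus a cap on the critical failure."""
    if not results:
        raise ValueError("no results to judge")
    total = len(results)
    passed = sum(all(checks.values()) for checks in results.values())
    critical_failures = sum(not checks[critical] for checks in results.values())
    rate = passed / total
    return {
        "pass_rate": rate,
        "critical_failures": critical_failures,
        # Ship only if both conditions hold.
        "ship": rate >= floor and critical_failures <= max_critical,
    }
```

Separating the two conditions means a high overall pass rate cannot mask even a small number of fabricated features.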
Key Takeaways
- A demo showcases favorable inputs and measures charm, not reliability across the real distribution.
- A representative test set sampled to mirror production exposes the failures that matter.
- Name the most dangerous failure mode in your success criteria so every evaluation tracks it.
- Make targeted changes one at a time and re-run the whole set to catch regressions.
- Set an acceptable quality floor before evaluating, then document the test set, pass rate, and decision.