A mid-sized software company had wired a language model into its support workflow to draft first-response replies for agents to review. The pilot looked great. The drafts were fluent, empathetic, and often nearly send-ready. Then the team tried to route the same drafts through automation, and the cracks appeared.
This is the story of how that team moved from free-form generation to constraint-based output prompting, the specific decisions they made along the way, and what the shift bought them. The point is not the exact numbers, which are particular to their setup, but the shape of the journey, because it repeats across teams that start with impressive demos and hit the same wall.
The wall, in short, is that fluent output and reliable output are different problems. A model can be excellent at the first and unreliable at the second, and a pilot that only measures readability will never reveal the gap. This team learned that the hard way, and the sequence of corrections they ran is the most useful part of the story, because each correction maps to a failure mode any team can hit.
The Situation
A demo that did not survive contact
The original prompt asked the model to "write a helpful, friendly first response." For human review, that was fine. But the company wanted each draft to arrive with a category, a suggested priority, and a draft body attached so a downstream system could pre-sort the queue.
Where it broke
The model embedded the category inside the prose, phrased priority inconsistently ("high," "urgent," "P1"), and sometimes opened with a preamble the parser could not skip. Roughly one draft in six failed automated handling, which meant a human had to intervene exactly where automation was supposed to help.
The Decision
Reframing the output as a contract
The team stopped thinking of the output as a message and started thinking of it as a structured record with a message field inside it. That reframing is the heart of constraint-based prompting and connects directly to the decision model laid out in A Decision System for Shaping Model Output.
Choosing what to lock and what to leave loose
They locked the envelope hard: fixed JSON keys, a closed set of priority values, an enumerated category list. They left the message body loose, because that is where the model's fluency was a genuine asset. This split mirrors the advice in Opinionated Rules for Shaping Reliable Model Output.
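A minimal sketch of what a locked envelope with a loose body could look like in Python. The key names ("category", "priority", "body") and every enum value below are illustrative assumptions; the source does not list the team's actual keys, priorities, or categories.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):  # closed set; these values are assumed for illustration
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Category(Enum):  # enumerated list; these values are assumed for illustration
    BILLING = "billing"
    LOGIN = "login"
    BUG = "bug"


@dataclass(frozen=True)
class DraftReply:
    category: Category
    priority: Priority
    body: str  # the loose field: free prose, only required to be non-empty


def parse_draft(raw: str) -> DraftReply:
    """Parse model output into the contract, raising ValueError on any violation."""
    record = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(record, dict) or set(record) != {"category", "priority", "body"}:
        raise ValueError("envelope keys do not match the fixed set")
    body = record["body"]
    if not isinstance(body, str) or not body.strip():
        raise ValueError("body must be non-empty text")
    # Enum lookup by value raises ValueError for anything outside the closed set,
    # so "urgent" or "P1" is rejected rather than silently accepted.
    return DraftReply(Category(record["category"]), Priority(record["priority"]), body)
```

Everything about the envelope is checked strictly, while the body is only required to exist. That asymmetry is the point of the split: the parser never judges the prose, only the container around it.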
The Execution
Building the harness first
Before rewriting the prompt, they assembled a set of 200 real, messy tickets, including empty ones and angry ones, and defined pass criteria: valid JSON, priority from the allowed set, category from the list, and a non-empty body. Only then did they tune the prompt against that set.
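The four pass criteria translate almost directly into a harness. The sketch below is one possible shape, not the team's implementation: the allowed values are placeholders, and `draft_fn` is a hypothetical stand-in for whatever function sends a ticket to the model and returns its raw output.

```python
import json
from collections import Counter
from typing import Callable, Optional

# Placeholder allowed values; the team's real lists are not given in the source.
ALLOWED_PRIORITIES = ("low", "medium", "high")
ALLOWED_CATEGORIES = ("billing", "login", "bug")


def first_failure(raw: str) -> Optional[str]:
    """Return the name of the first pass criterion the output violates, or None."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return "valid_json_object"
    if not isinstance(record, dict):
        return "valid_json_object"
    if record.get("priority") not in ALLOWED_PRIORITIES:
        return "priority_in_set"
    if record.get("category") not in ALLOWED_CATEGORIES:
        return "category_in_list"
    body = record.get("body")
    if not isinstance(body, str) or not body.strip():
        return "non_empty_body"
    return None


def run_harness(tickets: list[str], draft_fn: Callable[[str], str]) -> tuple[float, Counter]:
    """Run draft_fn over every ticket; return the pass rate and per-criterion failure counts."""
    if not tickets:
        raise ValueError("the ticket set must not be empty")
    failures: Counter = Counter()
    for ticket in tickets:
        reason = first_failure(draft_fn(ticket))
        if reason is not None:
            failures[reason] += 1
    passed = len(tickets) - sum(failures.values())
    return passed / len(tickets), failures
```

Counting failures per criterion, rather than reporting a single pass rate, is a design choice that makes clusters visible: a spike in one criterion points at one part of the prompt.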
Iterating on the constraints
The first constrained prompt passed about 88 percent of the test set. The failures clustered on long, multi-issue tickets, where the format instruction at the top of the prompt had lost weight against the intervening content. They restated the format rule immediately before the output marker and added an explicit "output only the object" exclusion. The pass rate climbed into the high nineties.
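As a rough illustration of that fix, the prompt can be assembled so the format rule appears both before the ticket and again directly before the output marker. All wording here, including the rule text, the "OUTPUT:" marker, and the function name, is assumed for the sketch; the source does not quote the team's prompt.

```python
# Hypothetical rule text and marker, chosen only to illustrate placement.
FORMAT_RULE = (
    "Respond with a single JSON object with exactly the keys "
    '"category", "priority", and "body".'
)
EXCLUSION_RULE = "Output only the object: no preamble, no commentary."
OUTPUT_MARKER = "OUTPUT:"


def build_prompt(ticket_text: str) -> str:
    """Assemble a prompt that states the format rule early and restates it before the output marker."""
    sections = [
        "You draft first-response replies for support tickets.",
        FORMAT_RULE,
        "TICKET:\n" + ticket_text,
        # Restated after the ticket so a long, multi-issue ticket cannot
        # push the rule far away from the point where generation starts.
        "Reminder: " + FORMAT_RULE + " " + EXCLUSION_RULE,
        OUTPUT_MARKER,
    ]
    return "\n\n".join(sections)
```

However long the ticket is, the last thing before the output marker is the format rule plus the exclusion, which is the placement the team found restored compliance on long inputs.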
Guarding against over-constraint
When they briefly added a forced body length, quality dropped and agents complained the drafts felt clipped. They removed it. This is the over-constraint failure described in Seven Ways Output Constraints Quietly Break Your Prompts, caught early because the harness measured body usefulness, not just format validity.
The Outcome
What changed
Automated handling went from failing one draft in six to failing roughly one in forty. The queue pre-sorting that had been theoretical became real, and agents spent their first minutes on judgment rather than reformatting.
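For a sense of scale, the two approximate ratios can be converted into rates. The snippet below only restates the figures above as arithmetic; since the source gives both as rough ratios, the results are rough too.

```python
from fractions import Fraction

before = Fraction(1, 6)   # roughly one draft in six failed automated handling
after = Fraction(1, 40)   # roughly one draft in forty failed afterwards

reduction = 1 - after / before  # share of the original failures that disappeared

print(f"failure rate before: {float(before):.1%}")     # 16.7%
print(f"failure rate after:  {float(after):.1%}")      # 2.5%
print(f"relative reduction:  {float(reduction):.0%}")  # 85%
```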
What the team learned
The fluent demo had hidden the real work. The value was not better prose; it was prose delivered inside a contract the rest of the system could trust. The metrics that made the improvement visible are the kind described in Reading the Signal: What to Track When Outputs Must Conform.
The Lessons That Transferred
A demo measures the wrong thing
The most expensive lesson was cheap to state: the pilot measured readability, but production needed parseability, and those are different bars. Any team running a constrained-output pilot should ask, before celebrating, whether they are measuring what the downstream system actually requires. The team now writes pass criteria from the consumer's perspective before building any prompt, a habit drawn from the best-practice rules.
The harness was the real product
In hindsight, the prompt was almost incidental. The durable asset the team built was the test harness of 200 messy tickets and its pass criteria. When the model provider shipped an update months later, the harness caught a small regression in minutes, and the team adjusted the prompt with confidence rather than guessing. Without the harness, that regression would have surfaced as a slow trickle of support escalations.
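One way such a regression check might work is to keep the per-ticket results from a known-good run and diff them against a run on the updated model. This is a sketch under assumptions: the function names and the ticket-ID-to-pass/fail mapping are hypothetical, and the source does not describe how the team stored its results.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tickets that passed in a harness run."""
    return sum(results.values()) / len(results)


def newly_failing(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return ticket IDs that passed in the baseline run but fail (or are missing) in the current run."""
    return sorted(
        ticket_id
        for ticket_id, passed_before in baseline.items()
        if passed_before and not current.get(ticket_id, False)
    )


# Toy example with four tickets; real runs would cover the full test set.
baseline = {"t1": True, "t2": True, "t3": False, "t4": True}
after_update = {"t1": True, "t2": False, "t3": False, "t4": True}

print(pass_rate(baseline), pass_rate(after_update))  # 0.75 0.5
print(newly_failing(baseline, after_update))         # ['t2']
```

The list of newly failing tickets is more useful than the drop in pass rate alone, because it points at the specific inputs to inspect when adjusting the prompt.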
Constraints needed pruning, not just adding
The team's instinct under pressure was to add constraints. The forced-length episode taught them the opposite reflex: when quality drops, look for a constraint to remove, not one to add. They now audit the prompt quarterly, removing any rule that no longer prevents a measured failure, which keeps the prompt lean and the content strong.
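The source does not say how the team decides whether a rule still "prevents a measured failure." One plausible approach, shown here purely as an assumption, is a leave-one-out pass: drop each rule in turn, re-run the harness, and flag rules whose removal does not lower the score. The `score` callable is a hypothetical hook that builds a prompt from the given rules and returns the harness result.

```python
from typing import Callable


def removable_rules(
    rules: list[str],
    score: Callable[[list[str]], float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return rules whose removal does not lower the harness score by more than tolerance."""
    baseline = score(rules)
    candidates = []
    for i, rule in enumerate(rules):
        # Leave-one-out: score the prompt with every rule except this one.
        without = rules[:i] + rules[i + 1:]
        if score(without) >= baseline - tolerance:
            candidates.append(rule)
    return candidates
```

Because model output varies from run to run, a real audit would likely need a nonzero tolerance or repeated runs, and the flagged rules are candidates for review rather than automatic deletions. To catch a case like the forced-length episode, the score would also need to include body usefulness, not only format validity.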
How the Approach Spread
From one prompt to a practice
What started as a fix for the support reply prompt became the team's default way of building any model-backed feature. The next project, a prompt that turned sales call notes into structured CRM records, began with the harness and the contract rather than ending with them. The second project shipped in a fraction of the time because the team was no longer learning the lessons; they were applying them.
A shared vocabulary changed the reviews
Once the team had names for the stages and the failure modes, code reviews of prompts got sharper. A reviewer could say "the envelope is fine but you have no exclusion rule and no escape hatch for the no-match case," and everyone knew exactly what that meant. The shared vocabulary, drawn from the kind of framework and checklist any team can adopt, turned prompt review from a matter of taste into a set of concrete checks.
The payoff was confidence, not just pass rate
The number that improved was the automated handling rate, but the deeper payoff was that the team stopped being afraid of model updates. With a harness in place, an update was a thing to measure rather than a thing to dread, and that confidence let them adopt new model versions quickly instead of clinging to a known-good one out of fear.
Frequently Asked Questions
Why did the unconstrained prompt pass the pilot but fail in production?
The pilot judged drafts by human readability, which the model did well. Production judged them by machine parseability, a different and stricter bar that the original prompt never targeted.
What was the single most effective change?
Reframing the output as a structured record with the message as one field, rather than a message with metadata sprinkled in. Every other improvement followed from that reframing.
Why build the test set from messy real tickets instead of clean examples?
Because production is messy. Clean examples would have produced a prompt that passed review and failed on the empty, angry, and multi-issue tickets that make up the long tail.
How did restating the format near the output help?
In long, multi-issue tickets the top-of-prompt format rule lost weight against all the intervening content. Repeating it just before generation restored compliance on exactly the cases that had been failing.
Why remove the body length constraint?
It degraded draft quality without solving a real problem. The harness measured body usefulness, so the regression was visible immediately, and the constraint was not worth the cost.
Is this approach specific to support workflows?
No. Any workflow that feeds model output into another system benefits from treating the output as a contract. Support tickets are just a clear illustration of a general pattern.
Key Takeaways
- Fluent demo output can hide that the output is not reliably machine-consumable.
- Reframing output as a structured record, with prose as one field, was the pivotal decision.
- Lock the envelope tightly and leave genuinely creative fields loose.
- Build the test harness from messy real data before tuning the prompt.
- Restating format constraints near the output fixed failures on long inputs.
- Measuring body usefulness, not just format validity, caught over-constraint early.