A Prompt That Kept Ignoring Its Rules, and the Fix

A mid-sized product team had shipped an AI support assistant that worked well in the demo and steadily worse in production. Three months in, the complaints had a pattern: the assistant gave inconsistent answers, occasionally ignored its escalation policy, and sometimes adopted a tone wildly off from the brand. Nobody had changed the prompt. The team assumed the model had regressed.

This case study follows the full arc of what they found: the situation they inherited, the decision to treat it as an instruction-hierarchy problem rather than a model problem, how they executed the rewrite, the measurable outcome, and the lessons that changed how they wrote prompts afterward.

The names and specifics are generalized, but the diagnostic path and the fixes are exactly the kind we see resolve these issues in practice.

The Situation

The assistant's prompt had grown to roughly forty instructions over three months. Each had been added to fix a specific complaint: a tone rule here, a length rule there, an escalation policy, a list of forbidden topics, formatting preferences, and a set of few-shot examples that predated half the rules.

The Symptoms

Support leads reported three recurring problems. Answers varied in quality from one session to the next on identical questions. The assistant occasionally answered billing questions it was supposed to escalate. And its tone swung between curt and overly chatty depending on the topic.

The Wrong First Theory

The team's first move was to try a newer model version, assuming a regression. Behavior did not improve. That ruled out the model and pointed attention back at the prompt itself.

Reading the Assembled Prompt for the First Time

What finally cracked the case was a simple, overdue exercise: printing the exact text sent to the model, with the system prompt, the user message, the retrieved snippets, and the examples all concatenated as the model actually saw them. Nobody on the team had ever looked at this full rendering. The template they edited and the prompt the model received were two different documents, and all the contradictions lived in the gap between them.

The Contradictions in Plain Sight

Once rendered, the conflicts were obvious. A rule said to keep answers brief; another said to always include full steps. A rule said to answer confidently; another said to hedge on anything uncertain. The escalation policy sat in the middle of the prompt, far from the top, with no statement that it outranked a helpful-sounding user request. None of these were subtle once visible, which is exactly why the rendering step mattered.

The Decision

A review of the assembled prompt revealed the real issue: the forty instructions contained at least a dozen latent conflicts, and the prompt gave the model no way to resolve any of them. The team decided to stop adding rules and instead impose a hierarchy.

Reframing the Problem

Rather than asking "what rule do we need to add," they asked "when two existing rules collide, which should win." That single reframing turned a sprawling list into a tractable design problem, the same shift described in Opinionated Rules for Resolving Prompt Instruction Conflicts.

The Execution

The rewrite happened in four deliberate steps over a week.

Step 1: Inventory and Group

Every instruction was listed and sorted into hard constraints, behavioral rules, and preferences. The escalation policy and forbidden-topics list became hard constraints; tone and formatting became preferences.

Step 2: Declare Precedence

A short block near the top stated the order: hard constraints first, then behavioral rules, then user requests, then preferences. The most critical escalation rule was restated at the very end of the prompt.

Step 3: Reconcile the Examples

The few-shot examples were audited against the new rules. Three contradicted the tone preference and were rewritten; one demonstrating an escalation scenario was added so the policy had a positive example.

Step 4: Build Conflict Tests

The team wrote fifteen test inputs specifically designed to pit rules against each other, including a billing question that should escalate and a request that pushed against a forbidden topic. They tracked how often the higher-priority rule won, using the approach in How to Measure Instruction Hierarchy and Priority Conflicts: Metrics That Matter.

The Outcome

After the rewrite, the conflict test suite went from passing roughly sixty percent of cases to passing every case across repeated runs. In production over the following month, escalation-policy violations dropped to near zero, and support leads stopped reporting tone swings. The prompt was also forty percent shorter, because pruning conflicting rules removed the need for several patch rules.

The Unexpected Benefit

Debugging got dramatically faster. When a new issue surfaced, the team could ask which tier it belonged to and where in the hierarchy it sat, instead of guessing among forty flat rules. The structure they built is essentially the one described in A Framework for Instruction Hierarchy and Priority Conflicts.

What the Numbers Looked Like Over Time

The team kept running the conflict suite weekly after launch. The win rate held at one hundred percent on the original fifteen cases, but production sampling surfaced two new conflict types within the first month that the suite had not anticipated. Both came from unusual user phrasings rather than from the rules themselves. Because the hierarchy was already in place, resolving each one took adding a single explicit override and one test case, not another round of firefighting. The structure turned new conflicts from emergencies into routine maintenance.

The Lessons

The team came away with three durable habits. First, a prompt that misbehaves is a hierarchy problem until proven otherwise. Second, adding rules to fix conflicts makes things worse; resolving the conflict directly makes things better. Third, conflict-specific tests catch what happy-path tests never will.

Changing How New Prompts Were Written

The bigger shift was upstream. New prompts now started with a precedence block and tiered sections from the first draft, rather than growing into a flat pile and being untangled later. The team also adopted a rule that no instruction could be added without first checking whether it conflicted with an existing one, and if so, stating the winner. This turned conflict resolution from a periodic cleanup into a habit applied at the moment each rule was written, which is far cheaper than retrofitting it across forty rules under production pressure.

Why the Approach Generalized

The team initially treated this as a fix for one assistant, but the same pattern resolved issues in two other prompts within the quarter. The reason is that the failure mode is not specific to any domain. Any prompt that accumulates rules without a hierarchy drifts toward the same contradictory state, and the same inventory-tier-test sequence pulls it back. The case study became a template the team reused, not a one-off rescue.

Frequently Asked Questions

Why did swapping the model not help?

Because the problem was in the prompt, not the model. The instructions contradicted each other, and no model can reliably satisfy a directive set that cannot all be true at once.

How long did the rewrite take?

About a week of part-time effort: a day to inventory, a day to declare precedence and reconcile examples, and the rest building and running the conflict test suite.

Was the shorter prompt a goal or a side effect?

A side effect. Pruning came from removing patch rules that only existed to paper over conflicts. Once the conflicts were resolved at the source, those patches were unnecessary.

What single change drove the biggest improvement?

Declaring an explicit precedence order. It gave the model a tiebreaker for every collision and was responsible for most of the jump in test pass rate.

Key Takeaways

An assistant that degrades without prompt changes usually has latent conflicts, not a model regression.
Reframe "what rule should we add" into "which rule wins when two collide."
Inventory and tier every instruction into hard constraints, behavioral rules, and preferences.
Reconcile few-shot examples with the rules and add a positive example for each critical policy.
Conflict-specific test suites expose failures that happy-path testing misses entirely.
Resolving conflicts at the source often shortens the prompt and speeds future debugging.

The names and specifics are generalized, but the diagnostic path and the fixes are exactly the kind we see resolve these issues in practice.

The Situation

The Symptoms

The Wrong First Theory

The team's first move was to try a newer model version, assuming a regression. Behavior did not improve. That ruled out the model and pointed attention back at the prompt itself.

Reading the Assembled Prompt for the First Time

The Contradictions in Plain Sight

The Decision

Reframing the Problem

The Execution

The rewrite happened in four deliberate steps over a week.

Step 1: Inventory and Group

Step 2: Declare Precedence

Step 3: Reconcile the Examples

Step 4: Build Conflict Tests

The Outcome

The Unexpected Benefit

What the Numbers Looked Like Over Time

The Lessons

Changing How New Prompts Were Written

Why the Approach Generalized

Frequently Asked Questions

Why did swapping the model not help?

Because the problem was in the prompt, not the model. The instructions contradicted each other, and no model can reliably satisfy a directive set that cannot all be true at once.

How long did the rewrite take?

About a week of part-time effort: a day to inventory, a day to declare precedence and reconcile examples, and the rest building and running the conflict test suite.

Was the shorter prompt a goal or a side effect?

A side effect. Pruning came from removing patch rules that only existed to paper over conflicts. Once the conflicts were resolved at the source, those patches were unnecessary.

What single change drove the biggest improvement?

Declaring an explicit precedence order. It gave the model a tiebreaker for every collision and was responsible for most of the jump in test pass rate.

Key Takeaways

An assistant that degrades without prompt changes usually has latent conflicts, not a model regression.
Reframe "what rule should we add" into "which rule wins when two collide."
Inventory and tier every instruction into hard constraints, behavioral rules, and preferences.
Reconcile few-shot examples with the rules and add a positive example for each critical policy.
Conflict-specific test suites expose failures that happy-path testing misses entirely.
Resolving conflicts at the source often shortens the prompt and speeds future debugging.

A Prompt That Kept Ignoring Its Rules, and the Fix

The Situation

The Symptoms

The Wrong First Theory

Reading the Assembled Prompt for the First Time

The Contradictions in Plain Sight

The Decision

Reframing the Problem

The Execution

Step 1: Inventory and Group

Step 2: Declare Precedence

Step 3: Reconcile the Examples

Step 4: Build Conflict Tests

The Outcome

The Unexpected Benefit

What the Numbers Looked Like Over Time

The Lessons

Changing How New Prompts Were Written

Why the Approach Generalized

Frequently Asked Questions

Why did swapping the model not help?

How long did the rewrite take?

Was the shorter prompt a goal or a side effect?

What single change drove the biggest improvement?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

A Prompt That Kept Ignoring Its Rules, and the Fix

The Situation

The Symptoms

The Wrong First Theory

Reading the Assembled Prompt for the First Time

The Contradictions in Plain Sight

The Decision

Reframing the Problem

The Execution

Step 1: Inventory and Group

Step 2: Declare Precedence

Step 3: Reconcile the Examples

Step 4: Build Conflict Tests

The Outcome

The Unexpected Benefit

What the Numbers Looked Like Over Time

The Lessons

Changing How New Prompts Were Written

Why the Approach Generalized

Frequently Asked Questions

Why did swapping the model not help?

How long did the rewrite take?

Was the shorter prompt a goal or a side effect?

What single change drove the biggest improvement?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?