A German Retailer's Rewrite of Its Customer-Service Prompts

A mid-market home-goods retailer headquartered in Germany had built a customer-service assistant that worked well in its home market. When the company expanded into France, Spain, Italy, and the Netherlands, it reused the same assistant, translating the prompts into each language. Within a quarter, customer-satisfaction scores told a confusing story: high in Germany and the Netherlands, mediocre in France, and noticeably low in Spain and Italy.

The team's first instinct was that translation quality was to blame. It was not. The translations were grammatically clean. The problem was deeper and more interesting: the prompts encoded a specifically German set of communication assumptions that traveled badly. This is the story of how they diagnosed that, what they changed, and what the rewrite produced.

The details below are a representative composite of how this kind of expansion failure unfolds and gets resolved. The arc, the decisions, and the lessons reflect patterns we see repeatedly, presented without invented precision. We have kept the focus on the reasoning at each turn rather than on numbers, because the transferable lesson is in how the team thought, not in figures specific to one retailer.

The Situation

A Working System, Reused Carelessly

The original assistant was tuned for German customer-service norms: direct, efficient, information-first, with minimal small talk. German and Dutch customers rated it highly because that style matched their expectations. The team assumed the style was simply "good service" rather than "good German service."

The Divergence

When satisfaction scores split by market, the team initially looked at response accuracy, latency, and resolution rate. All were comparable across markets. The metric that diverged was tone-related sentiment in the post-interaction survey, especially in Spain and Italy, where free-text comments mentioned the assistant feeling "cold" or "abrupt."

The Diagnosis

Reading the Free-Text Signal

The breakthrough came from reading the survey comments rather than the numeric scores. The pattern was unmistakable: customers in higher-context cultures experienced the same efficient replies as curt. The information was correct, but the framing violated their expectations for warmth and relationship before transaction.

Locating the Assumption in the Prompt

The team traced the issue to a single instruction in the system prompt: "Answer the customer's question directly and concisely. Avoid unnecessary pleasantries." That instruction was a German communication norm written as if it were universal. The signal-reading approach they used is the one we describe in Reading the Signals That Tell You a Prompt Misread a Culture.

The Decision

Parameterize Communication Style

Rather than fork the prompt into five hand-tuned copies, the team made communication style an explicit parameter with values calibrated per market: high-context and warm for Spain and Italy, balanced for France, direct for Germany and the Netherlands. This kept one prompt architecture while allowing cultural variation.

Add a Native-Reviewer Calibration Phase

For each market, a fluent reviewer evaluated a sample of generated responses against local service expectations and adjusted the style parameter until the tone fit. This calibration loop is the practice we recommend in Designing Prompts That Travel Across Languages and Locales.

The Execution

Rolling Out by Market

The team did not change all five markets at once. They started with Spain, where the problem was worst, calibrated the style parameter with a native reviewer, validated the change with a small A/B test against the old prompt, and only then moved to Italy and France.

Building a Cultural Test Set

Alongside the rollout, they built an adversarial test set of inputs designed to expose tone failures, so that future prompt edits would not silently regress a market they had already fixed. This protected the investment as the prompt continued to evolve.

The Outcome

Satisfaction Converged

After calibration, tone-related sentiment in Spain and Italy rose to match the levels in Germany and the Netherlands. The free-text comments shifted from "cold" and "abrupt" toward "helpful" and "friendly." The numeric resolution metrics, which had always been fine, stayed fine.

The Architecture Paid Off Later

When the company expanded into Portugal the following year, adding the market was a matter of setting the style parameter and running one calibration pass, not a from-scratch rewrite. The parameterized design turned a recurring expansion cost into a configuration step.

A Smaller Support Burden

There was a second, less obvious payoff. Because the tone now fit each market, the volume of escalations where customers asked to speak to a human out of frustration dropped in the previously low-scoring markets. The assistant resolved more interactions on its own, not because it got smarter but because it stopped antagonizing users with a mismatched tone. Cultural fit, it turned out, was partly a deflection-rate problem in disguise.

What the Team Got Wrong First

Blaming Translation Quality

The team's first hypothesis, that the translations were poor, sent them down a dead end for a few weeks. They commissioned translation reviews that came back clean, which was frustrating until it became the clue: if the words were right and the experience was still wrong, the problem had to be above the level of words. Naming that ruled-out hypothesis is what redirected them toward communication style.

Trusting the Numeric Average

They also spent time staring at aggregate satisfaction scores that told them only that Spain and Italy were lower, not why. The aggregate hid the texture. The lesson they drew, and that we draw with them, is to read the qualitative signal early rather than treating it as a last resort after the numbers refuse to explain themselves.

Assuming the Home Style Was the Standard

The most instructive mistake was the quietest one: the team had never questioned whether their German service style was a choice. It felt like simply doing service well. Recognizing that "good service" was actually "good German service" required a deliberate act of stepping outside their own frame, which is hard precisely because the frame is invisible from inside it. The whole episode is a reminder that the most dangerous cultural assumptions are the ones a team does not know it is making.

Lessons

"Good Service" Is Culturally Specific

The deepest lesson was that the team had mistaken a cultural norm for a universal best practice. Direct, efficient service is excellent service in Germany and a tone problem in Spain. Naming that assumption was the entire fix.

Numeric Scores Hid the Problem; Free Text Revealed It

Resolution rate and accuracy looked identical across markets. Only the qualitative survey comments exposed the tone failure. Cultural problems often live in the signal that aggregate numbers smooth over, which is why the team learned to read free text first.

Frequently Asked Questions

Why did translation alone not fix the problem?

Because the problem was not in the words but in the communication style the prompt enforced. A perfectly translated instruction to be direct and skip pleasantries still produces curt-feeling output in a high-context culture. Translation moves the words; it does not move the cultural posture.

How did the team know it was a cultural issue and not a model issue?

The numeric performance metrics were identical across markets, ruling out model capability. The divergence appeared only in tone-related sentiment and only in specific cultures, which pointed squarely at a communication-style mismatch rather than a technical fault.

Was parameterizing style risky compared to forking prompts?

Parameterizing was lower risk over time. Forking five prompts means five things to maintain and five places for regressions to hide. A single architecture with a style parameter localized the cultural decision to one place and made it testable.

How long did the calibration phase take per market?

In this composite, calibration per market was a short cycle: a native reviewer evaluating a sample, adjusting the style parameter, and validating with a small A/B test. The work scaled down as the team reused the process across markets.

What metric finally told them the rewrite worked?

Tone-related sentiment in the post-interaction survey, especially the free-text comments. When "cold" and "abrupt" gave way to "helpful" and "friendly" in the previously low-scoring markets, they had their confirmation.

Could they have caught this before launch?

Yes, with a native-reviewer step and a cultural test set during the initial expansion rather than after satisfaction scores diverged. The case is a reminder that the calibration loop belongs before launch, not after the scores tell you something is wrong.

Key Takeaways

A working prompt reused across cultures failed because it encoded one market's communication norms as universal.
Numeric metrics looked fine across all markets; only free-text sentiment revealed the tone failure.
The fix was to make communication style an explicit, per-market parameter rather than forking prompts.
A native-reviewer calibration phase plus a cultural test set turned the fix into a repeatable, regression-proof process.
Parameterizing culture converted future market expansion from a rewrite into a configuration step.

The Situation

A Working System, Reused Carelessly

The Divergence

The Diagnosis

Reading the Free-Text Signal

Locating the Assumption in the Prompt

The Decision

Parameterize Communication Style

Add a Native-Reviewer Calibration Phase

The Execution

Rolling Out by Market

Building a Cultural Test Set

The Outcome

Satisfaction Converged

The Architecture Paid Off Later

A Smaller Support Burden

What the Team Got Wrong First

Blaming Translation Quality

Trusting the Numeric Average

Assuming the Home Style Was the Standard

Lessons

"Good Service" Is Culturally Specific

Numeric Scores Hid the Problem; Free Text Revealed It

Frequently Asked Questions

Why did translation alone not fix the problem?

How did the team know it was a cultural issue and not a model issue?

Was parameterizing style risky compared to forking prompts?

How long did the calibration phase take per market?

What metric finally told them the rewrite worked?

Could they have caught this before launch?

Key Takeaways

A working prompt reused across cultures failed because it encoded one market's communication norms as universal.
Numeric metrics looked fine across all markets; only free-text sentiment revealed the tone failure.
The fix was to make communication style an explicit, per-market parameter rather than forking prompts.
A native-reviewer calibration phase plus a cultural test set turned the fix into a repeatable, regression-proof process.
Parameterizing culture converted future market expansion from a rewrite into a configuration step.

A German Retailer's Rewrite of Its Customer-Service Prompts

The Situation

A Working System, Reused Carelessly

The Divergence

The Diagnosis

Reading the Free-Text Signal

Locating the Assumption in the Prompt

The Decision

Parameterize Communication Style

Add a Native-Reviewer Calibration Phase

The Execution

Rolling Out by Market

Building a Cultural Test Set

The Outcome

Satisfaction Converged

The Architecture Paid Off Later

A Smaller Support Burden

What the Team Got Wrong First

Blaming Translation Quality

Trusting the Numeric Average

Assuming the Home Style Was the Standard

Lessons

"Good Service" Is Culturally Specific

Numeric Scores Hid the Problem; Free Text Revealed It

Frequently Asked Questions

Why did translation alone not fix the problem?

How did the team know it was a cultural issue and not a model issue?

Was parameterizing style risky compared to forking prompts?

How long did the calibration phase take per market?

What metric finally told them the rewrite worked?

Could they have caught this before launch?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

A German Retailer's Rewrite of Its Customer-Service Prompts

The Situation

A Working System, Reused Carelessly

The Divergence

The Diagnosis

Reading the Free-Text Signal

Locating the Assumption in the Prompt

The Decision

Parameterize Communication Style

Add a Native-Reviewer Calibration Phase

The Execution

Rolling Out by Market

Building a Cultural Test Set

The Outcome

Satisfaction Converged

The Architecture Paid Off Later

A Smaller Support Burden

What the Team Got Wrong First

Blaming Translation Quality

Trusting the Numeric Average

Assuming the Home Style Was the Standard

Lessons

"Good Service" Is Culturally Specific

Numeric Scores Hid the Problem; Free Text Revealed It

Frequently Asked Questions

Why did translation alone not fix the problem?

How did the team know it was a cultural issue and not a model issue?

Was parameterizing style risky compared to forking prompts?

How long did the calibration phase take per market?

What metric finally told them the rewrite worked?

Could they have caught this before launch?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?