A Reusable Model for Reading Tone in Text at Scale

Teams that build sentiment detection prompts one ad hoc instruction at a time end up with brittle, untraceable systems that work on the demo and fail on the long tail. The problem is not a lack of clever phrasing. It is the absence of a repeatable structure: a framework you can apply to any text classification task and get a defensible result.

This article introduces such a framework. We call it DEFINE-DETECT-DOUBT-DOCUMENT: four stages that map to the four things every reliable sentiment prompt must do. It must establish what the labels mean, classify against that meaning, handle the cases that do not fit cleanly, and record evidence for every decision. The names are mnemonic, not magic. What matters is that each stage closes a specific failure mode.

Use this as scaffolding. Drop your domain into each stage and you will have a prompt that survives contact with messy real-world text.

The value of a named model is not the acronym. It is that it gives a team a shared language for diagnosing failures and a checklist that keeps critical steps from being skipped under deadline pressure. When someone says "the DOUBT stage is weak," everyone knows what is broken and where to look. That shared vocabulary is worth more than any individual clever instruction, because it makes the work repeatable across people and projects rather than locked in one engineer's head.

Stage One: DEFINE the Construct

Most sentiment prompts fail before classification even begins because nobody told the model what the labels mean.

What this stage does

It converts vague labels into observable behavior. "Negative" becomes "an explicit complaint or expression of dissatisfaction toward the product, not the mere presence of a problem." Each label gets a definition and at least one counter-example. The counter-example is the part teams forget and the part that does the most work, because it pins down the exact boundary the model would otherwise guess at. A definition tells the model what a label is; a counter-example tells it what the label is not, which is usually where the errors live.
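One way to keep each definition next to its counter-example is to store both as data and render the DEFINE section from that data. The sketch below is illustrative only: the `LabelDefinition` class, the `render_define_section` helper, and all example wording other than the "negative" definition quoted above are assumptions, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelDefinition:
    """One label, described as observable behavior plus a boundary case."""
    name: str
    definition: str
    counter_example: str  # text that resembles the label but does not qualify


# Hypothetical label set; only the "negative" definition comes from the article.
LABELS = [
    LabelDefinition(
        name="negative",
        definition=(
            "An explicit complaint or expression of dissatisfaction toward "
            "the product, not the mere presence of a problem."
        ),
        counter_example="The app crashes when I open settings on version 2.1.",
    ),
    LabelDefinition(
        name="neutral",
        definition="Factual statements or questions with no stated evaluation.",
        counter_example="I can't believe how slow this has become.",
    ),
]


def render_define_section(labels: list[LabelDefinition]) -> str:
    """Render the DEFINE block of a prompt from structured definitions."""
    lines = ["Label definitions:"]
    for label in labels:
        lines.append(f"- {label.name}: {label.definition}")
        lines.append(f'  NOT {label.name}: "{label.counter_example}"')
    return "\n".join(lines)


print(render_define_section(LABELS))
```

Because the counter-example is a required field, a label cannot be added without one, which guards against the omission described above.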

When it matters most

Always, but especially when your text contains problem-reporting that is emotionally neutral, such as bug reports, factual returns, and technical questions. This stage is the single highest-leverage move, as shown in "When a Brand Stopped Trusting Its Review Tagger, We Rebuilt It."

Stage Two: DETECT With Structure

Once meaning is fixed, you ask the model to classify — but the shape of the request controls quality.

What this stage does

It specifies the unit (sentence, turn, document), permits multiple labels with intensity when appropriate, and pins the output format to a strict schema so downstream systems do not break.

Decisions inside this stage

One label or several? Mixed text needs several with intensity scores.
What is the target of sentiment — product, brand, or the writer's situation?
What format does the consumer of this output require?
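The decisions above can be pinned down as a schema that is enforced when the model's output comes back. The following is a minimal sketch, assuming the model is asked for a JSON list of objects with `label`, `intensity`, and `target` fields. The field names, the allowed values, and the 0.0 to 1.0 intensity scale are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass

ALLOWED_LABELS = {"positive", "negative", "neutral"}        # assumed label set
ALLOWED_TARGETS = {"product", "brand", "writer_situation"}  # assumed targets


@dataclass(frozen=True)
class Detection:
    label: str
    intensity: float  # assumed scale: 0.0 (faint) to 1.0 (strong)
    target: str


def parse_detections(raw: str) -> list[Detection]:
    """Parse model output and reject anything outside the agreed schema."""
    payload = json.loads(raw)
    if not isinstance(payload, list):
        raise ValueError("expected a JSON list of detections")
    detections = []
    for item in payload:
        if not isinstance(item, dict):
            raise ValueError("each detection must be a JSON object")
        label = item.get("label")
        target = item.get("target")
        intensity = item.get("intensity")
        if not isinstance(label, str) or label not in ALLOWED_LABELS:
            raise ValueError(f"unknown label: {label!r}")
        if not isinstance(target, str) or target not in ALLOWED_TARGETS:
            raise ValueError(f"unknown target: {target!r}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(intensity, bool) or not isinstance(intensity, (int, float)):
            raise ValueError(f"intensity must be a number: {intensity!r}")
        if not 0.0 <= intensity <= 1.0:
            raise ValueError(f"intensity out of range: {intensity!r}")
        detections.append(Detection(label, float(intensity), target))
    return detections


raw_output = (
    '[{"label": "positive", "intensity": 0.8, "target": "product"},'
    ' {"label": "negative", "intensity": 0.4, "target": "brand"}]'
)
for detection in parse_detections(raw_output):
    print(detection)
```

Failing loudly on an unknown label or target is a deliberate choice here: a rejected response can be retried or logged, while a silently accepted one breaks downstream consumers later.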

Concrete phrasings for this stage appear in "Concrete Sentiment Prompts That Worked (and the Ones That Backfired)."

Stage Three: DOUBT the Hard Cases

The difference between a toy and a production system is how it handles ambiguity.

What this stage does

It gives the model an explicit "uncertain" or "ambiguous" path for cases where signals conflict — sarcasm, mixed emotion, missing context. Those items route to humans instead of getting confident, wrong labels.

Why doubt is a feature

A flagged unknown preserves accuracy on everything else and tells you exactly where the model needs help. Systems that never say "I don't know" are systems that are confidently wrong somewhere you cannot see.
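In code, the doubt path is a routing decision: anything the model marks as uncertain goes to a review queue instead of the automated pipeline. The sketch below assumes the model returns the literal label "uncertain" along with a short reason. The `Classification` class and `route` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

UNCERTAIN = "uncertain"  # assumed sentinel label the prompt allows


@dataclass(frozen=True)
class Classification:
    text_id: str
    label: str                    # a real label, or the UNCERTAIN sentinel
    reason: Optional[str] = None  # why the model declined to commit


def route(results: list[Classification]):
    """Split results into auto-accepted labels and a human review queue."""
    accepted, review_queue = [], []
    for result in results:
        if result.label == UNCERTAIN:
            review_queue.append(result)
        else:
            accepted.append(result)
    return accepted, review_queue


results = [
    Classification("t1", "negative"),
    Classification("t2", UNCERTAIN, reason="possible sarcasm"),
    Classification("t3", "positive"),
]
accepted, review_queue = route(results)
print([r.text_id for r in accepted])       # ['t1', 't3']
print([r.reason for r in review_queue])    # ['possible sarcasm']
```

Keeping the reason with each queued item is what makes the queue useful: grouped by reason, it shows where the model needs help.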

Stage Four: DOCUMENT the Evidence

A label without grounding is unauditable, and unauditable systems lose stakeholder trust.

What this stage does

It requires the model to quote the specific phrase driving each label. That quote improves accuracy (the model must ground its reasoning), enables auditing, and exposes hallucinated logic.
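A quote is only useful as evidence if it really appears in the input, so a cheap check is to confirm each returned quote is a substring of the source text. The sketch below shows one way to do that, assuming light normalization (lowercasing and whitespace collapsing) is acceptable for your data. The function names are illustrative.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not fail the check."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_grounded(source: str, quote: str) -> bool:
    """Return True only if the quoted evidence actually appears in the source."""
    normalized_quote = _normalize(quote)
    return bool(normalized_quote) and normalized_quote in _normalize(source)


source = "The battery lasts all day,\n but the screen scratches easily."
print(quote_is_grounded(source, "screen scratches easily"))  # True
print(quote_is_grounded(source, "terrible battery"))         # False
```

A quote that fails this check is a strong hint that the label rests on invented evidence and should be reviewed rather than trusted.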

How documentation pays off

When a stakeholder disputes a label, you point to the quote. When you debug a systematic error, the quotes reveal the pattern. When you measure quality, the quotes anchor your evaluation, which connects directly to "Reading the Signal: Scoring Sentiment Systems You Can Trust."

Putting the Stages Together

A complete prompt walks through all four stages in order: it defines the labels, requests structured detection, offers a doubt path, and demands documented evidence. You can compress them into a single prompt or split them across steps for complex tasks.

A minimal template

DEFINE: each label as behavior plus a counter-example
DETECT: unit, multi-label rule, target, output schema
DOUBT: explicit "uncertain" path with a reason
DOCUMENT: a required supporting quote per label
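The template can be turned into a small builder that refuses to produce a prompt when any stage is missing. This is a sketch under stated assumptions: the bracketed section markers, the `build_prompt` helper, and the example stage wording are all hypothetical, and the exact phrasing would need testing against your own model.

```python
STAGE_ORDER = ("DEFINE", "DETECT", "DOUBT", "DOCUMENT")

# Hypothetical stage wording, for illustration only.
EXAMPLE_STAGES = {
    "DEFINE": (
        "negative = an explicit complaint about the product. "
        "A factual bug report with no stated dissatisfaction is NOT negative."
    ),
    "DETECT": (
        "Classify each sentence. Multiple labels are allowed. "
        "Return JSON objects with label, intensity, and target."
    ),
    "DOUBT": 'If signals conflict, return the label "uncertain" and give a reason.',
    "DOCUMENT": "For every label, quote the exact phrase that supports it.",
}


def build_prompt(stages: dict[str, str], text: str) -> str:
    """Assemble one prompt from the four stages, refusing to skip any."""
    missing = [name for name in STAGE_ORDER if not stages.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing stage(s): {', '.join(missing)}")
    sections = [f"[{name}]\n{stages[name].strip()}" for name in STAGE_ORDER]
    sections.append(f"[TEXT]\n{text}")
    return "\n\n".join(sections)


print(build_prompt(EXAMPLE_STAGES, "Great phone, shame about the charger."))
```

Raising on a missing stage moves the "no step skipped" rule from team discipline into the tooling.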

Adapting the Framework to Different Tasks

The four stages stay constant, but their weight shifts with the task in front of you. Knowing which stage to lean on saves effort.

Polarity classification (positive/negative/neutral)

DEFINE carries most of the load here. Once you nail behavioral definitions and a counter-example for the calm complaint, the other stages are light. DETECT is usually single-label, and DOUBT handles only the rare genuinely-mixed case.

Fine-grained emotion detection

DETECT becomes heavy: multi-label, intensity scoring, and a clear target all matter. DOUBT grows too, because adjacent emotions blur and you want the model to flag rather than force a choice. This is a harder task than polarity classification and benefits most from the full structure.

Aspect-based sentiment

When you need sentiment per feature ("battery good, screen bad"), DETECT must specify the aspects and tie each label to one. DOCUMENT earns its keep by quoting the phrase per aspect, which keeps the per-feature labels honest.
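For aspect-based output, the two requirements above can be checked mechanically: every label must be tied to a declared aspect, and every quote must come from the text. The sketch below assumes each returned item is a dictionary with `aspect`, `label`, and `quote` keys. That shape, the aspect list, and the `check_aspect_labels` name are assumptions for illustration.

```python
ASPECTS = ("battery", "screen")  # assumed aspect list supplied in the DETECT stage


def check_aspect_labels(text: str, items: list[dict], aspects=ASPECTS) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    problems = []
    seen = set()
    for item in items:
        aspect = item.get("aspect")
        quote = item.get("quote") or ""
        if aspect not in aspects:
            problems.append(f"undeclared aspect: {aspect}")
        elif aspect in seen:
            problems.append(f"duplicate aspect: {aspect}")
        seen.add(aspect)
        # Exact substring match; normalize first if your data needs it.
        if not quote or quote not in text:
            problems.append(f"quote not found in text for aspect: {aspect}")
    return problems


text = "Battery is great, but the screen is too dim outdoors."
items = [
    {"aspect": "battery", "label": "positive", "quote": "Battery is great"},
    {"aspect": "screen", "label": "negative", "quote": "screen is too dim"},
]
print(check_aspect_labels(text, items))  # []
```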

Common Failure Modes and Which Stage Fixes Them

Most problems map cleanly to a missing or weak stage. Diagnosing by stage turns vague frustration into a specific repair.

A quick diagnostic

Neutral problem-reports tagged negative? Strengthen DEFINE.
Mixed-emotion text getting one forced label? Fix DETECT's multi-label rule.
Confident, wrong labels on sarcasm? Add or widen the DOUBT path.
Stakeholders disputing labels with no way to check? Enforce DOCUMENT.
Labels look right but trend reports feel off? Check intensity calibration, a DETECT-and-measure problem covered in "Reading the Signal: Scoring Sentiment Systems You Can Trust."

This stage-to-failure mapping is what makes the model reusable: you are never staring at a broken system wondering where to start. You ask which stage the failure belongs to and fix that one. The same diagnostic underpins the launch list in "Every Step We Run Before Shipping Tone Detection in 2026."

Frequently Asked Questions

Is this framework specific to a particular model?

No. The four stages address failure modes inherent to the task, not to any one model. The exact wording you use in each stage should be re-tested when you switch models, but the structure carries over.

Can I collapse all four stages into one prompt?

Yes, and for most tasks you should — a single well-structured prompt that defines, detects, doubts, and documents. Split the stages into separate steps only when the task is complex enough that one prompt becomes unreliable.

Which stage do teams most often skip?

DEFINE and DOUBT. Teams jump straight to detection, then wonder why the model confuses problem-reporting with negativity and why it never flags ambiguous cases. Those two stages prevent the majority of real-world errors.

How does this differ from just writing a detailed prompt?

A detailed prompt without structure can still omit a critical element. The framework makes you address all four failure modes (undefined labels, unstructured output, unhandled ambiguity, and ungrounded decisions) rather than hoping you remembered them.

Does requiring quotes slow the system down?

Marginally. Quotes lengthen the output, which adds some latency and cost. The trade is worth it: grounding improves accuracy and makes every decision auditable. If cost is critical, you can drop quotes in production but keep them during validation.

How do I know the framework is working?

Run a hand-labeled evaluation set before and after applying it. Agreement with human labels should rise, the confident-error rate should fall, and your "uncertain" queue should contain the genuinely hard cases. If those signals do not move, your stage wording needs work.
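The three signals above can be computed from paired lists of human and model labels. The sketch below is one possible definition, not a standard: it assumes a confident error is any wrong prediction that was not flagged "uncertain", and that an "uncertain" prediction counts as non-agreement. The labels and data are invented for illustration.

```python
def evaluate(gold: list[str], predicted: list[str], uncertain: str = "uncertain") -> dict[str, float]:
    """Compare model labels with human labels on the same items."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    total = len(gold)
    agree = sum(g == p for g, p in zip(gold, predicted))
    # Wrong AND not flagged: the model committed to a bad label.
    confident_errors = sum(p != uncertain and p != g for g, p in zip(gold, predicted))
    flagged = sum(p == uncertain for p in predicted)
    return {
        "agreement": agree / total,
        "confident_error_rate": confident_errors / total,
        "uncertain_rate": flagged / total,
    }


# Invented example: the same four items before and after applying the framework.
gold = ["negative", "neutral", "positive", "neutral"]
before = ["negative", "negative", "positive", "negative"]
after = ["negative", "neutral", "positive", "uncertain"]

print(evaluate(gold, before))
# {'agreement': 0.5, 'confident_error_rate': 0.5, 'uncertain_rate': 0.0}
print(evaluate(gold, after))
# {'agreement': 0.75, 'confident_error_rate': 0.0, 'uncertain_rate': 0.25}
```

The uncertain rate alone does not show whether the queue holds the genuinely hard cases. That still requires reading a sample of the flagged items.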

Key Takeaways

DEFINE-DETECT-DOUBT-DOCUMENT closes the four failure modes of sentiment prompts
DEFINE converts vague labels into observable behavior with counter-examples
DETECT structures the request: unit, multi-label rule, target, and output schema
DOUBT gives the model an explicit path for ambiguous cases, routed to humans
DOCUMENT requires a supporting quote, which improves accuracy and makes every label auditable
The structure is model-agnostic; only the exact wording needs re-testing per model

Agency Script Editorial
