A Reusable Model for Reading Tone in Text at Scale

Agency Script Editorial

Editorial Team

August 4, 2021 · 6 min read

Teams that build sentiment detection prompts one ad-hoc instruction at a time end up with brittle, untraceable systems that work on the demo and fail on the long tail. The problem is not a lack of clever phrasing. It is the absence of a repeatable structure — a model you can apply to any text classification task and get a defensible result.

This article introduces such a model. We call it DEFINE-DETECT-DOUBT-DOCUMENT, four stages that map to the four things every reliable sentiment prompt must do: establish what the labels mean, classify against that meaning, handle the cases that do not fit cleanly, and record evidence for every decision. The names are mnemonic, not magic. What matters is that each stage closes a specific failure mode.

Use this as scaffolding. Drop your domain into each stage and you will have a prompt that survives contact with messy real-world text.

The value of a named model is not the acronym. It is that it gives a team a shared language for diagnosing failures and a checklist that keeps critical steps from being skipped under deadline pressure. When someone says "the DOUBT stage is weak," everyone knows exactly what is broken and where to look. That shared vocabulary is worth more than any individual clever instruction, because it makes the work repeatable across people and projects rather than locked in one engineer's head.

Stage One: DEFINE the Construct

Most sentiment prompts fail before classification even begins because nobody told the model what the labels mean.

What this stage does

It converts vague labels into observable behavior. "Negative" becomes "an explicit complaint or expression of dissatisfaction toward the product, not the mere presence of a problem." Each label gets a definition and at least one counter-example. The counter-example is the part teams forget and the part that does the most work, because it pins down the exact boundary the model would otherwise guess at. A definition tells the model what a label is; a counter-example tells it what the label is not, which is usually where the errors live.
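One way to keep definitions and counter-examples from drifting apart is to store them together and render the prompt text from that single source. The sketch below is illustrative only: the `LabelDefinition` class, the `render_define_section` helper, and the example wording (apart from the "negative" definition quoted above) are assumptions, not tested prompt text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelDefinition:
    """One label, stated as observable behavior plus a boundary case."""
    name: str
    definition: str
    counter_example: str  # text that resembles the label but should NOT get it


# Hypothetical definitions; adapt the wording to your own domain.
LABELS = [
    LabelDefinition(
        name="negative",
        definition=(
            "An explicit complaint or expression of dissatisfaction "
            "toward the product, not the mere presence of a problem."
        ),
        counter_example="The app crashes when I tap Export on Android 14.",
    ),
    LabelDefinition(
        name="neutral",
        definition="A factual statement or question with no expressed evaluation.",
        counter_example="I can't believe how long this took to fix.",
    ),
]


def render_define_section(labels: list[LabelDefinition]) -> str:
    """Render the DEFINE stage as plain text for inclusion in a prompt."""
    lines = ["Label definitions:"]
    for label in labels:
        lines.append(f"- {label.name}: {label.definition}")
        lines.append(f'  NOT {label.name}: "{label.counter_example}"')
    return "\n".join(lines)


define_section = render_define_section(LABELS)
print(define_section)
```

Keeping the counter-example as a required field means a label cannot be added without one, which is the step the text above says teams tend to forget.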

When it matters most

Always, but especially when your text contains problem-reporting that is emotionally neutral: bug reports, factual returns, technical questions. This stage is the single highest-leverage move, as shown in "When a Brand Stopped Trusting Its Review Tagger, We Rebuilt It."

Stage Two: DETECT With Structure

Once meaning is fixed, you ask the model to classify — but the shape of the request controls quality.

What this stage does

It specifies the unit (sentence, turn, document), permits multiple labels with intensity when appropriate, and pins the output format to a strict schema so downstream systems do not break.

Decisions inside this stage

  • One label or several? Mixed text needs several with intensity scores.
  • What is the target of sentiment — product, brand, or the writer's situation?
  • What format does the consumer of this output require?

Concrete phrasings for this stage appear in "Concrete Sentiment Prompts That Worked (and the Ones That Backfired)."
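A strict schema only protects downstream systems if replies are checked against it. The following is a minimal sketch of such a check using the standard `json` module. The reply shape, the field names (`labels`, `label`, `target`, `intensity`), and the allowed values are assumptions chosen to mirror the decisions listed above, not a schema the article prescribes.

```python
import json

# Hypothetical vocabularies; replace with your own label and target sets.
ALLOWED_LABELS = {"positive", "negative", "neutral"}
ALLOWED_TARGETS = {"product", "brand", "writer_situation"}


def parse_detection(raw: str) -> list[dict]:
    """Parse a model reply and validate it against the assumed schema.

    Assumed shape:
      {"unit": "sentence",
       "labels": [{"label": ..., "target": ..., "intensity": 0.0-1.0}, ...]}
    Raises ValueError on any deviation so bad output never reaches consumers.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict) or not isinstance(data.get("labels"), list):
        raise ValueError("reply must be an object with a 'labels' list")

    validated = []
    for item in data["labels"]:
        if not isinstance(item, dict):
            raise ValueError("each entry in 'labels' must be an object")
        label = item.get("label")
        target = item.get("target")
        intensity = item.get("intensity")
        if not isinstance(label, str) or label not in ALLOWED_LABELS:
            raise ValueError(f"unknown label: {label!r}")
        if not isinstance(target, str) or target not in ALLOWED_TARGETS:
            raise ValueError(f"unknown target: {target!r}")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(intensity, bool) or not isinstance(intensity, (int, float)):
            raise ValueError(f"intensity must be a number: {intensity!r}")
        if not 0.0 <= intensity <= 1.0:
            raise ValueError(f"intensity out of range: {intensity!r}")
        validated.append(
            {"label": label, "target": target, "intensity": float(intensity)}
        )
    return validated
```

Mixed text is represented here as several entries in `labels`, each with its own target and intensity, which is one way to express the multi-label rule.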

Stage Three: DOUBT the Hard Cases

The difference between a toy and a production system is how it handles ambiguity.

What this stage does

It gives the model an explicit "uncertain" or "ambiguous" path for cases where signals conflict — sarcasm, mixed emotion, missing context. Those items route to humans instead of getting confident, wrong labels.

Why doubt is a feature

A flagged unknown preserves accuracy on everything else and tells you exactly where the model needs help. Systems that never say "I don't know" are systems that are confidently wrong somewhere you cannot see.
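The routing step can be sketched in a few lines. This example assumes each classified item is a dictionary with an `id`, a `label`, and an optional `reason`, and that the prompt uses the literal label `uncertain` for the doubt path. These names are hypothetical.

```python
def split_by_doubt(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate committed labels from items the model flagged as uncertain.

    Returns (accepted, review_queue). Items in the review queue keep the
    model's stated reason so a human can see why the item was flagged.
    """
    accepted = []
    review_queue = []
    for result in results:
        if result["label"] == "uncertain":
            review_queue.append(
                {
                    "id": result["id"],
                    # A missing reason is itself worth surfacing to reviewers.
                    "reason": result.get("reason") or "no reason given",
                }
            )
        else:
            accepted.append(result)
    return accepted, review_queue


results = [
    {"id": 1, "label": "negative"},
    {"id": 2, "label": "uncertain", "reason": "possible sarcasm"},
    {"id": 3, "label": "uncertain"},
]
accepted, review_queue = split_by_doubt(results)
print(len(accepted), "accepted;", len(review_queue), "sent to human review")
```

The reasons collected in the review queue double as a running list of where the prompt needs help.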

Stage Four: DOCUMENT the Evidence

A label without grounding is unauditable, and unauditable systems lose stakeholder trust.

What this stage does

It requires the model to quote the specific phrase driving each label. That quote improves accuracy (the model must ground its reasoning), enables auditing, and exposes hallucinated logic.
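A quote is only evidence if it actually appears in the source text. A simple way to catch invented quotes is a substring check after light normalization, sketched below. The function names are hypothetical, and the normalization (lowercasing and collapsing whitespace) is an assumption; stricter or looser matching may suit your data better.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not fail the check."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_grounded(source_text: str, quote: str) -> bool:
    """Return True if the quoted phrase occurs in the source text.

    An empty quote never counts as grounded.
    """
    normalized_quote = normalize(quote)
    return bool(normalized_quote) and normalized_quote in normalize(source_text)


review = "The battery is great, but the screen  scratches easily."
print(quote_is_grounded(review, "screen scratches easily"))
print(quote_is_grounded(review, "screen is terrible"))
```

Labels whose quotes fail this check are good candidates for the human review queue, since the model's stated evidence does not exist in the text.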

How documentation pays off

When a stakeholder disputes a label, you point to the quote. When you debug a systematic error, the quotes reveal the pattern. When you measure quality, the quotes anchor your evaluation, which connects directly to "Reading the Signal: Scoring Sentiment Systems You Can Trust."

Putting the Stages Together

A complete prompt walks through all four stages in order: it defines the labels, requests structured detection, offers a doubt path, and demands documented evidence. You can compress them into a single prompt or split them across steps for complex tasks.

A minimal template

  • DEFINE: each label as behavior plus a counter-example
  • DETECT: unit, multi-label rule, target, output schema
  • DOUBT: explicit "uncertain" path with a reason
  • DOCUMENT: a required supporting quote per label
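The template above can be assembled into a single prompt string. The sketch below shows one possible arrangement. The stage wording is illustrative and untested, and `build_prompt` is a hypothetical helper rather than part of any library.

```python
# Illustrative stage text only; re-test the wording against your own model.
DEFINE = (
    "DEFINE\n"
    "negative: an explicit complaint or expression of dissatisfaction toward "
    "the product, not the mere presence of a problem.\n"
    'Not negative: "The app crashes when I tap Export."'
)
DETECT = (
    "DETECT\n"
    "Classify each sentence. Use more than one label, each with an intensity "
    "from 0 to 1, when the sentence is mixed. Name the target of each label. "
    "Reply with JSON only."
)
DOUBT = (
    "DOUBT\n"
    'If signals conflict or context is missing, use the label "uncertain" '
    "and give a one-sentence reason."
)
DOCUMENT = (
    "DOCUMENT\n"
    "For every label, quote the exact phrase from the text that supports it."
)


def build_prompt(text: str) -> str:
    """Join the four stages in order, followed by the text to classify."""
    stages = [DEFINE, DETECT, DOUBT, DOCUMENT]
    return "\n\n".join(stages + [f"TEXT\n{text}"])


prompt = build_prompt("Battery is fine.")
print(prompt)
```

Keeping each stage as a separate constant makes it easy to revise one stage without touching the others, or to split the stages across steps when a single prompt becomes unreliable.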

Adapting the Framework to Different Tasks

The four stages stay constant, but their weight shifts with the task in front of you. Knowing which stage to lean on saves effort.

Polarity classification (positive/negative/neutral)

DEFINE carries most of the load here. Once you nail behavioral definitions and a counter-example for the calm complaint, the other stages are light. DETECT is usually single-label, and DOUBT handles only the rare genuinely mixed case.

Fine-grained emotion detection

DETECT becomes heavy: multi-label, intensity scoring, and a clear target all matter. DOUBT grows too, because adjacent emotions blur and you want the model to flag rather than force a choice. This is the harder task and benefits most from the full structure.

Aspect-based sentiment

When you need sentiment per feature ("battery good, screen bad"), DETECT must specify the aspects and tie each label to one. DOCUMENT earns its keep by quoting the phrase per aspect, which keeps the per-feature labels honest.
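For aspect-based output, a short check can confirm that every label is tied to a declared aspect and backed by a quote from the review. This is a sketch under assumptions: the aspect list, the field names (`aspect`, `label`, `quote`), and the `check_aspect_labels` helper are all hypothetical.

```python
import re

ASPECTS = ("battery", "screen")  # hypothetical aspect list for a device review


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace before comparing strings."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_aspect_labels(review: str, labels: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the labels pass."""
    problems = []
    seen = set()
    for item in labels:
        aspect = item.get("aspect")
        if aspect not in ASPECTS:
            problems.append(f"unknown aspect: {aspect!r}")
            continue
        if aspect in seen:
            problems.append(f"duplicate label for aspect: {aspect}")
        seen.add(aspect)
        quote = _norm(item.get("quote") or "")
        if not quote or quote not in _norm(review):
            problems.append(f"quote for {aspect} not found in review")
    return problems


review = "Battery lasts two days. The screen scratches easily."
labels = [
    {"aspect": "battery", "label": "positive", "quote": "Battery lasts two days"},
    {"aspect": "screen", "label": "negative", "quote": "screen scratches easily"},
]
print(check_aspect_labels(review, labels))
```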

Common Failure Modes and Which Stage Fixes Them

Most problems map cleanly to a missing or weak stage. Diagnosing by stage turns vague frustration into a specific repair.

A quick diagnostic

  • Neutral problem-reports tagged negative? Strengthen DEFINE.
  • Mixed-emotion text getting one forced label? Fix DETECT's multi-label rule.
  • Confident, wrong labels on sarcasm? Add or widen the DOUBT path.
  • Stakeholders disputing labels with no way to check? Enforce DOCUMENT.
  • Labels look right but trend reports feel off? Check intensity calibration, a DETECT-and-measure problem covered in "Reading the Signal: Scoring Sentiment Systems You Can Trust."

This stage-to-failure mapping is what makes the model reusable: you are never staring at a broken system wondering where to start. You ask which stage the failure belongs to and fix that one. The same diagnostic underpins the launch list in "Every Step We Run Before Shipping Tone Detection in 2026."

Frequently Asked Questions

Is this framework specific to a particular model?

No. The four stages address failure modes inherent to the task, not to any one model. The exact wording you use in each stage should be re-tested when you switch models, but the structure carries over.

Can I collapse all four stages into one prompt?

Yes, and for most tasks you should — a single well-structured prompt that defines, detects, doubts, and documents. Split the stages into separate steps only when the task is complex enough that one prompt becomes unreliable.

Which stage do teams most often skip?

DEFINE and DOUBT. Teams jump straight to detection, then wonder why the model confuses problem-reporting with negativity and why it never flags ambiguous cases. Those two stages prevent the majority of real-world errors.

How does this differ from just writing a detailed prompt?

A detailed prompt without structure can still omit a critical element. The framework guarantees you address all four failure modes — undefined labels, unstructured output, unhandled ambiguity, and ungrounded decisions — rather than hoping you remembered them.

Does requiring quotes slow the system down?

Marginally, in output length and cost. The trade is worth it: grounding improves accuracy and makes every decision auditable. If cost is critical, you can drop quotes in production but keep them during validation.

How do I know the framework is working?

Run a hand-labeled evaluation set before and after applying it. Agreement with human labels should rise, the confident-error rate should fall, and your "uncertain" queue should contain the genuinely hard cases. If those signals do not move, your stage wording needs work.
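The three signals can be computed from two parallel lists of labels. The sketch below assumes the human labels never use `uncertain`, and it defines the confident-error rate as the share of items where the model committed to a label that disagrees with the human one. Both the definition and the `evaluate` helper are assumptions for illustration, and the sample data is invented.

```python
def evaluate(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """Compare model labels with hand labels on the same items."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    total = len(gold)
    pairs = list(zip(gold, predicted))
    agree = sum(g == p for g, p in pairs)
    uncertain = sum(p == "uncertain" for _, p in pairs)
    # Committed to a label (not "uncertain") and got it wrong.
    confident_wrong = sum(p != "uncertain" and p != g for g, p in pairs)
    return {
        "agreement": agree / total,
        "confident_error_rate": confident_wrong / total,
        "uncertain_rate": uncertain / total,
    }


# Invented toy data: the same four items before and after applying the framework.
gold = ["negative", "neutral", "positive", "neutral"]
before = ["negative", "negative", "positive", "negative"]
after = ["negative", "neutral", "positive", "uncertain"]

print("before:", evaluate(gold, before))
print("after: ", evaluate(gold, after))
```

The uncertain rate alone says nothing about whether the queue holds the genuinely hard cases; that still needs a human to read a sample of flagged items.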

Key Takeaways

  • DEFINE-DETECT-DOUBT-DOCUMENT closes the four failure modes of sentiment prompts
  • DEFINE converts vague labels into observable behavior with counter-examples
  • DETECT structures the request: unit, multi-label rule, target, and output schema
  • DOUBT gives the model an explicit path for ambiguous cases, routed to humans
  • DOCUMENT requires a supporting quote, making every label grounded and auditable
  • The structure is model-agnostic; only the exact wording needs re-testing per model
