Most advice about sentiment analysis stops at theory. You read that you should "be specific" and "define your labels," then you stare at a blank prompt window with a thousand support tickets to classify and no idea what "specific" actually looks like. The gap between principle and practice is where most projects stall.
This article skips the abstractions. Below are concrete prompts pulled from real classification work — customer support, product reviews, sales call transcripts, and social monitoring — paired with the version that failed first and the version that fixed it. Each example shows the exact phrasing change that moved accuracy, because the difference between a prompt that works and one that quietly produces garbage is often a single sentence.
Read these less as templates to copy verbatim and more as a pattern library. The wording will differ for your domain, but the failure modes repeat across almost every team that attempts this work.
Classifying Support Tickets by Frustration Level
A SaaS support team wanted to triage incoming tickets by how upset the customer was, so angry messages got a human faster.
What failed first
The initial prompt said: "Read this support ticket and tell me if the customer is happy, neutral, or angry." It returned "angry" for any ticket containing words like "broken," "error," or "not working" — even when the customer was calm and descriptive. The model was matching vocabulary, not affect.
What fixed it
Adding a definition of the construct changed everything: "Frustration here means emotional escalation — blame, threats to cancel, all-caps, repeated punctuation, or sarcasm — not merely reporting that something is broken. A calm bug report is neutral." Accuracy on a 200-ticket hand-labeled set jumped because the model stopped conflating problem-reporting with emotion.
- Define the emotion as a behavior, not a topic
- Give an explicit counter-example (calm bug report = neutral)
- Name the surface signals that indicate escalation
Detecting Emotion in Product Reviews
A retail brand wanted to go beyond star ratings and tag reviews with specific emotions: delight, disappointment, regret, relief.
The mixed-emotion problem
Single-label prompts forced the model to pick one emotion for reviews that clearly contained two ("Shipping was a nightmare but the product is incredible"). The fix was permitting multiple labels with intensity: "Return up to two emotions, each with a 1-5 intensity, only if clearly expressed. If an emotion is implied but weak, omit it." This matched how people actually write and cut the rate of forced, wrong single labels.
For deeper guidance on structuring multi-label output, see A Reusable Model for Reading Tone in Text at Scale.
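The "up to two emotions, each with a 1-5 intensity" rule is easy to enforce in code once the model replies in a structured format. The sketch below assumes the model is asked to answer with a JSON list of objects; that reply shape and the function name `parse_emotions` are assumptions for illustration, while the four emotion names come from the brand's label set described above.

```python
import json

# Emotion names come from the article; the JSON reply shape is an assumption.
ALLOWED_EMOTIONS = {"delight", "disappointment", "regret", "relief"}
MAX_LABELS = 2


def parse_emotions(raw: str) -> list[tuple[str, int]]:
    """Validate a reply like '[{"emotion": "delight", "intensity": 4}]'.

    Returns (emotion, intensity) pairs; raises ValueError on any rule violation.
    """
    items = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(items, list) or len(items) > MAX_LABELS:
        raise ValueError(f"expected a list of at most {MAX_LABELS} emotions")
    parsed = []
    for item in items:
        if not isinstance(item, dict):
            raise ValueError(f"expected an object, got {item!r}")
        emotion = item.get("emotion")
        intensity = item.get("intensity")
        if not isinstance(emotion, str) or emotion not in ALLOWED_EMOTIONS:
            raise ValueError(f"unknown emotion: {emotion!r}")
        # bool is a subclass of int, so reject it explicitly
        if isinstance(intensity, bool) or not isinstance(intensity, int):
            raise ValueError(f"intensity must be an integer: {intensity!r}")
        if not 1 <= intensity <= 5:
            raise ValueError(f"intensity out of range: {intensity}")
        parsed.append((emotion, intensity))
    return parsed
```

An empty list is a valid reply here, which matches the instruction to omit emotions that are implied but weak.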
Reading Sentiment in Sales Call Transcripts
Transcripts are messy: filler words, interruptions, two speakers. A revenue team wanted per-speaker sentiment trends across a call.
Separating the speakers
The prompt that worked instructed the model to ignore the rep entirely and score only prospect turns: "Analyze only lines labeled PROSPECT. Score each as positive, neutral, or negative toward the deal, and quote the phrase driving your score." Requiring a supporting quote did double duty — it improved accuracy and gave reviewers something to audit.
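The team's fix was an instruction in the prompt. A complementary step, not described above and offered only as an option, is to pre-filter the transcript in code so the model never sees the rep's turns. The sketch assumes each turn sits on its own line in the form `SPEAKER: text`; that format and the function name `prospect_turns` are assumptions.

```python
def prospect_turns(transcript: str, speaker: str = "PROSPECT") -> list[str]:
    """Return the text of each turn spoken by `speaker`, in call order.

    Assumes one turn per line, formatted as 'SPEAKER: text'.
    """
    prefix = f"{speaker}:"
    turns = []
    for line in transcript.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            turns.append(line[len(prefix):].strip())
    return turns


if __name__ == "__main__":
    call = "REP: Any concerns so far?\nPROSPECT: Honestly, the pricing worries me."
    print(prospect_turns(call))
```

Filtering first also keeps the per-turn order intact, which is what a sentiment trend across the call depends on.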
Social Monitoring Where Sarcasm Lives
Brand mentions on social media are the hardest case because sarcasm inverts literal sentiment.
Forcing the model to flag uncertainty
Rather than chase perfect sarcasm detection, the winning approach added an explicit escape hatch: "If the literal meaning and likely intent conflict (possible sarcasm or irony), label it 'ambiguous' and explain the conflict." Ambiguous items were routed to a human. This traded coverage for trust, and trust is what made the system shippable. The same restraint is the subject of When a Brand Stopped Trusting Its Review Tagger, We Rebuilt It.
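Routing on the "ambiguous" label can be a few lines of code. This is a sketch under assumptions: the `Classification` record and the `route` function are hypothetical names, and the example mentions are invented.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    text: str
    label: str             # e.g. "positive", "negative", "neutral", "ambiguous"
    explanation: str = ""  # the model's note on the literal-vs-intent conflict


def route(
    results: list[Classification],
) -> tuple[list[Classification], list[Classification]]:
    """Split results into (auto_accepted, needs_human_review)."""
    auto, review = [], []
    for result in results:
        if result.label == "ambiguous":
            review.append(result)
        else:
            auto.append(result)
    return auto, review


if __name__ == "__main__":
    batch = [
        Classification("Love the new app", "positive"),
        Classification("Great, another outage.", "ambiguous",
                       "praise word, complaint context"),
    ]
    auto, review = route(batch)
    print(len(auto), "auto-accepted;", len(review), "sent to a human")
```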
A Side-by-Side of Weak vs. Strong Phrasing
The pattern across all four cases is the same. Weak prompts name a topic and a label set. Strong prompts define the construct behaviorally, supply counter-examples, allow uncertainty, and demand evidence.
Quick reference
- Weak: "Is this positive or negative?"
- Strong: "Classify sentiment toward [target]. Positive = explicit approval or satisfaction. Negative = explicit complaint or dissatisfaction. Neutral = factual with no clear stance. If unclear, return 'uncertain.' Quote the phrase driving your decision."
Notice how much the strong version carries that the weak one assumes. It names the target of sentiment, defines each label in terms of what the writer actually does, provides an escape hatch, and demands evidence. None of that is clever wording — it is simply refusing to leave the model guessing at the parts a human reviewer would have clarified instinctively. Every example above is a variation on closing one of those gaps.
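One way to keep the strong phrasing reusable is to treat the target as a required parameter. The template text below is the "Strong" example from the quick reference; the function name `build_sentiment_prompt` and the "Text:" layout are assumptions.

```python
# Template taken from the "Strong" example; {target} fills the [target] slot.
STRONG_TEMPLATE = (
    "Classify sentiment toward {target}. "
    "Positive = explicit approval or satisfaction. "
    "Negative = explicit complaint or dissatisfaction. "
    "Neutral = factual with no clear stance. "
    "If unclear, return 'uncertain.' "
    "Quote the phrase driving your decision.\n\n"
    "Text:\n{text}"
)


def build_sentiment_prompt(target: str, text: str) -> str:
    """Fill the template, refusing to build a prompt with no named target."""
    if not target.strip():
        raise ValueError("name the target of sentiment explicitly")
    return STRONG_TEMPLATE.format(target=target.strip(), text=text.strip())


if __name__ == "__main__":
    print(build_sentiment_prompt("the checkout flow", "Paying took three tries."))
```

Raising an error on an empty target is a deliberate choice in this sketch: it makes the weak version ("Is this positive or negative?") impossible to produce by accident.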
When you are ready to verify these gains hold up, the instrumentation in Reading the Signal: Scoring Sentiment Systems You Can Trust shows how to measure whether a phrasing change actually helped.
Handling Domain Jargon and Insider Language
Generic prompts stumble badly on text full of product names, abbreviations, and industry shorthand, because the model has no idea whether a term is praise, a complaint, or neutral description.
Where it breaks
A B2B software company found its prompt tagging neutral feature requests as negative because phrases like "the API throttles us" and "we got rate-limited" read as complaints to a model that did not know these were ordinary technical descriptions. The polarity was wrong roughly a third of the time on jargon-heavy tickets.
What fixed it
A short glossary embedded in the prompt resolved most of it: "In this domain, 'rate-limited,' 'throttled,' and 'sandboxed' are neutral technical states unless paired with explicit frustration. Treat them as neutral descriptions of system behavior." Feeding the model the same shared vocabulary your team uses closes the gap between topic and tone.
- List the domain terms the model will misread
- State their default polarity (usually neutral)
- Note the conditions that flip them negative
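The three steps above can be sketched as a small helper that turns a term list into the glossary sentence. The terms and wording follow the example prompt; the function name `glossary_instruction` and its parameters are hypothetical.

```python
# Terms and default polarity are taken from the example above.
NEUTRAL_TERMS = ["rate-limited", "throttled", "sandboxed"]


def glossary_instruction(
    terms: list[str],
    flip_condition: str = "explicit frustration",
) -> str:
    """Build a glossary sentence stating default polarity and what flips it."""
    if not terms:
        raise ValueError("list at least one domain term")
    quoted = ", ".join(f"'{term}'" for term in terms)
    return (
        f"In this domain, the following terms are neutral technical states "
        f"unless paired with {flip_condition}: {quoted}. "
        "Treat them as neutral descriptions of system behavior."
    )


if __name__ == "__main__":
    print(glossary_instruction(NEUTRAL_TERMS))
```

Keeping the term list in data rather than in prose means support or engineering staff can extend it without touching the rest of the prompt.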
Catching Implicit and Understated Negativity
Not every complaint is loud. Some of the most important negative signals are polite, understated, or buried in a backhanded compliment.
The understatement trap
A hospitality brand kept missing quietly dissatisfied guests because reviews like "It was fine, I suppose, for the price" scored neutral or even positive on the strength of the word "fine." These lukewarm reviews predicted churn better than the loud one-star rants, yet the prompt passed over them.
The fix that surfaced them
The instruction that worked named the pattern directly: "Treat hedged or faint praise — 'fine,' 'okay,' 'I guess,' 'for the price' — as mildly negative signals of unmet expectations, not as positive. Flag these as 'lukewarm.'" Adding a dedicated lukewarm band gave the team an early-warning lane they had been blind to. This kind of nuance is exactly what coarse three-bucket systems miss, and it connects to the granularity argument in Granular Emotion and Honest Uncertainty Are Reshaping Tone Detection.
Frequently Asked Questions
Why does the model label calm bug reports as angry?
Because it pattern-matches negative vocabulary to negative emotion. Words like "broken," "failed," and "error" are topically negative but emotionally neutral. Fix it by defining the emotion as a behavior — escalation, blame, threats — and giving an explicit example of a calm complaint that should score neutral.
Should I allow multiple emotion labels per item?
Yes, when your text contains mixed emotions, which most real-world reviews and messages do. Cap the number (usually two) and require an intensity score so weak signals get filtered. Forcing a single label on genuinely mixed text manufactures errors.
How do I handle sarcasm without a perfect detector?
Do not try to solve sarcasm perfectly. Give the model an "ambiguous" option for cases where literal meaning and likely intent conflict, then route those to human review. Flagging uncertainty is more valuable than guessing confidently and being wrong.
Why require the model to quote supporting text?
A required quote forces the model to ground its label in evidence, which improves accuracy and gives human reviewers a fast way to audit decisions. It also exposes hallucinated reasoning: the model must point to actual words in the input, and a quote that does not appear there is easy to catch.
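Checking that a quote really appears in the input can be automated. The sketch below is one possible check, not a method described in this article: the function name `quote_is_grounded` is hypothetical, and the choice to ignore case and whitespace differences is an assumption you may want to tighten.

```python
def quote_is_grounded(quote: str, source: str) -> bool:
    """True if `quote` appears in `source`, ignoring case and extra whitespace."""

    def normalize(s: str) -> str:
        # Collapse runs of whitespace and compare case-insensitively.
        return " ".join(s.split()).casefold()

    normalized_quote = normalize(quote)
    # An empty quote is never treated as evidence.
    return bool(normalized_quote) and normalized_quote in normalize(source)


if __name__ == "__main__":
    ticket = "Third outage this week.  Fix it or we cancel."
    print(quote_is_grounded("fix it or we cancel", ticket))   # True
    print(quote_is_grounded("worst product ever", ticket))    # False
```

Labels whose quote fails this check are good candidates for the same human-review lane used for ambiguous items.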
Do these examples work the same across different models?
The principles transfer, but exact phrasing should be re-tested when you switch models. A definition that disambiguates well for one model may be redundant or insufficient for another. Treat every model change as a reason to re-run your labeled evaluation set.
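Re-running the labeled set is cheap if the classifier is wrapped as a plain function. In this sketch, `classify` stands for whatever calls your model with a given prompt; it is a placeholder you supply, and the `accuracy` helper and the sample data are assumptions for illustration.

```python
from typing import Callable, Iterable


def accuracy(
    classify: Callable[[str], str],
    labeled: Iterable[tuple[str, str]],
) -> float:
    """Fraction of (text, gold_label) pairs where classify(text) == gold_label."""
    pairs = list(labeled)
    if not pairs:
        raise ValueError("the labeled set is empty")
    correct = sum(1 for text, gold in pairs if classify(text) == gold)
    return correct / len(pairs)


if __name__ == "__main__":
    # Stand-in classifier for demonstration only; replace with a real model call.
    def always_neutral(text: str) -> str:
        return "neutral"

    labeled_set = [
        ("Export fails with error 500.", "neutral"),
        ("Fix this NOW or we cancel!!!", "angry"),
    ]
    print(accuracy(always_neutral, labeled_set))
```

Running the same labeled set through each prompt and model combination gives a like-for-like comparison before anything ships.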
How large should my test set be before trusting a prompt?
Aim for at least 100 to 200 hand-labeled examples that reflect your real distribution, including hard and ambiguous cases. Smaller sets produce noisy accuracy estimates that can make a worse prompt look better by chance.
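To see how noisy a small set is, you can put an interval around the measured accuracy. The sketch below uses the Wilson score interval, a standard choice for proportions; it is offered as one reasonable option rather than a method this article prescribes, and the function name is an assumption.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    for correct, total in [(40, 50), (160, 200)]:
        low, high = wilson_interval(correct, total)
        print(f"{correct}/{total}: {low:.2f} to {high:.2f}")
```

At the same 80% measured accuracy, 50 examples give an interval of roughly 0.67 to 0.89, while 200 examples narrow it to roughly 0.74 to 0.85. Two prompts whose intervals overlap heavily have not yet been shown to differ.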
Key Takeaways
- Define emotions and sentiment as behaviors and signals, not as topics or vocabulary
- Supply counter-examples — especially the calm complaint that should score neutral
- Allow multiple labels with intensity when your text contains mixed emotions
- Give the model an explicit "ambiguous" or "uncertain" option and route those to humans
- Require a supporting quote so every label is grounded and auditable
- Re-test exact phrasing on a labeled set whenever you change models