A model that should work but does not is one of the most demoralizing problems in machine learning. The architecture is standard, the team is competent, the data is plentiful, and yet the thing keeps misclassifying in ways nobody can explain. The instinct is to blame the model. The cause is usually the labels.
This is a composite case study drawn from a pattern that recurs constantly: a support-ticket classifier that stalled, the team's mistaken first diagnosis, the labeling overhaul that actually fixed it, and the numbers that turned around afterward. It is a story, but every beat reflects how these failures and recoveries really unfold.
Read it as a single connected arc, because the lesson is in the sequence. Ultimately, this data labeling case study is about looking in the right place.
The Situation: A Model That Would Not Improve
The team had built a classifier to route incoming support tickets into three buckets: refund request, technical issue, or general question. Routing the right tickets to the right teams would cut response times meaningfully.
The model plateaued at a mediocre accuracy and refused to climb. Adding more training data barely moved it. The team tried larger architectures, more epochs, and fancier features. Nothing helped, which was itself the clue: when more data and bigger models stop helping, the training signal itself is usually what is broken.
The error pattern held another clue the team initially overlooked. The mistakes were not spread evenly across all three categories. They piled up on a single pair, refund versus technical, while the general-question bucket performed fine. A localized error cluster like that rarely comes from model capacity, which would degrade everywhere at once. It usually comes from a specific defect in the data for those specific categories.
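One way to check whether errors are localized is to count mistakes per pair of categories. The following is a minimal sketch in plain Python, not the team's actual tooling; the function name and the toy labels are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

def confusion_pairs(true_labels, predicted_labels):
    """Count misclassifications per unordered pair of categories."""
    pairs = Counter()
    for true, pred in zip(true_labels, predicted_labels):
        if true != pred:
            # Sort so refund->technical and technical->refund share one key.
            pairs[tuple(sorted((true, pred)))] += 1
    return pairs

# Hypothetical toy data, for illustration only.
true_labels = ["refund", "technical", "general", "refund", "technical", "general"]
predicted = ["technical", "refund", "general", "technical", "technical", "general"]

pairs = confusion_pairs(true_labels, predicted)
total_errors = sum(pairs.values())
for pair, count in pairs.most_common():
    print(pair, count, f"{count / total_errors:.0%} of errors")
```

If one pair accounts for most of the errors while other categories are nearly clean, that points toward the data for that pair rather than toward model capacity.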
The Wrong Diagnosis
The first hypothesis was that the model was too small. The second was that the features were weak. The team spent two sprints on modeling changes and got nowhere.
The breakthrough came when an engineer pulled a random sample of one hundred tickets and labeled them by hand, cold, without seeing the existing labels. She then compared her labels to the dataset's. They disagreed nearly a third of the time. The training data was not occasionally teaching the model the wrong thing; it was constantly contradicting itself.
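The cold audit itself needs very little code. This sketch assumes labels are stored as dictionaries keyed by ticket ID; the function names, the fixed seed, and the toy data are illustrative assumptions, not details from the case.

```python
import random

def sample_ids(ticket_ids, k, seed=0):
    """Draw k ticket IDs uniformly at random; a fixed seed keeps the audit reproducible."""
    rng = random.Random(seed)
    return rng.sample(list(ticket_ids), k)

def disagreement_rate(existing, blind):
    """Fraction of audited tickets whose blind label differs from the stored label."""
    shared = existing.keys() & blind.keys()
    if not shared:
        raise ValueError("no tickets in common between the two label sets")
    disagreements = sum(1 for ticket in shared if existing[ticket] != blind[ticket])
    return disagreements / len(shared)

# Hypothetical toy data: stored labels versus one person's blind re-labels.
existing = {1: "refund", 2: "technical", 3: "general", 4: "refund"}
blind = {1: "refund", 2: "refund", 3: "general", 4: "technical"}

audit_ids = sample_ids(existing.keys(), k=3)
print("tickets to audit:", audit_ids)
print("disagreement rate:", disagreement_rate(existing, blind))
```

The important design choice is procedural rather than technical: the person re-labeling must not see the existing labels until the comparison step.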
The Real Problem Surfaces
Most disagreements clustered on one boundary: tickets that were both a refund request and a technical issue. "It broke and I want my money back" had been labeled "refund" by some annotators and "technical" by others, with no rule to settle it. The schema had a hole, and the model was drowning in the contradiction. This is precisely the ambiguity failure described in our Seven Ways Teams Quietly Poison Their Training Data.
The Decision: Fix the Labels, Not the Model
The team paused all modeling work. They accepted that no architecture could learn a contradictory signal and committed to a labeling overhaul instead. This was uncomfortable, because it meant admitting two sprints had been aimed at the wrong target.
They followed a deliberate sequence, essentially the one in our Step-by-Step Approach to Data Labeling and Annotation Basics.
There was an organizational cost to this decision, and it is worth naming. Pausing modeling work to redo labels felt like going backward, and the team had to defend it to stakeholders who had been promised a shipping model. The lead made the case in plain terms: the model was learning to be confused because the data was confused, and no amount of additional modeling could teach it to resolve a contradiction the humans had never resolved. Framed that way, the overhaul stopped looking like a step back and started looking like the only step forward.
The Execution
First, they rewrote the schema to resolve the overlap explicitly: a ticket mentioning both a refund and a technical fault would be labeled "technical," because solving the fault often removed the refund request. The rule was arbitrary in the sense that either choice was defensible, but it was consistent, which is what mattered.
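A tie-break rule like this can also be written down as code, which makes the guideline unambiguous and testable. The sketch below encodes only the one overlap rule described above; the function name is hypothetical, and any combination the guidelines do not cover is deliberately rejected rather than guessed.

```python
def resolve_label(applicable):
    """Map the set of categories that apply to a ticket onto a single label."""
    applicable = set(applicable)
    if len(applicable) == 1:
        return next(iter(applicable))
    if applicable == {"refund", "technical"}:
        # The schema rule: a technical fault takes precedence over the refund request.
        return "technical"
    # Surface uncovered combinations so the guidelines get extended, not improvised.
    raise ValueError(f"no guideline rule covers this combination: {sorted(applicable)}")

print(resolve_label({"refund", "technical"}))
print(resolve_label({"general"}))
```

Raising on uncovered combinations mirrors the lesson of the story: an unresolved overlap should become a guideline update, not an annotator's private judgment call.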
Second, they ran a pilot. Three people re-labeled the same two hundred tickets independently. Initial agreement was poor, exactly as feared, so they calibrated, updated the guidelines with every disputed case, and ran a second pilot. Agreement climbed to a comfortable level.
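The story does not say which agreement metric the team used. One common choice for a small pilot is Cohen's kappa averaged over every pair of annotators, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python illustration under that assumption; the annotator names and labels are made up.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)

def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all pairs of annotators."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

# Hypothetical pilot: three annotators label the same four tickets.
pilot = {
    "annotator_1": ["refund", "technical", "general", "technical"],
    "annotator_2": ["refund", "refund", "general", "technical"],
    "annotator_3": ["refund", "technical", "general", "technical"],
}
print(round(mean_pairwise_kappa(pilot), 3))
```

Running the same calculation before and after calibration gives a concrete number for whether the guideline updates actually moved agreement.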
Third, they re-labeled the full training set against the new guidelines, seeding gold examples to keep annotators consistent. Finally, they ran a cold audit and documented the new accuracy of the labels themselves before retraining.
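Seeded gold examples are typically scored per annotator as labeling proceeds. The following is a minimal sketch of that check, assuming gold labels and submissions are stored as dictionaries keyed by ticket ID; the names and data are hypothetical.

```python
def gold_accuracy(gold, submitted):
    """Score each annotator on the seeded gold tickets they have labeled so far.

    gold: dict of ticket_id -> agreed label
    submitted: dict of annotator -> dict of ticket_id -> label
    """
    scores = {}
    for annotator, labels in submitted.items():
        seen = [ticket for ticket in gold if ticket in labels]
        if not seen:
            continue  # This annotator has not reached any gold tickets yet.
        correct = sum(labels[ticket] == gold[ticket] for ticket in seen)
        scores[annotator] = correct / len(seen)
    return scores

# Hypothetical data: two gold tickets mixed into each annotator's queue.
gold = {"g1": "technical", "g2": "refund"}
submitted = {
    "ann_a": {"g1": "technical", "g2": "refund", "t9": "general"},
    "ann_b": {"g1": "refund", "g2": "refund"},
}
print(gold_accuracy(gold, submitted))
```

A low score on gold tickets is a prompt to recalibrate that annotator against the guidelines before more labels accumulate.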
The Outcome
With clean, consistent labels and no change to the model architecture at all, accuracy jumped well past the old ceiling on the very first retrain. The refund-versus-technical confusion, which had dominated the error log, nearly vanished. Ticket routing finally delivered the response-time improvement the project had promised months earlier.
The most telling detail was where the improvement came from. The model's performance on the general-question bucket, which had never been the problem, barely changed. Almost the entire gain came from the two categories whose overlap had been resolved. That precise correspondence, fix the data for two categories and watch exactly those two categories improve, was the final confirmation that the diagnosis had been right. Model-capacity fixes do not produce that kind of surgical, localized improvement; data fixes do.
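Checking for this kind of localized gain means comparing per-category metrics before and after the retrain, not just overall accuracy. The sketch below uses per-class recall as the metric, which is an assumption; the toy predictions are invented for illustration and are not the team's results.

```python
from collections import Counter

def per_class_recall(true_labels, predicted_labels):
    """Recall for each category: correct predictions / tickets truly in that category."""
    totals, hits = Counter(), Counter()
    for true, pred in zip(true_labels, predicted_labels):
        totals[true] += 1
        if true == pred:
            hits[true] += 1
    return {label: hits[label] / totals[label] for label in totals}

def recall_delta(before, after):
    """Per-category change in recall between two evaluation runs."""
    return {label: after[label] - before[label] for label in before}

# Hypothetical evaluation set and predictions from the old and new models.
true_labels = ["refund", "refund", "technical", "technical", "general", "general"]
pred_before = ["technical", "refund", "refund", "technical", "general", "general"]
pred_after = ["refund", "refund", "technical", "technical", "general", "general"]

before = per_class_recall(true_labels, pred_before)
after = per_class_recall(true_labels, pred_after)
print(recall_delta(before, after))
```

Both runs should be scored against the same cleaned evaluation labels; otherwise the comparison mixes a model change with a labeling change.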
The deeper win was cultural. The team adopted the cold-audit habit before every retrain and never again spent a sprint tuning a model that was being starved of a coherent signal. The reasoning behind that habit is laid out in Labeling Habits That Separate Good Datasets From Lucky Ones, and the foundational why is in Why Your Model Is Only as Smart as Its Labels.
They also changed the order of their debugging instincts permanently. Before this episode, a stalled model triggered a hunt through architectures and features. Afterward, the first move was always to pull a random sample and re-label it blind, because that one cheap check would have saved them two sprints. The episode became the team's standard cautionary tale, retold to every new hire as a reminder to look at the data before blaming the model.
What the Story Generalizes To
Strip away the support-ticket specifics and the arc is universal. A competent team built a reasonable model, hit a wall, attacked the wall with modeling tools, and made no progress, because the wall was made of data, not architecture. The lesson is not "support tickets are tricky." It is that a model's accuracy is capped by the consistency of its labels, and no modeling technique raises a cap set by contradictory training data.
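The cap can be made concrete with a simplified model. Assume some fraction of tickets is genuinely ambiguous, that annotators label those tickets "refund" with a fixed probability and "technical" otherwise, and that every other ticket is perfectly learnable. Under those assumptions, the best any model can do when scored against those same labels is to always predict the majority label on the ambiguous tickets. The function and numbers below are hypothetical and are not measurements from the case.

```python
def accuracy_ceiling(ambiguous_fraction, p_first_label):
    """Best achievable accuracy against inconsistent labels, under the stated assumptions.

    ambiguous_fraction: share of tickets whose label depends on the annotator
    p_first_label: probability an ambiguous ticket received the first of two labels
    """
    if not 0.0 <= ambiguous_fraction <= 1.0 or not 0.0 <= p_first_label <= 1.0:
        raise ValueError("both arguments must be between 0 and 1")
    # On ambiguous tickets the best guess matches the label at most this often.
    best_on_ambiguous = max(p_first_label, 1.0 - p_first_label)
    return (1.0 - ambiguous_fraction) + ambiguous_fraction * best_on_ambiguous

# Hypothetical: 30% of tickets are ambiguous and annotators split them evenly.
print(accuracy_ceiling(0.3, 0.5))
# Once a consistent rule is applied, the same tickets stop limiting accuracy.
print(accuracy_ceiling(0.3, 1.0))
```

The second call shows why the choice of rule matters less than its consistency: once every ambiguous ticket gets the same label, the ceiling in this simplified model goes away.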
The practical instruction that falls out of this is cheap and almost universally skipped: when a model plateaus inexplicably, pull a random sample and re-label it blind before touching the architecture. That single check, which costs an afternoon, would have saved this team two sprints. It is the first thing to try, not the last, and the reason it works is that it directly measures the one thing every model depends on and nobody routinely inspects.
Frequently Asked Questions
How did the team finally realize labels were the problem?
An engineer re-labeled a random sample by hand and compared against the existing labels. The roughly one-third disagreement rate made the contradiction undeniable. That cold-audit move is the single most useful diagnostic when a model inexplicably plateaus.
Why did adding more data not help?
Because the new data was labeled under the same ambiguous schema and carried the same contradictions. Adding more contradictory examples cannot teach a coherent rule; it just reinforces the confusion. When more data stops helping, suspect the signal, not the quantity.
Was the refund-versus-technical rule objectively correct?
No, and that is the point. Either choice was defensible. What mattered was picking one and applying it consistently, because the model can learn any consistent rule but cannot learn a contradiction.
Could better modeling have salvaged the original labels?
No. A contradictory signal has an accuracy ceiling no architecture can break. The two sprints of modeling work were doomed before they started, which is why diagnosing the data first saves enormous effort.
What habit did the team keep afterward?
A cold audit before every retrain: pull a random sample, label it blind, and compare. It catches schema rot and drift early and prevents the exact trap that cost them two sprints.
Key Takeaways
- When more data and bigger models stop helping, suspect the labels, not the architecture.
- A cold blind re-label of a random sample is the fastest way to expose contradictory data.
- Schema holes on overlapping categories quietly cap a model's accuracy.
- Any consistent rule beats an objectively "correct" but inconsistently applied one.
- Fixing labels alone broke the ceiling with zero changes to the model.