Once you can train a detector that scores well on a clean test set, you discover that the interesting problems are only beginning. The model that hits ninety mAP in the lab starts missing tiny objects in the corner of the frame, drawing three overlapping boxes around one car, and silently degrading two months after deployment as the world shifts beneath it. These are not beginner mistakes. They are the structural hard parts of the field, and handling them is what separates a practitioner from a tutorial-follower.
This piece assumes you already understand how AI detects objects in images at the level of bounding boxes, anchors, IoU, and the standard metrics. We are going past that, into the edge cases and expert nuances that determine whether a model survives contact with reality.
If any of the fundamentals feel shaky, our object detection trade-offs guide is the right place to firm them up first. Otherwise, let us get into the hard parts.
The Small-Object Problem
Detecting small objects is one of the most persistent failures in the field, and it has a structural cause. As an image passes through a deep network, spatial resolution shrinks layer by layer. By the time the network reaches the deep, semantically rich features, a small object may occupy a single feature-map cell or vanish entirely. At a total stride of 32, for example, a 16-by-16-pixel object covers less than one cell. The model literally cannot see what is no longer in the feature map.
Techniques that help
- Feature pyramids. Combining high-resolution shallow features with semantically rich deep features lets the model detect at multiple scales, which is the standard answer to small objects.
- Higher input resolution. Feeding larger images preserves small-object detail, at a direct cost in compute and latency that you must budget for.
- Tiling. Splitting a large image into overlapping crops, detecting in each, and merging results effectively magnifies small objects, at the price of more inference passes.
None of these is free, and choosing among them is a trade-off against the latency budget. The way to know whether the cure worked is rigorous per-class, per-scale measurement, which our metrics guide details.
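As a rough illustration of the tiling remedy, the sketch below computes overlapping tile windows and shifts tile-local detections back into full-image coordinates. The function names, the 640-pixel tile size, and the 128-pixel overlap are assumptions chosen for the example, not recommendations, and the detector itself is left out.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles that cover [0, length) with the requested overlap."""
    if tile >= length:
        return [0]
    step = tile - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    origins = list(range(0, length - tile, step))
    origins.append(length - tile)  # final tile sits flush with the far edge
    return origins


def make_tiles(width, height, tile=640, overlap=128):
    """Return tile windows as (x1, y1, x2, y2) in full-image pixel coordinates."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


def to_image_coords(tile_window, detections):
    """Shift tile-local boxes (x1, y1, x2, y2, score) into full-image coordinates."""
    ox, oy = tile_window[0], tile_window[1]
    return [(x1 + ox, y1 + oy, x2 + ox, y2 + oy, s)
            for x1, y1, x2, y2, s in detections]
```

An object that falls in the overlap between two tiles will be detected twice, so the merged list still needs a duplicate-removal pass before it is used.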
Taming Non-Maximum Suppression
When a detector fires multiple overlapping boxes for one object, non-maximum suppression (NMS) is the cleanup step that keeps the best box and discards the rest. It is also a frequent source of subtle bugs.
Where NMS goes wrong
In crowded scenes, where real objects genuinely overlap, standard NMS can suppress a correct box for a second object because it overlaps the first. The result is missed detections precisely where density is highest, the worst place to fail. Softer variants of NMS, which decay the scores of overlapping boxes rather than deleting them, mitigate this, and end-to-end transformer-based detectors that predict a set of boxes directly sidestep the problem by removing NMS from the pipeline entirely. If your model fails in crowds while scoring well on sparse scenes, the NMS step and its overlap threshold are the first things to inspect.
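To make the difference concrete, here is a minimal pure-Python sketch of greedy NMS next to a Gaussian score-decay variant in the style of Soft-NMS. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the threshold and sigma values are illustrative assumptions rather than tuned settings.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hard_nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: drop any box overlapping an already kept box above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian score decay: overlapping boxes lose score instead of being deleted."""
    scores = list(scores)  # work on a copy so the caller's list is untouched
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_thresh:
            break
        keep.append((best, scores[best]))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return keep  # list of (index, decayed score), best first
```

With two genuinely overlapping objects, the greedy version removes the second box outright, while the decay version keeps it at a reduced score and leaves the final cut to a downstream confidence threshold.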
Surviving Distribution Drift
A detector is a snapshot of the world as it looked in your training data. The world keeps moving. New product packaging, seasonal lighting, a relocated camera, or a different population of inputs can each quietly erode performance long after deployment.
Building drift resistance
- Monitor live performance. Track production metrics continuously so you detect decline before users report it, not after.
- Maintain a refresh loop. Periodically sample real production images, label the hard ones, and retrain. Drift is not a one-time fix; it is an ongoing operation.
- Watch the confidence distribution. A shifting distribution of confidence scores is often an early warning of drift, visible before accuracy formally drops.
Most detection projects actually fail here, in operational discipline, rather than in the modeling. The full list of these traps appears in our breakdown of common object detection mistakes.
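One way to watch the confidence distribution is to compare a histogram of recent production scores against a reference histogram captured at deployment time. The sketch below uses the population stability index (PSI) for that comparison. The choice of PSI, the ten-bin histogram, and the function names are assumptions for illustration; any divergence measure tracked consistently over time would serve the same purpose.

```python
import math


def score_histogram(scores, bins=10, eps=1e-6):
    """Fraction of confidence scores (expected in [0, 1]) falling in each bin."""
    if not scores:
        raise ValueError("need at least one score")
    counts = [0] * bins
    for s in scores:
        idx = min(max(int(s * bins), 0), bins - 1)  # clamp out-of-range scores
        counts[idx] += 1
    # eps keeps empty bins from producing log(0) or division by zero
    return [max(c / len(scores), eps) for c in counts]


def population_stability_index(reference, live, bins=10):
    """PSI between two score samples; 0 means the histograms are identical."""
    p = score_histogram(reference, bins)
    q = score_histogram(live, bins)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

A sensible alert threshold depends on the application. Values around 0.1 and 0.25 are often quoted as rules of thumb for PSI, but they should be calibrated against your own history rather than taken as given.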
Squeezing Accuracy From Hard Data
When easy gains are exhausted, advanced practitioners turn to the data and the loss.
Hard-example mining
Rather than training equally on all examples, focus the model's attention on the cases it gets wrong. Loss functions designed to down-weight easy, abundant background and up-weight hard, rare objects directly address the extreme imbalance between foreground and background that plagues detection. This is often a larger lever than any architecture change.
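Focal loss is one well-known example of such a loss. The sketch below computes it for a single binary prediction so the down-weighting of easy examples is visible in the numbers. The defaults gamma = 2 and alpha = 0.25 are commonly used values, treated here as assumptions; a real training setup would use a vectorized version from its deep learning framework.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction.

    p: predicted probability of the foreground class, in (0, 1)
    y: ground-truth label, 1 for foreground and 0 for background
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-7), 1.0)              # guard against log(0)
    # (1 - p_t) ** gamma shrinks the loss of confident, correct predictions
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_background = focal_loss(0.01, 0)  # confidently correct: almost no loss
hard_foreground = focal_loss(0.10, 1)  # object scored at 0.10: large loss
```

Setting gamma to 0 removes the modulating factor and leaves ordinary alpha-weighted cross-entropy, which makes it easy to compare the two during an ablation.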
Test-time augmentation
For applications where accuracy matters more than latency, running the model on several transformed versions of an image (flipped, scaled, or cropped) and merging the results squeezes out extra accuracy. It multiplies inference cost, so reserve it for offline or high-stakes settings rather than real-time pipelines.
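The simplest case is a horizontal flip. The part that is easy to get wrong is mapping boxes predicted on the flipped image back into the original coordinate frame before merging. In the sketch below, `detect` is a hypothetical callable that takes an image (represented as a list of pixel rows) and returns (x1, y1, x2, y2, score) tuples; both the callable and the image representation are assumptions made for illustration.

```python
def hflip_image(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def unflip_boxes(boxes, width):
    """Map boxes predicted on a mirrored image back to original coordinates."""
    # Mirroring swaps the left and right edges: x1' = width - x2, x2' = width - x1
    return [(width - x2, y1, width - x1, y2, s) for x1, y1, x2, y2, s in boxes]


def detect_with_flip_tta(detect, image):
    """Run the detector on the image and its mirror, then pool the boxes.

    The pooled list contains near-duplicates of each object, so it still
    needs a merge step (for example NMS or box averaging) afterwards.
    """
    width = len(image[0])
    original = detect(image)
    mirrored = unflip_boxes(detect(hflip_image(image)), width)
    return original + mirrored
```

Each additional transform adds one more forward pass, which is where the multiplied inference cost comes from.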
Targeted synthetic data
For rare classes you cannot capture enough of in the wild, generating synthetic examples with automatic labels attacks the scarcity directly. Validate the gain on real held-out data, since models can learn to exploit synthetic artifacts that do not exist in reality.
Frequently Asked Questions
Why does my model detect large objects well but miss small ones?
Because deep networks progressively lose spatial resolution, small objects can shrink to a single feature-map cell or disappear in the deeper feature maps where the model does its richest reasoning. Feature pyramids, higher input resolution, and image tiling all address this by preserving or restoring small-object detail, each at a cost in compute that you trade against your latency budget.
When should I move away from standard non-maximum suppression?
When your model performs well on sparse scenes but misses objects in crowded ones. Standard NMS can suppress a correct box for a second object that legitimately overlaps a first, causing failures exactly where density is highest. Softer NMS variants or NMS-free transformer detectors are designed to avoid this failure and are worth evaluating in those cases.
How do I know if my deployed model is drifting?
Watch live performance metrics and the distribution of confidence scores over time. A gradual decline in accuracy or a shifting confidence distribution signals that production data has moved away from your training data. Catching this requires continuous monitoring; without it, drift is usually discovered only after users complain, which is far too late.
Is more data always the answer to poor accuracy?
Not always; the right data usually beats more data. Focusing on hard examples through targeted mining and loss design, addressing the foreground-background imbalance, and adding synthetic examples for genuinely rare classes often yields larger gains than indiscriminately labeling more easy images. Diagnose what the model fails on first, then collect or generate exactly that.
Key Takeaways
- Small-object detection fails structurally because deep networks lose spatial resolution; feature pyramids, higher resolution, and tiling are the standard remedies.
- Non-maximum suppression silently breaks in crowded scenes; soft NMS variants mitigate the problem, and NMS-free transformer detectors sidestep it.
- Distribution drift is an operational problem, not a modeling one; monitor live metrics and run a continuous label-and-retrain loop.
- Hard-example mining and imbalance-aware loss functions are often bigger accuracy levers than changing the architecture.
- Reserve test-time augmentation for offline use, and validate synthetic data against real held-out images before trusting it.