
Your Detector Says 92%. Ninety-Two Percent of What?

Agency Script Editorial

Editorial Team

October 22, 2023 · 8 min read

Someone tells you their detection model is ninety-two percent accurate, and your first question should be: ninety-two percent of what? Accuracy, as most people use the word, is almost meaningless for object detection. A model can score high simply by finding the easy objects and quietly missing the rare, small, or critical ones. The number on the slide is not lying, exactly. It is just answering a question nobody should be asking.

Understanding how AI detects objects in images is only half the job. The other half is measuring whether that detection is good enough to trust, and that requires a vocabulary most teams never learn properly. The gap between a model that demos well and a model that holds up in production is almost always a gap in measurement.

This guide defines the metrics that matter, explains how to instrument them so you get an honest signal, and shows you how to read that signal when the numbers disagree with each other. The payoff is not academic. Teams that measure well catch problems in evaluation, where they are cheap to fix. Teams that measure badly catch them in production, where they are expensive and embarrassing. The difference between those two outcomes is almost entirely a difference in which numbers you choose to watch.

Why a Single Accuracy Score Misleads You

Object detection has to get two things right at once: it must find the object (did a box land on it?) and it must locate it precisely (does the box actually fit?). A plain accuracy figure collapses both into one number and usually ignores the cost of false alarms entirely. In an imbalanced setting, where most of the image is background, a model can look impressive while being functionally useless for the thing you care about.

The fix is to measure precision and recall separately, then layer in spatial quality.

Precision and recall, kept apart

Precision answers: of the boxes the model drew, how many were correct? Recall answers: of the objects that were really there, how many did the model find? These two trade off against each other. Tighten the model to avoid false positives and recall drops; loosen it to catch everything and precision falls. Which one you weight depends entirely on the use case, a point our guide to object detection trade-offs explores in depth.
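To make the two definitions concrete, here is a minimal sketch in Python. The counts are hypothetical, and the function name `precision_recall` is ours for illustration; it assumes detections have already been sorted into true positives, false positives, and false negatives.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from detection counts.

    tp: boxes the model drew that matched a real object
    fp: boxes the model drew that matched nothing (false alarms)
    fn: real objects the model never found (misses)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 80 correct boxes, 20 false alarms, 40 missed objects.
precision, recall = precision_recall(tp=80, fp=20, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these made-up counts the model is right 80 percent of the time when it draws a box, yet it finds only two thirds of what is actually there. A single accuracy figure would hide that split.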

Intersection over Union, the spatial check

A box can be correct in class but sloppy in placement. Intersection over Union, or IoU, measures the overlap between the predicted box and the ground-truth box: the area the two boxes share divided by the area they cover together, so it runs from 0 (no overlap) to 1 (a perfect match). You pick an IoU threshold, often 0.5, above which a detection counts as a true positive. Raise that threshold and you demand tighter boxes, which matters enormously for robotics and measurement but less for a rough "is there a person here" alert.
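The IoU calculation itself is short. The sketch below assumes axis-aligned boxes in `(x1, y1, x2, y2)` corner format, which is one common convention and not the only one; the example coordinates are invented.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes produce no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (10, 10, 50, 50)      # hypothetical model output
ground_truth = (20, 20, 60, 60)   # hypothetical label
score = iou(predicted, ground_truth)
print(f"IoU = {score:.3f}")
```

Here the boxes share a 30 by 30 region out of a combined area of 2300, so the IoU is about 0.39 and the detection would not count as a true positive at a 0.5 threshold, even though the box visibly lands on the object.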

The Metrics That Actually Matter

Mean Average Precision

Mean Average Precision, or mAP, is the headline metric of the field for good reason. For each class, average precision is the area under that class's precision-recall curve; mAP is the mean of those values across classes and, in modern usage such as the COCO benchmark, is also averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. A higher mAP means the model is both finding objects and placing boxes well, across the whole distribution of difficulty. When you read a benchmark, mAP is the number to anchor on, but treat it as a summary, not the whole story.
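As a rough illustration of what sits behind the number, the sketch below computes average precision for one class at one IoU threshold, then averages across classes. It assumes each detection has already been marked as a true or false positive by a matching step, and it uses the all-point interpolation variant; benchmark toolkits differ in the details, so treat this as a simplified sketch and not a reference implementation. The scores and the "truck" AP value are invented.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class, one IoU threshold."""
    if num_ground_truth == 0 or not scores:
        return 0.0

    # Walk detections from most to least confident, tracking running counts.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Replace each precision with the best precision at any higher recall,
    # which removes the zigzag from the raw curve.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sum rectangle areas under the smoothed curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Plain mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical detections for one class with 4 labeled objects.
ap_car = average_precision(
    scores=[0.9, 0.8, 0.7, 0.6],
    is_true_positive=[True, False, True, True],
    num_ground_truth=4,
)
m_ap = mean_average_precision({"car": ap_car, "truck": 0.375})
print(f"AP(car)={ap_car:.3f} mAP={m_ap:.3f}")
```

Note that one object was never found, so recall tops out at 0.75 and the area beyond that point contributes nothing. That is how a miss pulls AP down even when every box drawn looks reasonable.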

Per-class precision and recall

mAP averages away the detail you most need. A model with strong overall mAP can be quietly terrible at one critical class. Always break the metric down per class. The pedestrian recall in a self-driving stack matters far more than the overall average, and only a per-class view exposes it.
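A per-class breakdown needs nothing more than keeping the counts separate. In this sketch the class names and counts are hypothetical, chosen so that a common class dominates a rare but critical one.

```python
def per_class_report(counts):
    """counts maps class name -> {"tp", "fp", "fn"}; returns precision and recall per class."""
    report = {}
    for name, c in counts.items():
        tp, fp, fn = c["tp"], c["fp"], c["fn"]
        report[name] = {
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        }
    return report


# Hypothetical counts: many cars, few pedestrians.
counts = {
    "car": {"tp": 950, "fp": 30, "fn": 50},
    "pedestrian": {"tp": 30, "fp": 5, "fn": 20},
}
report = per_class_report(counts)

# Pooled recall across both classes, the kind of figure a summary slide shows.
total_tp = sum(c["tp"] for c in counts.values())
total_fn = sum(c["fn"] for c in counts.values())
pooled_recall = total_tp / (total_tp + total_fn)

print(f"pooled recall: {pooled_recall:.2f}")
for name, row in report.items():
    print(f"{name}: precision={row['precision']:.2f} recall={row['recall']:.2f}")
```

The pooled recall comes out at about 0.93, while pedestrian recall is 0.60. The aggregate is technically correct and practically useless for the class that matters most.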

Latency and throughput

A model that is accurate but too slow fails just as surely as one that is fast but wrong. Measure inference time per image and frames per second on your actual deployment hardware, not on a benchmark rig. For real-time systems this metric is a pass-fail gate, not a nice-to-have.
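A simple timing harness is enough to get a first honest reading. The sketch below uses only the Python standard library; `fake_infer` is a hypothetical stand-in for your real inference call, and the throughput figure assumes images are processed one at a time with no batching.

```python
import statistics
import time


def benchmark(infer, inputs, warmup=3):
    """Time infer(x) for each input and return mean latency, p95 latency, and FPS."""
    # Warm-up runs are discarded so one-time setup costs do not skew the numbers.
    for x in inputs[:warmup]:
        infer(x)

    latencies_ms = []
    for x in inputs:
        start = time.perf_counter()
        infer(x)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    mean_ms = statistics.fmean(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "p95_ms": latencies_ms[p95_index],
        # Sequential, batch-size-one throughput derived from mean latency.
        "fps": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }


def fake_infer(image):
    """Hypothetical stand-in for a real model call; sleeps for about 2 ms."""
    time.sleep(0.002)
    return []


stats = benchmark(fake_infer, inputs=list(range(20)))
print(stats)
```

Reporting a tail figure such as p95 alongside the mean is a deliberate choice: a real-time system fails on its slow frames, not its average ones. Run the harness on the deployment device itself, since numbers from a development workstation will not transfer.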

Confusion at the class level

When the model does make mistakes, which classes does it confuse? A confusion matrix turns vague underperformance into a specific, fixable problem: the model keeps calling trucks cars, or mislabeling one product SKU as another. That specificity is what makes the metric actionable. A vague complaint that "the model is inaccurate" leads nowhere; a confusion matrix that shows two specific classes bleeding into each other points straight at a data fix, often a handful of clearer examples or a labeling-guideline tweak.
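One lightweight way to build that view is to count pairs of true and predicted labels. The sketch below assumes each predicted box has already been matched to a ground-truth box by location, so only the class labels remain to compare; the labels are invented.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs for already-matched detections."""
    return Counter(zip(true_labels, predicted_labels))


def top_confusions(matrix, n=3):
    """Return the n most frequent off-diagonal pairs, meaning actual mistakes."""
    errors = {pair: count for pair, count in matrix.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical matched detections.
true_labels = ["car", "truck", "truck", "car", "truck", "bus"]
predicted_labels = ["car", "car", "car", "car", "truck", "bus"]

matrix = confusion_counts(true_labels, predicted_labels)
worst = top_confusions(matrix)
for (true_name, predicted_name), count in worst:
    print(f"{true_name} predicted as {predicted_name}: {count}")
```

Sorting the off-diagonal entries puts the most common mistake first, which here is trucks being labeled as cars. That single line is usually a better starting point for a data fix than the full grid.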

The F1 score, used carefully

When you need a single number that balances precision and recall, the F1 score, their harmonic mean, is more honest than plain accuracy because it punishes a model that wins on one at the expense of the other. Use it as a convenient summary, but never let it replace looking at precision and recall separately, since two very different models can share an identical F1 while failing in opposite ways.
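The caveat about identical F1 scores is easy to demonstrate. The two models below are hypothetical, with precision and recall swapped.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical models that fail in opposite directions.
cautious = f1_score(precision=0.9, recall=0.6)       # misses objects
trigger_happy = f1_score(precision=0.6, recall=0.9)  # raises false alarms

print(f"cautious F1={cautious:.2f} trigger-happy F1={trigger_happy:.2f}")
```

Both come out at 0.72. One model lets defects escape and the other buries operators in false alarms, and F1 alone cannot tell you which one you have.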

How to Instrument Them Honestly

Good metrics come from good measurement discipline more than from clever math.

  • Hold out a representative test set. Your evaluation images must mirror real deployment conditions, including the hard lighting, occlusion, and angles your demo conveniently skipped.
  • Freeze the test set. Never tune against your final evaluation data. The moment you optimize toward it, it stops telling you the truth.
  • Track metrics over time. A model degrades as the world drifts away from its training data. Log production performance continuously so you catch the decline before users do.
  • Separate validation from test. Use validation data to tune and a sealed test set to judge. Conflating them is the fastest route to a model that looks great and fails in the field.
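One way to keep a test set frozen in practice is to derive each image's split from a hash of its identifier, so the assignment never changes between runs or as new data arrives. This is a sketch of one possible approach, not a prescribed method: the file names and percentages are hypothetical, and hashing only keeps the split stable. It does nothing to make the test set representative, which still depends on what you collect.

```python
import hashlib


def assign_split(image_id, val_pct=10, test_pct=10):
    """Deterministically map an image ID to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(image_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


# Hypothetical image identifiers.
image_ids = [f"img_{i:05d}.jpg" for i in range(1000)]
splits = {image_id: assign_split(image_id) for image_id in image_ids}

for name in ("train", "val", "test"):
    size = sum(1 for split in splits.values() if split == name)
    print(f"{name}: {size} images")
```

Because the assignment depends only on the ID, an image that lands in the test set stays there for good, and nobody can quietly move it into the tuning data.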

For a wider set of measurement habits worth adopting, our piece on object detection best practices extends these into a full workflow.

A Worked Example: The Model That Looked Great

Consider a hypothetical defect-detection model reported at ninety-one percent accuracy. On its face, impressive. Now break it apart the way this guide prescribes, and the story changes.

Decomposing the number

  • Overall precision is high, recall is low. The model rarely flags a good part as defective, but it misses a meaningful share of real defects. For quality control, that is the wrong trade-off, because an escaped defect reaches a customer.
  • Per-class detail exposes a weak spot. mAP looks healthy until you split it by defect type and find that the rarest, most serious defect class scores poorly, drowned out in the average by the common, trivial ones.
  • IoU reveals sloppy boxes. At a 0.5 threshold the model passes, but at 0.75 its score collapses, meaning its boxes are loosely placed, a problem if a downstream robot uses the box to position a tool.

None of this was visible in the headline accuracy figure. Each issue surfaced only because precision, recall, per-class mAP, and IoU were examined separately. This is the entire argument for a richer metric vocabulary in one example: the same model is either excellent or unacceptable depending on which question you ask, and only the decomposed view tells you which.
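The IoU part of this decomposition can be reproduced on a toy scale. The sketch below uses invented boxes for a single class and a simple greedy matching rule, which is an assumption on our part; evaluation toolkits handle ties and multiple classes with more care.

```python
def iou(a, b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truths, iou_threshold):
    """Greedy one-to-one matching for a single class; returns (tp, fp, fn).

    predictions is a list of (confidence, box) pairs.
    """
    unmatched = list(ground_truths)
    tp = fp = 0
    # Most confident predictions get first pick of the ground-truth boxes.
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            tp += 1
            unmatched.remove(best)
        else:
            fp += 1
    return tp, fp, len(unmatched)


# Hypothetical labels and predictions: one tight box, two loose ones.
ground_truths = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
predictions = [
    (0.9, (0, 0, 10, 10)),    # exact match
    (0.8, (22, 20, 32, 30)),  # shifted sideways, IoU about 0.67
    (0.7, (41, 41, 51, 51)),  # shifted diagonally, IoU about 0.68
]

loose = count_matches(predictions, ground_truths, iou_threshold=0.5)
strict = count_matches(predictions, ground_truths, iou_threshold=0.75)
print("IoU 0.50 -> tp, fp, fn =", loose)
print("IoU 0.75 -> tp, fp, fn =", strict)
```

At 0.5 all three predictions count as hits. At 0.75 only the exact match survives, and each loose box becomes both a false positive and a missed object. The model has not changed; only the question has.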

Reading the Signal When Numbers Disagree

Metrics will contradict each other, and that contradiction is information. High precision with low recall means the model is cautious and missing things; loosen the confidence threshold or gather more examples of the missed class. Low precision with high recall means it is trigger-happy; tighten the threshold or hunt down the false-positive sources. Strong mAP but poor per-class numbers on a critical category means your average is hiding a failure you cannot afford. The real-world consequences of misreading these signals show up vividly in our collection of object detection use cases.
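Sweeping the confidence threshold shows the lever in action. In this sketch each detection is a hypothetical `(confidence, is_true_positive)` pair, and it assumes the true-positive flags were fixed by an earlier matching step, which is a simplification.

```python
def sweep_thresholds(detections, num_ground_truth, thresholds):
    """Return (threshold, precision, recall) for each confidence cutoff."""
    rows = []
    for t in thresholds:
        kept = [is_tp for confidence, is_tp in detections if confidence >= t]
        tp = sum(kept)
        # With no detections kept, precision is undefined; report 0.0 here.
        precision = tp / len(kept) if kept else 0.0
        recall = tp / num_ground_truth if num_ground_truth else 0.0
        rows.append((t, precision, recall))
    return rows


# Hypothetical detections against 6 labeled objects.
detections = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, True), (0.20, False),
]
rows = sweep_thresholds(detections, num_ground_truth=6, thresholds=[0.25, 0.5, 0.85])
for threshold, precision, recall in rows:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

As the cutoff rises from 0.25 to 0.85, precision climbs from about 0.71 to 1.00 while recall falls from about 0.83 to 0.33. Neither end is right in general; the table simply shows what each setting costs.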

Frequently Asked Questions

What is a good mAP score?

There is no universal threshold. mAP is only meaningful relative to your dataset, your IoU settings, and your task. A score that is excellent for a cluttered outdoor scene might be poor for clean product photos. Compare against a baseline on your own data and against the accuracy your use case actually requires, never against a number from an unrelated benchmark.

Should I optimize for precision or recall?

It depends on the cost of each error type. When a false negative is dangerous, such as missing a tumor or a pedestrian, prioritize recall. When a false positive is costly or annoying, such as a security system that cries wolf, prioritize precision. Most teams should set a hard floor on whichever matters most and then maximize the other.
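The "hard floor, then maximize the other" rule translates directly into a small selection step. The candidate rows below are hypothetical `(threshold, precision, recall)` measurements, and the floor of 0.65 is an arbitrary example value.

```python
def pick_threshold(candidates, recall_floor):
    """Among rows meeting the recall floor, return the one with the best precision.

    candidates is a list of (threshold, precision, recall) tuples.
    Returns None when no candidate clears the floor.
    """
    eligible = [row for row in candidates if row[2] >= recall_floor]
    if not eligible:
        return None
    return max(eligible, key=lambda row: row[1])


# Hypothetical measurements from an evaluation run.
candidates = [
    (0.25, 0.71, 0.83),
    (0.50, 0.80, 0.67),
    (0.85, 1.00, 0.33),
]

choice = pick_threshold(candidates, recall_floor=0.65)
impossible = pick_threshold(candidates, recall_floor=0.90)
print("chosen:", choice)
print("with a 0.90 floor:", impossible)
```

Returning `None` when nothing qualifies is intentional. If no threshold meets the floor, the answer is a better model or more data for the missed cases, not a quieter alarm.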

Why does my model score well in testing but poorly in production?

Almost always because the test set did not represent production conditions, or because the world has drifted since training. New lighting, new camera angles, or new object variants all degrade performance. The fix is a representative, regularly refreshed test set plus continuous monitoring of live metrics.
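Continuous monitoring can start as something very small. The sketch below keeps a rolling average of a live metric and flags when it dips under a floor. The class name, window size, floor, and daily recall values are all hypothetical, and it assumes you have some source of labeled production samples, such as a periodic human audit, to compute the metric from.

```python
from collections import deque


class RollingMetric:
    """Rolling mean of a metric with a simple alert floor."""

    def __init__(self, window, floor):
        self.values = deque(maxlen=window)  # old values drop off automatically
        self.floor = floor

    def mean(self):
        return sum(self.values) / len(self.values)

    def add(self, value):
        """Record a new reading; return True when the rolling mean is below the floor."""
        self.values.append(value)
        return self.mean() < self.floor


# Hypothetical daily recall measured on audited production samples.
daily_recall = [0.90, 0.89, 0.88, 0.80, 0.75, 0.70]
monitor = RollingMetric(window=3, floor=0.80)
alerts = [monitor.add(value) for value in daily_recall]
print(alerts)
```

The alert fires only on the final day, once the three-day average has fallen to 0.75. A rolling window trades a little delay for fewer false alarms from a single noisy reading; how much delay is acceptable depends on what a miss costs.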

Is the IoU threshold something I should change?

Yes, deliberately. A higher IoU threshold demands tighter, more accurate boxes and suits measurement, robotics, and counting tasks. A lower threshold tolerates looser boxes and suits coarse presence detection. Pick the threshold that matches how precise your downstream use of the box really needs to be.

Key Takeaways

  • A single accuracy number is misleading; measure precision and recall separately, then add spatial quality via IoU.
  • mAP is the right headline metric, but always break it down per class to expose failures the average hides.
  • Latency and throughput on real hardware are pass-fail gates for real-time systems, not afterthoughts.
  • Instrument honestly: use a frozen, representative test set, keep validation and test separate, and monitor production metrics over time.
  • When metrics disagree, read the pattern; the contradiction tells you exactly which lever to pull next.
