The wrong choice costs you six months. I've watched several teams make it. Here's how not to.
The actual distinction
Classification answers: what is in this image? You get a label and a confidence score. The model assumes the object of interest occupies most of the frame.
Detection answers: what is in this image and where is it? You get bounding boxes with labels. The model finds objects within a larger scene. Multiple objects per image, different sizes, different positions.
Segmentation answers: what is in this image, where is it, and what are the exact pixel boundaries? You get a pixel-level mask for each object instance. More precise, more computationally expensive.
These are not a hierarchy where segmentation is always better. They're different tools for different problems.
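The difference is easiest to see in the shape of the output. Here's a minimal sketch of the three output types as plain data structures (field names are illustrative, not any particular library's API):

```python
from dataclasses import dataclass

@dataclass
class Classification:
    """One label for the whole image; assumes the object fills the frame."""
    label: str
    confidence: float

@dataclass
class Detection:
    """A label plus a bounding box locating the object within a scene."""
    label: str
    confidence: float
    box_xyxy: tuple  # (x_min, y_min, x_max, y_max) in pixels

@dataclass
class Segmentation:
    """A label plus a per-pixel mask for one object instance."""
    label: str
    confidence: float
    mask: list  # H x W boolean grid; True where a pixel belongs to the object

# A classifier returns one result per image; a detector or segmenter
# returns a list, one entry per object found.
cls_out = Classification("genuine", 0.97)
det_out = [Detection("product", 0.91, (40, 60, 220, 310))]
```

Note what each level adds: detection adds the box, segmentation adds the per-pixel mask. Each addition costs data and inference time.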
The mistake teams make
They look at what a large vision model can do and assume they need the most sophisticated version. If the model can segment, why would you use classification? Because segmentation requires more data to train well, runs slower at inference, and adds complexity you may not need.
The question to ask is: what does my application actually need from the model output? Then choose the minimum sophistication that satisfies that need.
How to choose
Use classification if: the object of interest reliably occupies most of the image frame. Product authentication from a controlled camera setup where the product is centered. Defect detection where you've already cropped to the relevant area. Anything where image composition is controlled.
Use detection if: you need to find objects within a larger scene, handle multiple objects per image, or can't control image composition. Our logistics deployment uses detection because products arrive at angles, partially occluded, not centered. The detection step localizes the product before authentication happens.
Use segmentation if: you need pixel-level precision. Measuring surface area. Detecting defects at specific spatial locations. Any application where bounding boxes are insufficiently precise.
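The three criteria above collapse into a short decision helper. This is an illustrative condensation of the rules, not a formula from any framework; the parameter names are mine:

```python
def choose_task(composition_controlled: bool,
                multiple_or_uncentered_objects: bool,
                needs_pixel_precision: bool) -> str:
    """Pick the minimum sophistication that satisfies the application."""
    if needs_pixel_precision:
        return "segmentation"   # bounding boxes aren't precise enough
    if multiple_or_uncentered_objects or not composition_controlled:
        return "detection"      # must localize objects before anything else
    return "classification"     # object reliably fills the frame

# Controlled camera, centered product, box precision is enough:
assert choose_task(True, False, False) == "classification"
# Logistics conveyor: angles, occlusion, no control over framing:
assert choose_task(False, True, False) == "detection"
```

The order of the checks matters: pixel precision forces segmentation regardless of composition, and uncontrolled composition forces detection regardless of object count.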
Where we landed at Countercheck
We use detection. The YOLO architecture localizes the product in the frame, then a classification head authenticates it. Two stages because the deployment context requires it: we can't control where the product appears in the camera frame.
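The detect-then-authenticate flow looks roughly like this. The stubs below are hypothetical stand-ins, not our production code: `detect_products` would be a YOLO-style detector and `authenticate` a classification head, and the numbers are placeholders.

```python
def detect_products(frame):
    """Stage 1 (hypothetical stub): a detector returns (box, confidence)
    pairs localizing each product, wherever it sits in the frame."""
    return [((40, 60, 220, 310), 0.91)]

def crop(frame, box):
    """Cut the detected region out of the frame (frame as nested lists)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in frame[y0:y1]]

def authenticate(product_crop):
    """Stage 2 (hypothetical stub): a classification head scores the crop."""
    return "genuine", 0.97

def pipeline(frame, det_threshold=0.5):
    """Detect first, then authenticate each localized product."""
    results = []
    for box, conf in detect_products(frame):
        if conf < det_threshold:
            continue  # skip low-confidence localizations
        label, score = authenticate(crop(frame, box))
        results.append({"box": box, "label": label, "score": score})
    return results
```

The point of the structure: stage 2 always sees a centered, cropped product, so the classifier gets the controlled composition that the camera can't guarantee.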
An early version of the architecture used classification only, with the assumption that the camera placement would center the product reliably enough. The assumption was wrong. The real logistics environment has too much variability in product positioning. Switching to detection cost us about six weeks.
That six weeks was avoidable. If we had mapped the deployment context more carefully before committing to an architecture, the camera placement assumption would have been identified as fragile. The lesson is to stress-test assumptions about input conditions before committing to an architecture that depends on them.
The segmentation question
We evaluated segmentation for specific use cases: detecting subtle label modifications that indicate counterfeiting, where the relevant information is at sub-bounding-box scale. The accuracy improvement was real. The inference latency increase made it unusable at line speed.
We solved this differently: higher resolution camera capture for the classification step rather than segmentation. The same information, captured at input rather than extracted at inference. Same outcome, better latency characteristics.
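Why higher-resolution capture works as a substitute: the label region covers the same fraction of the frame, but a sharper sensor gives the classifier many more pixels on it. A back-of-envelope sketch (the resolutions and coverage fraction are hypothetical, not our deployment numbers):

```python
def effective_pixels_on_label(capture_width, capture_height, box_frac):
    """Pixels the classifier actually gets for a label region
    occupying `box_frac` of the frame area."""
    return int(capture_width * capture_height * box_frac)

# Hypothetical: a label region covering 2% of the frame.
low_res  = effective_pixels_on_label(1280, 720, 0.02)    # 720p capture
high_res = effective_pixels_on_label(3840, 2160, 0.02)   # 4K capture
# 4K gives 9x the pixels on the same label region.
```

Nine times the pixel density on the label, with zero added inference-time cost beyond a larger input, versus a segmentation head that runs on every frame.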
The general principle
Choose the minimum sophistication that solves the actual problem. Then profile in your deployment environment before committing. The benchmark numbers for any of these approaches are measured on clean data under controlled conditions. Your deployment environment is neither clean nor controlled. Measure there.
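"Profile in your deployment environment" can be as simple as timing the model callable on your own captured frames and reading off percentiles, since tail latency is what gates line speed. A minimal sketch (the callable and sample count are placeholders):

```python
import statistics
import time

def profile_latency(model_fn, inputs, warmup=5):
    """Time per-input inference on your own data; report p50 and p95."""
    for x in inputs[:warmup]:
        model_fn(x)  # warm caches / lazy initialization before timing
    samples = []
    for x in inputs:
        t0 = time.perf_counter()
        model_fn(x)
        samples.append((time.perf_counter() - t0) * 1000.0)  # milliseconds
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }

# Usage with any callable; a dummy stand-in for a model:
stats = profile_latency(lambda x: sum(range(1000)), list(range(50)))
```

Run it on frames from the actual camera in the actual environment, not on a benchmark set, and compare p95 against your line-speed budget.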
The six months you lose from the wrong architecture choice isn't from implementing the wrong thing. It's from the discovery that it's wrong, the validation of the right approach, and the migration. Make the discovery before you've shipped.
With gusto, Fatih.