Not the architecture. The judgment. The things you can only learn by being on call when it breaks.
The pattern of production incidents
CV systems fail in predictable ways once you've seen enough of them. The distribution shifts. The model doesn't. A seasonal change in packaging, a supplier switch that slightly changes label materials, a camera firmware update that shifts the color space: all of these produce silent degradation in model performance before any alert fires.
Silent degradation is the dangerous failure mode. The system keeps running. The accuracy drops from 99.6% to 95%. Nobody notices until a client spots an anomaly in their monthly report. By that time, weeks of production data have been affected.
The lesson took a year to learn. We now run distribution monitoring on the input data, not just the model outputs. When the feature distribution of incoming images drifts beyond a set threshold from the training distribution, we alert. This fires before accuracy drops, not after.
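A minimal sketch of what that input-side check can look like. This uses the Population Stability Index over a scalar image feature (mean brightness is one common choice); the function names, the feature, and the 0.2 threshold are illustrative assumptions, not our exact implementation.

```python
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference (training-time) sample
    and a live sample of one scalar feature, e.g. mean image brightness."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    o_frac = np.histogram(observed, bins=edges)[0] / len(observed)
    # Floor empty bins so the log term stays finite.
    e_frac = np.clip(e_frac, 1e-6, None)
    o_frac = np.clip(o_frac, 1e-6, None)
    return float(np.sum((o_frac - e_frac) * np.log(o_frac / e_frac)))

def check_drift(reference, live, threshold: float = 0.2):
    """Alert when the live feature distribution drifts past the threshold.
    A PSI above roughly 0.2 is a common rule of thumb for meaningful shift."""
    score = psi(np.asarray(reference), np.asarray(live))
    return score, score > threshold
```

The point is what it monitors, not how: the check runs on incoming images before any prediction is scored, so it fires when the camera, lighting, or packaging changes, regardless of whether labels ever arrive.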
The thing about edge cases
At high volume, the long tail is not negligible. The inputs your model was never trained on, the damage patterns that don't exist in your training set, the combinations of factors that make no individual image unusual but that together produce a confident wrong prediction: these are rare per request and common per month.
The discipline is continuous evaluation, not periodic. We have a weekly review of the lowest-confidence correct predictions and the false positives. Every week. The review takes 30 minutes. It has caught more impending production issues than any automated monitoring.
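Assembling that weekly review set is mechanical once predictions and their eventual ground truth are logged. A sketch under stated assumptions: the record shape, the label values, and the helper name are hypothetical, and "false positive" here means flagged as a defect that turned out to be fine.

```python
from dataclasses import dataclass

@dataclass
class LoggedPrediction:
    image_id: str
    predicted: str    # model output, e.g. "defect" or "ok"
    actual: str       # ground truth from downstream verification
    confidence: float

def weekly_review_queue(logs, n_low_conf: int = 20):
    """Build the human review set: the N lowest-confidence correct
    predictions, plus every false positive from the week."""
    correct = [p for p in logs if p.predicted == p.actual]
    false_pos = [p for p in logs
                 if p.predicted != p.actual and p.predicted != "ok"]
    lowest = sorted(correct, key=lambda p: p.confidence)[:n_low_conf]
    return lowest + false_pos
```

The low-confidence *correct* predictions are the interesting half: they are the cases the model is closest to getting wrong, so they surface drift and new edge cases while the accuracy numbers still look fine.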
What clients actually care about after year three
In the first year, clients evaluate whether the system works. In year two, they evaluate whether it integrates into their processes. By year three, the baseline questions are settled. What clients care about by then is consistency and transparency.
Consistency: the system should behave the same in January and July. Changes in behavior need to be communicated in advance, not discovered. Every model update is now preceded by a client notification describing what changed and what to expect.
Transparency: when something is flagged, the client wants to understand why. Not a confidence score. An explanation. Building the anomaly explanation feature that I've been thinking about since 2023 is now urgent. The clients who have been with us longest are asking for it most directly.
What I'd tell someone starting a CV production system
Build the monitoring before you need it. The period between "the model is trained" and "the model is in production" is when your attention is on getting it deployed. That's also the last window where you can build monitoring infrastructure without the pressure of a live system depending on you.
Plan for model updates as a product change, not a technical change. A model update changes the system's behavior. Clients experience the behavior. Treat model updates the way you'd treat a feature change: communicate them, provide context, give clients time to adapt.
The three-year model is the one you've accumulated three years of production feedback on. It's not the same model you started with, and it's much better. Trust the process of continuous improvement more than any single architectural decision.
With gusto, Fatih.