
Getting to 99.6% accuracy in production

Not in a notebook. In a warehouse. Against images captured under fluorescent lights, of products on a conveyor belt moving at operational speed. Here's what it took.

What 99.6% actually means

At a million requests, 99.6% means 4,000 errors. Whether that's acceptable depends entirely on the cost of each error type. A false negative (counterfeit passes as authentic) has one cost. A false positive (authentic product flagged as counterfeit) has a different one. The aggregate accuracy number obscures which kind of errors you're making.
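A minimal sketch of that arithmetic. The per-error costs below are made-up illustrative figures, not our real numbers; the point is that the same 4,000 errors can carry very different total cost depending on how they split between false negatives and false positives.

```python
# Hypothetical cost model: the 4,000 errors hiding behind 99.6% accuracy
# are not interchangeable -- cost depends on the error-type split.
requests = 1_000_000
accuracy = 0.996
errors = round(requests * (1 - accuracy))  # 4,000 total errors

# Illustrative per-error costs (assumptions, not real figures):
cost_false_negative = 500.0   # counterfeit passes as authentic
cost_false_positive = 50.0    # authentic product stops the line

# Two splits, same aggregate accuracy, very different total cost:
for fn_share in (0.1, 0.9):
    fn = errors * fn_share
    fp = errors - fn
    total = fn * cost_false_negative + fp * cost_false_positive
    print(f"FN share {fn_share:.0%}: total cost ${total:,.0f}")
```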

We optimized for false positive rate first. Our enterprise clients in logistics cannot tolerate operational disruption from false alarms. An authentication system that stops the line to flag authentic products is worse than no authentication system. So the accuracy target was really a false positive constraint: keep it below a threshold where the operational cost is acceptable, then maximize true positive rate within that constraint.
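Picking the operating point under that constraint can be sketched as follows. The score distributions and the 0.1% FPR budget are assumptions for illustration; the idea is simply to place the decision threshold using authentic-item scores alone, then measure how many counterfeits it still catches.

```python
import numpy as np

def threshold_for_fpr(authentic_scores, max_fpr):
    """Lowest threshold t such that the fraction of authentic items
    scoring above t (false alarms) stays at roughly max_fpr."""
    return float(np.quantile(authentic_scores, 1.0 - max_fpr))

# Assumed score distributions, for illustration only:
rng = np.random.default_rng(0)
authentic = rng.normal(0.2, 0.10, 10_000)    # low counterfeit scores
counterfeit = rng.normal(0.8, 0.15, 10_000)  # high counterfeit scores

t = threshold_for_fpr(authentic, max_fpr=0.001)  # FPR budget: 0.1%
tpr = float(np.mean(counterfeit > t))            # TPR at that budget
print(f"threshold={t:.3f}, TPR at 0.1% FPR = {tpr:.3%}")
```

Tightening the FPR budget moves the threshold up and the true positive rate down; the "accuracy" you can quote is whatever TPR survives at the FPR the operation can tolerate.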

The 99.6% number is the true positive rate (detecting actual counterfeits) at a false positive rate that our clients have accepted. It's not the simpler accuracy metric. Be skeptical of anyone quoting CV accuracy without specifying which metric and at what operating point.

The data work

The model at 80% was a model problem. The model at 90% was still partly a model problem. The model at 95% and above was almost entirely a data problem.

The distribution of errors at high accuracy is not random. It clusters around specific product types, specific lighting conditions, specific damage patterns. Each cluster requires either more training data for that case or explicit handling in the preprocessing pipeline.
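A sketch of what that clustering looks like in practice, with hypothetical field names and records. Bucketing errors by the conditions they occurred under is enough to see that the counts concentrate rather than spread.

```python
from collections import Counter

# Hypothetical error records; field names and values are assumptions.
errors = [
    {"product": "SKU-A", "lighting": "fluorescent", "damage": "scuffed"},
    {"product": "SKU-A", "lighting": "fluorescent", "damage": "none"},
    {"product": "SKU-B", "lighting": "mixed",       "damage": "torn-label"},
    {"product": "SKU-A", "lighting": "fluorescent", "damage": "scuffed"},
]

# Bucket by (product, lighting); at high accuracy the counts cluster.
by_condition = Counter((e["product"], e["lighting"]) for e in errors)
for condition, count in by_condition.most_common():
    print(condition, count)
```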

We run a structured error review process: every week, pull the lowest-confidence correct predictions and the false positives from the previous week, categorize them, and route them to the appropriate fix. Some go back to data collection. Some go to preprocessing. Some reveal model architecture limitations. The process is boring and necessary.
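The review pull can be sketched roughly like this. The record fields, the routing table, and the cutoff are all assumptions; the shape is the point: false positives plus the lowest-confidence correct predictions go into one queue, and each item is routed by its failure category.

```python
def weekly_review(predictions, n_lowest=50):
    """Sketch of a weekly review pull. predictions: dicts with 'correct',
    'predicted', 'confidence', 'category' (hypothetical schema)."""
    false_positives = [p for p in predictions
                       if not p["correct"] and p["predicted"] == "counterfeit"]
    # Lowest-confidence *correct* predictions: near-misses worth a look.
    correct = sorted((p for p in predictions if p["correct"]),
                     key=lambda p: p["confidence"])
    review_queue = false_positives + correct[:n_lowest]
    # Hypothetical routing table: failure category -> fix owner.
    routes = {"lighting": "preprocessing",
              "rare-sku": "data-collection",
              "occlusion": "model"}
    return [(p, routes.get(p["category"], "triage")) for p in review_queue]
```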

The hardware environment work

Accuracy improvements in the lab don't always transfer to the field. We've had cases where a model update improved benchmark performance and degraded production performance because the benchmark didn't represent the specific conditions of a particular deployment.

The fix was building per-deployment evaluation sets. Each major client environment has an evaluation set built from images captured in that specific environment. Before a model update ships, it has to show improvement or no regression on each deployment's evaluation set, not just the general benchmark.
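The release gate reduces to a simple check, sketched below under assumed names: each deployment maps to its TPR at the agreed operating point, and the candidate ships only if no deployment regresses past a tolerance.

```python
def passes_release_gate(candidate_scores, baseline_scores, tolerance=0.0):
    """Ship only if the candidate matches or beats the baseline on every
    deployment's evaluation set. Scores: dict of deployment -> TPR at the
    agreed FPR operating point (names and metric are assumptions)."""
    regressions = {
        dep: (baseline_scores[dep], candidate_scores[dep])
        for dep in baseline_scores
        if candidate_scores[dep] < baseline_scores[dep] - tolerance
    }
    return len(regressions) == 0, regressions
```

A general-benchmark improvement that regresses even one deployment fails this gate, which is exactly the failure mode the per-deployment sets exist to catch.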

This is operationally expensive. It's also the thing that makes the accuracy claims credible.

What didn't work

More data without better data. We went through a phase of bulk data collection that produced large datasets with inconsistent quality. The model trained on them was worse than the model trained on fewer, higher-quality examples. Data quality dominates data quantity once you're above a certain volume threshold. We learned this expensively.

Automated data augmentation as a substitute for deployment-specific data. Augmentation is useful. It's not a substitute for actual images from the actual environment. Synthetic variation and real variation are different.

Where we are

99.6% is not the ceiling. The error analysis points to specific cases where improvement is possible. Some of those improvements are near-term. Some require hardware changes at the deployment site that are outside our control.

The number is meaningful because we know what it means and how it was measured. Numbers that don't come with measurement methodology aren't numbers. They're marketing.

With gusto, Fatih.