Speed as a product feature

Our average response time is 0.5 seconds. In the logistics industry, this is not a benchmark. It's a business requirement.

Why latency is load-bearing

A conveyor belt in a parcel sorting facility moves at a fixed rate. The authentication decision has to happen before the package reaches the divert point. The divert point is a physical location in the facility. The time between the camera trigger and the divert decision is bounded by the distance between those two points at the belt's speed.

In the facilities we work in, that window is between 400 and 800 milliseconds depending on the layout. Our p95 latency has to fit inside the minimum of that range across all deployments. 500ms average with acceptable tail latency does that. 1 second average does not.
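The derivation is simple enough to sketch. The layouts below are invented for illustration (real distances and belt speeds vary per facility); the point is that the tightest window across deployments becomes the budget for every deployment:

```python
def divert_window_ms(distance_m: float, belt_speed_m_s: float) -> float:
    """Time from camera trigger to divert point at a fixed belt speed."""
    return distance_m / belt_speed_m_s * 1000.0

# Hypothetical layouts: (camera-to-divert distance in metres, belt speed in m/s).
layouts = {"site_a": (0.8, 2.0), "site_b": (1.6, 2.0)}
windows = {site: divert_window_ms(d, v) for site, (d, v) in layouts.items()}

# The tightest window sets the latency budget for the whole fleet.
budget_ms = min(windows.values())  # 400 ms here
```

Nothing in that budget is negotiable: to loosen it you would have to slow the belt or move the divert point.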

This is a case where the performance requirement is not derived from user experience benchmarks or competitive comparison. It's derived from physics. The belt moves. The window is fixed. The system either fits inside it or it doesn't.

How we hit it

Edge inference. The alternative is sending each image to a cloud endpoint and waiting for the result. The round trip to any cloud region from a logistics facility adds 50-150ms of network latency before the model even runs, and inference itself takes another 150-300ms depending on the hardware. In the worst case that's 450ms against a 400ms window. Cloud inference doesn't fit.
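The budget arithmetic, written out with the numbers from this post (the 400ms figure is the tightest window across our layouts):

```python
budget_ms = 400  # tightest divert window across deployments

# Worst-case components, in milliseconds.
network_rtt_ms = 150   # round trip to the nearest cloud region
inference_ms = 300     # model inference on the slower hardware

edge_worst_ms = inference_ms                    # on-site: inference only
cloud_worst_ms = network_rtt_ms + inference_ms  # cloud: RTT + inference

edge_fits = edge_worst_ms <= budget_ms    # 300 <= 400: fits
cloud_fits = cloud_worst_ms <= budget_ms  # 450 <= 400: does not fit
```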

Running inference on hardware at the deployment site removes the network latency. The inference time is the latency. We optimize the model for the specific edge hardware at each deployment: different quantization levels, different input resolutions, different batch sizes. The goal is consistent latency, not maximum accuracy.
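A minimal sketch of what per-deployment tuning can look like. The hardware names, resolutions, and quantization values below are invented for the example; only the knobs themselves (quantization level, input resolution, batch size) come from the text:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EdgeProfile:
    """Per-site model configuration, tuned for consistent latency."""
    quantization: str      # e.g. "int8" for tighter hardware, "fp16" otherwise
    input_resolution: int  # square input side in pixels
    batch_size: int        # 1 keeps latency consistent; larger batches trade
                           # per-image latency for throughput

# Illustrative profiles keyed by hardware class, not real deployment configs.
PROFILES = {
    "jetson_class": EdgeProfile(quantization="int8", input_resolution=512, batch_size=1),
    "x86_with_gpu": EdgeProfile(quantization="fp16", input_resolution=768, batch_size=1),
}
```

Batch size 1 everywhere is the consistency-over-throughput choice: batching helps average throughput but widens the latency distribution, which is the wrong trade when the ceiling is a physical divert point.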

Model size tradeoffs. The most accurate version of our model doesn't fit the latency budget on the standard edge hardware. We maintain model variants: a full-size model for offline batch processing and a production model optimized for edge inference. The production model trades some accuracy for the latency characteristics the deployment requires. The accuracy tradeoff is acceptable because the alternative is not deploying at all.

Where speed is still a problem

Initialization latency. When an edge device reboots, the model takes time to load into memory. During that window, the system isn't available. We've reduced this but not eliminated it. For 24/7 operations, any initialization delay is a gap in coverage.

Model update latency. Deploying a new model version requires taking the inference service down briefly. We've built a warm-swap mechanism that brings up the new version before taking down the old one, but the handoff window still exists.
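The warm-swap idea reduces to loading the new model completely before releasing the old one, then making the handoff a single pointer flip. A minimal sketch, with the model loading stubbed out (the class and method names are illustrative, not our actual service):

```python
import threading

class ModelHandle:
    """Stand-in for a loaded model; real code would load weights here."""
    def __init__(self, version: str):
        self.version = version

class InferenceService:
    def __init__(self, initial: ModelHandle):
        self._model = initial
        self._lock = threading.Lock()

    def warm_swap(self, new_version: str) -> None:
        # Load the new model fully BEFORE touching the serving path.
        new_model = ModelHandle(new_version)
        # The handoff is one locked pointer flip; requests never see a
        # half-loaded model, but the flip itself is still a window.
        with self._lock:
            self._model = new_model

    def current_version(self) -> str:
        with self._lock:
            return self._model.version

service = InferenceService(ModelHandle("v1"))
service.warm_swap("v2")
```

The sketch makes the residual problem visible: the swap window shrinks to the pointer flip, but any request that arrives exactly during the flip still waits on the lock.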

Tail latency at full load. The p95 is within budget. Under sustained high-throughput conditions, the p99 sometimes isn't. The cause is thermal throttling on one edge hardware configuration during long-running operations, and we're working on that configuration.
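The failure mode is easy to reproduce with synthetic numbers: a distribution that is fast almost everywhere, with a throttling tail. The samples and the 600ms budget below are made up; the shape is the point.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[k]

# Synthetic run: 95 fast requests, 5 slow ones from thermal throttling.
latencies_ms = [480] * 95 + [900] * 5
budget_ms = 600

p95_ok = percentile(latencies_ms, 95) <= budget_ms  # 480 ms: inside budget
p99_ok = percentile(latencies_ms, 99) <= budget_ms  # 900 ms: blown
```

This is why a p95 SLO alone can hide the problem: five throttled packages in a hundred is five missed diverts.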

What this taught me about product requirements

Derived requirements are harder to negotiate than arbitrary ones. A stakeholder can be convinced to accept 600ms instead of 500ms when the latency target came from a benchmark or a competitor comparison. They can't be talked into it when the target came from the belt speed and the physical dimensions of their facility.

Understanding where a requirement comes from determines how hard it is and what the real tradeoff space looks like. Spend time on the derivation, not just the number.

With gusto, Fatih.