
Building for millions of requests: the architecture decisions that held

When request volume stopped being a number we celebrated and started being a number we engineered for.

We crossed that threshold this quarter with authentication requests. Here's what the architecture looks like and which decisions turned out to be right.

The early decisions

We made several architecture choices in the first few months that were either correct or lucky. Distinguishing between the two requires honesty.

Async processing from day one. Authentication requests go into a queue. Workers process them. Results are stored and returned via webhook or polling. We did this early because inference latency on edge hardware is variable, and synchronous request-response would have produced unpredictable API response times. With the async architecture, the API responds quickly with a queue acknowledgment, and inference time is decoupled from API latency.
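The pattern above can be sketched in a few lines. This is a minimal in-memory illustration, not our production code; the names (`enqueue_auth_request`, `process_queue`) and the in-process queue are stand-ins for a real message broker.

```python
# Minimal sketch of the async pattern: the API layer enqueues and
# acknowledges immediately; workers drain the queue at their own pace
# and store results for polling or webhook delivery. Illustrative only.
import queue
import uuid

request_queue: "queue.Queue[dict]" = queue.Queue()
results: dict[str, str] = {}  # request_id -> result, read by polling clients


def enqueue_auth_request(payload: bytes) -> str:
    """API layer: O(1) acknowledgment, independent of inference time."""
    request_id = str(uuid.uuid4())
    request_queue.put({"id": request_id, "payload": payload})
    return request_id  # the client polls (or receives a webhook) later


def process_queue() -> None:
    """Worker layer: drains whatever is queued, one request at a time."""
    while not request_queue.empty():
        req = request_queue.get()
        # stand-in for running inference on the payload
        results[req["id"]] = f"processed {len(req['payload'])} bytes"
        request_queue.task_done()
```

The point of the split is visible in the interface: `enqueue_auth_request` returns before any inference has run, so API latency never depends on model latency.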

This turned out to be the right decision. As volume scaled, we could scale workers independently of the API layer. We could add capacity where the bottleneck was without changing the interfaces. Burst traffic shows up as growing queue depth rather than failing requests.

Stateless workers. Each inference worker loads the model and processes requests independently. No shared state between workers. This made horizontal scaling trivial: add a worker, it comes up, it starts processing. Remove a worker, nothing breaks.
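The statelessness property is easiest to see as code. A sketch, with `Worker` and `load_model` as illustrative names: each worker holds its own model copy, and its output depends only on its input and that copy, never on another worker.

```python
# Sketch of a stateless worker: each instance loads its own model and
# keeps no cross-worker state, so adding or removing workers is safe.
# Worker and load_model are illustrative, not the production classes.

def load_model(version: str) -> dict:
    # stand-in for fetching model weights from an artifact store
    return {"version": version}


class Worker:
    def __init__(self, version: str):
        self.model = load_model(version)  # private copy, nothing shared

    def infer(self, payload: bytes) -> dict:
        # result is a function of (payload, model) only
        return {"model": self.model["version"], "size": len(payload)}


# Horizontal scaling is just constructing more workers; removing one
# from the pool breaks nothing, because nothing references it.
pool = [Worker("v1") for _ in range(3)]
```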

The decisions that required fixing

Model versioning was an afterthought. In the early months we deployed model updates by replacing the model file on the workers. This was fine at low volume and caused incidents at scale. When you have dozens of workers, a rolling model update that isn't coordinated produces a window where different workers are running different model versions. The inconsistency is usually harmless and occasionally produces noticeable output differences that are hard to debug.

We rebuilt the model serving layer to treat model versions as explicit deployable artifacts with gradual rollout. It took a sprint to do it and we should have done it earlier.
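One way to make gradual rollout concrete is deterministic traffic splitting: hash the request id into a bucket and route a configurable fraction to the candidate version. This is a sketch of the idea, not our serving layer; the function name and the percentage knob are assumptions.

```python
# Sketch of gradual rollout: a deterministic hash routes a chosen
# percentage of requests to the candidate model version. Because the
# routing is explicit and centralized, no worker ever runs an
# uncoordinated version. Names and parameters are illustrative.
import hashlib


def pick_model_version(request_id: str, stable: str, candidate: str,
                       candidate_pct: int) -> str:
    """Route request_id to candidate for candidate_pct% of traffic."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return candidate if bucket < candidate_pct else stable
```

Because the hash is deterministic, a given request id always lands on the same version during a rollout, which keeps retries and debugging sane.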

The database schema for storing results was designed for the original data volume. The queries that were fast at 10,000 requests per day became slow at 1,000,000. The schema wasn't wrong. The indexes weren't right for the new query patterns. And fixing indexes under load requires care: building an index on a live table can block writes unless the database supports concurrent index builds, so the fix has to be staged, not just deployed.

What scale reveals

Edge cases that exist at low volume but are rare enough to ignore become common at high volume. Input images that are corrupted, truncated, or in unexpected formats: at low volume you handle these gracefully and they're rare enough not to matter. At high volume, the error handling path is a significant portion of your traffic and it has to be as robust as the happy path.
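Treating the error path as first-class means rejecting bad inputs with a structured reason before they ever reach a worker. A minimal sketch, assuming JPEG and PNG inputs; the magic-byte checks here are illustrative and deliberately not exhaustive.

```python
# Sketch of input validation as a first-class path: corrupted, truncated,
# or unrecognized payloads get a structured rejection instead of crashing
# a worker mid-inference. Checks are illustrative, not exhaustive.

JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def validate_image(data: bytes) -> tuple[bool, str]:
    """Return (ok, reason) so rejections are countable and debuggable."""
    if not data:
        return False, "empty payload"
    if data.startswith(PNG_MAGIC):
        return True, "png"
    if data.startswith(JPEG_MAGIC):
        # JPEG streams end with the EOI marker; a missing one usually
        # means a truncated upload
        if not data.endswith(b"\xff\xd9"):
            return False, "truncated jpeg"
        return True, "jpeg"
    return False, "unrecognized format"
```

Returning a reason string rather than raising makes the rejection categories themselves measurable, which matters once the error path is a meaningful share of traffic.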

Monitoring that's sufficient at low volume becomes insufficient at scale. When you're processing hundreds of requests per day you can look at logs. When you're processing hundreds of thousands, you need aggregated metrics, anomaly detection, and alerting. We built the monitoring infrastructure later than we should have and ran blind for longer than was comfortable.
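The shift from "read the logs" to aggregated metrics is, at its core, a shift to counters and histograms you can alert on. A toy sketch of the idea; a real deployment would use an established metrics library, and these class and method names are assumptions.

```python
# Sketch of aggregated metrics: per-outcome counters plus a latency
# distribution, so alerting keys off error rate and tail latency rather
# than individual log lines. Illustrative only; not a real metrics client.
from collections import Counter


class Metrics:
    def __init__(self):
        self.counters: Counter = Counter()
        self.latencies_ms: list[float] = []

    def record(self, outcome: str, latency_ms: float) -> None:
        self.counters[outcome] += 1
        self.latencies_ms.append(latency_ms)

    def error_rate(self) -> float:
        total = sum(self.counters.values())
        return self.counters["error"] / total if total else 0.0

    def p99_ms(self) -> float:
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        return ordered[(99 * len(ordered)) // 100]
```

An alert on `error_rate()` or `p99_ms()` fires on the aggregate, which is exactly what individual log lines can't give you at hundreds of thousands of requests.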

What held without modification

The YOLO inference pipeline. The model runs on edge hardware with consistent latency characteristics. We've scaled the number of deployments significantly and the per-unit inference performance has been stable. The core technical bet has held.

The edge-first deployment architecture. Running inference at the deployment site rather than round-tripping to a cloud endpoint is the right call for logistics. The latency argument was clear from the start and the production data confirms it.

The honest scorecard

Right from the start: async processing, stateless workers, edge inference.

Should have done earlier: model versioning, schema design for scale, monitoring infrastructure.

Still working on: a cleaner multi-tenant model management system, better tooling for per-deployment model configuration.

The architecture that gets you to a million requests isn't the architecture you designed for a thousand. Some of the work is predictable and you can do it early. Some of it you genuinely can't anticipate. The skill is recognizing quickly which category a problem falls into.

With gusto, Fatih.