Cutting operational costs 30% while growing system usage 50%

Specific decisions. Not principles.

The context

Over the past twelve months, the number of live Wasteer installations grew from roughly twenty to thirty-plus. System usage (inferences per day, API requests, data stored) grew around 50%. Infrastructure costs grew about 5%. The gap between usage growth and cost growth is the result of several specific decisions, not general frugality.

Decision 1: model inference frequency

The most expensive inference is the one you don't need to run. Our initial architecture ran inference on every camera frame: continuous stream processing. This produced very high inference frequency and high cost.

The actual product requirement is fill-level monitoring and anomaly detection. Fill level changes slowly. A container that was 60% full at 8am is probably still around 60% full at 8:05am. We moved to triggered inference: run inference when a motion-detection heuristic detects a relevant change in the frame, plus a scheduled baseline check every N minutes.
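The trigger logic is simple enough to sketch. The following is a minimal illustration of the gating idea, not our production code; the motion threshold and baseline interval are made-up values:

```python
import time

class InferenceGate:
    """Decide whether a frame warrants an inference call.

    Fires when a motion score crosses a threshold, or when the
    scheduled baseline interval has elapsed. The threshold and
    interval below are illustrative, not production values.
    """

    def __init__(self, motion_threshold=0.15, baseline_interval_s=300):
        self.motion_threshold = motion_threshold
        self.baseline_interval_s = baseline_interval_s
        self._last_inference = float("-inf")

    def should_infer(self, motion_score, now=None):
        now = time.monotonic() if now is None else now
        triggered = motion_score >= self.motion_threshold
        baseline_due = (now - self._last_inference) >= self.baseline_interval_s
        if triggered or baseline_due:
            self._last_inference = now
            return True
        return False
```

Everything that returns False here is an inference you never pay for, which is where the ~65% reduction comes from.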

The triggered model reduced inference frequency by about 65% with no measurable reduction in detection quality. Fill level accuracy was unchanged because fill level doesn't change at frame rate. Anomaly detection improved slightly because we were no longer flooding the model with redundant frames where nothing had changed.

Decision 2: compute right-sizing

We were running more minimum instances on Cloud Run than the traffic pattern required. The original minimum instance count was set conservatively during the scaling push. Once we had six months of production traffic data, the actual p99 latency requirements pointed to a lower minimum instance count.

Systematic right-sizing of minimum instances across all services reduced the always-on compute spend by about 20%. This took half a day once we had the traffic data to make the decision confidently.
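The analysis behind a decision like this can be done in a few lines. A rough sketch using Little's law (in-flight requests = request rate x latency) at the p99 request rate; all numbers fed into it are hypothetical, and real right-sizing should also account for cold-start tolerance:

```python
import math

def p99(values):
    """Nearest-rank 99th percentile of a list of samples."""
    s = sorted(values)
    rank = math.ceil(0.99 * len(s))
    return s[rank - 1]

def min_instances_estimate(rps_samples, latency_s, concurrency_per_instance):
    """Estimate the always-on instance count from observed traffic.

    Applies Little's law at the p99 request rate: in-flight work
    divided by per-instance concurrency, rounded up. Illustrative
    sketch, not a substitute for load testing.
    """
    peak_rps = p99(rps_samples)
    in_flight = peak_rps * latency_s
    return max(1, math.ceil(in_flight / concurrency_per_instance))
```

The point of the six months of traffic data is that `rps_samples` becomes real instead of guessed.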

Decision 3: storage tiering

Raw camera frames are large. We were storing more of them than we needed for longer than we needed to. The product requirement for historical raw frames was: recent frames for debugging (7 days), anomaly frames indefinitely, everything else not needed after processing.

We implemented a lifecycle policy: raw frames move to nearline storage after 7 days if not flagged, to coldline after 30 days, and are deleted after 90 days unless flagged for retention. The anomaly frames stay in standard storage indefinitely. This reduced storage costs significantly with no impact on any customer-facing feature.
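In production this lives in the bucket's lifecycle configuration, but the decision table is small enough to write out. A sketch of the policy as code, with the age cutoffs taken from the numbers above:

```python
from enum import Enum

class Tier(Enum):
    STANDARD = "standard"
    NEARLINE = "nearline"
    COLDLINE = "coldline"
    DELETED = "deleted"

def frame_tier(age_days, flagged):
    """Storage tier for a raw frame under the lifecycle policy.

    Flagged frames (anomalies, retention holds) stay in standard
    storage indefinitely; everything else ages out: nearline at 7
    days, coldline at 30, deleted at 90.
    """
    if flagged:
        return Tier.STANDARD
    if age_days < 7:
        return Tier.STANDARD
    if age_days < 30:
        return Tier.NEARLINE
    if age_days < 90:
        return Tier.COLDLINE
    return Tier.DELETED
```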

Decision 4: edge model compression

The model running on edge hardware at each installation was the full-size production model. Model quantization (INT8) reduced the model size by 4x and inference time by 30% with less than 1% accuracy degradation on our evaluation set. The smaller model runs on lower-spec hardware, which reduced the hardware cost for new installations.
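The 4x size reduction falls directly out of the storage math: float32 weights become int8 values plus a scale factor. A hand-rolled sketch of symmetric per-tensor quantization to show the mechanics; real deployments use a framework's post-training quantization tooling, not code like this:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization of float weights.

    Maps the largest absolute weight to 127 and scales the rest
    proportionally. Each value shrinks from 4 bytes to 1, which
    is where the 4x comes from.
    """
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; error is bounded by scale."""
    return [v * scale for v in q]
```

The evaluation week was about confirming that the rounding error here stays below the level the model cares about.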

This decision required a week of evaluation work to validate that the accuracy degradation was acceptable. It was. It should have been done earlier.

What didn't work

Aggressive caching of inference results. We tried caching inference outputs for similar-looking frames. The cache hit rate was too low to justify the caching complexity. Camera frames vary enough between captures that similarity detection is harder than it sounds.
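To make the failure concrete, here is the shape of the thing we tried, heavily simplified: cache inference results under a coarse frame signature, so near-identical frames share a result. The bucketing scheme below is a stand-in for the real similarity detection, which was considerably more involved and still didn't collide often enough:

```python
class FrameCache:
    """Cache inference results keyed by a coarse frame signature.

    The signature quantizes pixel values into buckets; frames with
    the same bucket pattern share a cached result. Real camera
    frames rarely land in the same buckets, which is why the hit
    rate stayed too low to justify the complexity.
    """

    def __init__(self, bucket_size=32):
        self.bucket_size = bucket_size
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def _signature(self, pixels):
        return tuple(p // self.bucket_size for p in pixels)

    def get_or_run(self, pixels, run_inference):
        key = self._signature(pixels)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = run_inference(pixels)
        return self._cache[key]
```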

Spot instances for the inference workload. We explored whether spot/preemptible instances could reduce cloud costs. The inference workload is latency-sensitive enough that preemptions introduced unacceptable disruption. We stayed with on-demand.

The general pattern

Measure before optimizing. Every one of the successful decisions was made from production data, not from estimates. The triggered inference decision came from measuring how often consecutive frames actually differed. The right-sizing came from traffic data. The storage tiering came from measuring which frames were ever accessed after the initial processing.
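The measurement behind the triggered-inference decision is a one-function job. A sketch of the kind of analysis, run over historical frames; the difference threshold is an assumption here, not our measured value:

```python
def change_rate(frames, threshold):
    """Fraction of consecutive frame pairs whose mean absolute
    pixel difference exceeds the threshold.

    A low change rate is the evidence that per-frame inference is
    mostly redundant and a triggered model will not miss much.
    """
    changed = 0
    pairs = list(zip(frames, frames[1:]))
    for a, b in pairs:
        diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
        if diff > threshold:
            changed += 1
    return changed / len(pairs)
```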

Cost optimization without production data is guesswork. With it, it's engineering.

With gusto, Fatih.