Enterprise computer vision at production scale does not require enterprise infrastructure budgets. Here's what we built, what it cost, and what I'd change.
The training setup
We run training on a mix of owned hardware and cloud spot instances. The owned hardware is a small GPU cluster, four A100s, that we've had since mid-2021. The capital cost was significant, and it was the right decision. For a company that trains models continuously, the per-unit compute cost of owned hardware beats cloud at any utilization rate above roughly 40%. We're well above that.
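The breakeven logic is simple arithmetic. Here is a minimal sketch; the prices are illustrative assumptions, not our actual numbers:

```python
# Breakeven utilization: the fraction of the year the owned cluster must be
# busy for it to beat on-demand cloud pricing. All figures are placeholders.

def breakeven_utilization(owned_annual_cost: float,
                          cloud_rate_per_gpu_hour: float,
                          num_gpus: int) -> float:
    """Utilization above which owned hardware is cheaper than cloud."""
    hours_per_year = 24 * 365
    # What a year of fully utilized cloud GPUs would cost.
    cloud_full_cost = cloud_rate_per_gpu_hour * num_gpus * hours_per_year
    return owned_annual_cost / cloud_full_cost

# Example: EUR 35K/year amortized cluster cost, EUR 2.50 per GPU-hour, 4 GPUs.
u = breakeven_utilization(35_000, 2.50, 4)
print(f"breakeven utilization: {u:.0%}")  # prints "breakeven utilization: 40%"
```

With those assumed numbers the crossover sits near 40% utilization; plug in your own amortized cost and cloud rate to get yours.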
The cloud supplement is for burst capacity: large retraining runs when we update the base architecture, evaluation runs across the full deployment dataset suite, and experiments that need to run in parallel. We use spot instances and design training jobs to be resumable from checkpoints. Spot interruptions are frequent. Building for them is non-optional.
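"Resumable from checkpoints" boils down to two properties: the loop always starts from the last saved step, and checkpoint writes are atomic so an interruption mid-write never corrupts state. A minimal sketch (the file name and loop structure are illustrative, not our actual training code):

```python
# Spot-interruption-tolerant training loop: resume from the last checkpoint
# on restart, and write checkpoints atomically via rename.
import json
import os
import tempfile

CKPT_PATH = "checkpoint.json"  # illustrative; real state would include weights

def save_checkpoint(state: dict, path: str = CKPT_PATH) -> None:
    # Write to a temp file, then rename: os.replace is atomic, so a kill
    # mid-write leaves the old checkpoint intact rather than a half-written one.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path: str = CKPT_PATH) -> dict:
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}  # fresh run

def train(total_steps: int, checkpoint_every: int = 100) -> int:
    state = load_checkpoint()  # picks up where the interrupted run stopped
    for step in range(state["step"], total_steps):
        # train_step(batch) would run here
        if (step + 1) % checkpoint_every == 0:
            save_checkpoint({"step": step + 1})
    return total_steps
```

If the spot instance is reclaimed at step 4,321, the replacement instance reloads the step-4,300 checkpoint and loses at most `checkpoint_every` steps of work.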
The full training infrastructure, including the A100 cluster purchase amortized over three years, runs well under EUR 100K annually.
The inference setup
We run on edge hardware at deployment sites. The inference cost is paid by clients as part of the deployment hardware. The question for us is the spec: which hardware is sufficient for our model to meet each site's latency requirements.
We've standardized on the NVIDIA Jetson series for most deployments. The Jetson AGX Xavier is our current standard for high-throughput sites. The Jetson Xavier NX covers lower-volume deployments. The selection criterion is inference latency at the site's throughput requirement, not raw compute specs.
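The selection rule can be stated as code: the site's required throughput sets a per-frame latency budget, and you pick the cheapest SKU whose measured latency fits inside it. The latency and cost numbers below are placeholders, not real benchmarks:

```python
# Spec selection: cheapest SKU whose measured per-frame latency meets the
# site's throughput requirement. Figures are illustrative placeholders.

SKUS = [
    # (name, measured latency in ms per frame, relative cost)
    ("Jetson Xavier NX", 45.0, 1.0),
    ("Jetson AGX Xavier", 18.0, 2.2),
]

def pick_sku(required_fps: float):
    budget_ms = 1000.0 / required_fps  # max per-frame latency at that throughput
    viable = [sku for sku in SKUS if sku[1] <= budget_ms]
    if not viable:
        return None  # no standard SKU meets the requirement
    return min(viable, key=lambda sku: sku[2])[0]  # cheapest viable option

print(pick_sku(15))  # low-volume site  -> Jetson Xavier NX
print(pick_sku(40))  # high-throughput site -> Jetson AGX Xavier
```

The point is that raw TOPS never appears in the function: only measured latency against the site's frame budget does.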
This standardization matters. A diverse hardware fleet is an operational nightmare. Every hardware variant requires model optimization, testing, and deployment tooling. Standardizing on two SKUs was a deliberate decision that reduced operational overhead significantly.
The data infrastructure
Training data lives in cloud object storage with a local cache on the training machines. The pipeline from raw images to training-ready format runs on a small Kubernetes cluster that we also use for the API layer. Total compute cost for the data pipeline is minimal.
The labeling infrastructure is the expensive part. We use a combination of internal tooling and external labeling services. The internal tooling handles the structured part of labeling where we have clear rules. The external services handle volume when we have large batches of new data. Labeling costs run roughly EUR 0.10 to EUR 0.30 per image depending on complexity. At the volumes we work with, this is the largest variable cost in the infrastructure stack.
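To make "largest variable cost" concrete, a quick back-of-envelope at the quoted per-image rates (the image volume here is a made-up example, not our actual numbers):

```python
# Labeling spend range at EUR 0.10-0.30 per image, from the rates above.

def labeling_cost(num_images: int,
                  rate_low: float = 0.10,
                  rate_high: float = 0.30) -> tuple:
    """Return (low, high) EUR estimate for labeling num_images."""
    return num_images * rate_low, num_images * rate_high

low, high = labeling_cost(500_000)  # hypothetical annual batch
print(f"EUR {low:,.0f} - EUR {high:,.0f}")
```

At a hypothetical half-million images a year, that range dwarfs the compute line items above, which is why labeling dominates the variable-cost picture.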
What I'd change
The A100 cluster purchase timing. We bought before we fully understood our training patterns. In retrospect, staying on cloud for longer and buying owned hardware once we had 12 months of utilization data would have resulted in a better hardware spec for our actual workload.
The labeling tooling. We built internal tooling too early. The tooling is fine now but it cost us engineering time in a period when engineering time was the scarcest resource. We should have used off-the-shelf labeling platforms longer and built custom tooling only once the off-the-shelf limitations were actually blocking us.
The general principle
ML infrastructure costs are controllable with deliberate choices about owned versus cloud, standard versus custom hardware, and build versus buy on tooling. The teams that overspend do so because they build infrastructure for hypothetical future scale rather than actual current requirements.
Build for what you need now. The constraints that matter at 10x your current scale are not knowable from where you are. You will have more information and more resources when you get there.
With gusto, Fatih.