Benchmarks vs. production: why they tell you different things

MMLU scores don't tell you what happens when your prompt is ambiguous at 11pm and the model confidently gives you the wrong answer.

What benchmarks measure

Benchmarks measure model performance on a fixed set of tasks with known correct answers. MMLU: multiple-choice questions across knowledge domains. HumanEval: code completion against test suites. HellaSwag: sentence-completion plausibility. The scores are useful for comparing models on the specific capability each benchmark is designed to measure.

The benchmark captures a capability. It does not capture reliability in your deployment context. Those are different things.

What production reveals

Distribution shift. The model's training data distribution and your deployment data distribution don't match. The degree of mismatch determines how much the benchmark performance degrades in production. For general knowledge tasks on standard domains, the mismatch is small. For domain-specific tasks, specialized vocabulary, or unusual prompt structures, the mismatch can be large.

Prompt sensitivity. On benchmarks, the prompts are carefully designed and held constant across models. In production, prompts come from users, from upstream systems, from code that was written six months ago. Small variations in wording produce large variations in output. The benchmark doesn't show you this.
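One way to make this concrete is a small sensitivity probe: send semantically equivalent wordings of the same task through the model and measure how often the outputs agree. A minimal sketch, where `call_model` is a hypothetical stand-in (with deliberately built-in phrasing sensitivity) for your real client:

```python
# Prompt-sensitivity probe: same task, different wordings,
# measure pairwise agreement of the outputs.
from itertools import combinations

def call_model(prompt: str) -> str:
    # Placeholder for a real model call; this stand-in is
    # intentionally sensitive to surface phrasing.
    return "4" if prompt.endswith("?") else "four"

variants = [
    "What is 2 + 2?",
    "Compute 2 + 2.",
    "2 + 2 equals what?",
]
outputs = [call_model(v) for v in variants]
agree = sum(a == b for a, b in combinations(outputs, 2))
total = len(outputs) * (len(outputs) - 1) // 2
print(f"pairwise agreement: {agree}/{total}")  # pairwise agreement: 1/3
```

Low agreement across paraphrases is a warning the benchmark score won't surface.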

Failure mode distribution. The benchmark tells you the error rate. It doesn't tell you what the errors look like. A model with 90% accuracy that is wrong in random, low-stakes ways is different from a model with 90% accuracy that is wrong in systematic, high-stakes ways. For many production applications, the failure mode matters more than the aggregate rate.

Latency and cost at scale. Benchmarks say nothing about these. The model that performs best on MMLU may be the model that's too expensive to run at your query volume or too slow for your latency requirements. The production choice involves a performance-cost-latency tradeoff that benchmarks don't capture.
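The cost side of that tradeoff is a back-of-envelope calculation you can run before any quality evaluation. All volumes and prices below are made-up assumptions for illustration:

```python
# Back-of-envelope monthly cost: rule a model in or out on cost
# before spending time evaluating its quality.
def monthly_cost(queries_per_day: int, tokens_per_query: int,
                 price_per_million_tokens: float) -> float:
    tokens = queries_per_day * 30 * tokens_per_query
    return tokens / 1_000_000 * price_per_million_tokens

# Hypothetical: 50k queries/day, 2k tokens each,
# frontier model at $5.00 vs smaller model at $0.50 per 1M tokens.
print(monthly_cost(50_000, 2_000, 5.00))  # 15000.0
print(monthly_cost(50_000, 2_000, 0.50))  # 1500.0
```

A 10x monthly bill can disqualify the benchmark leader before accuracy even enters the discussion.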

What we do instead of trusting benchmarks

Build an evaluation set from our own data. Every time we evaluate a new model or a new version, we run it against 200-300 examples from our actual production workload with known correct outputs. The score on our evaluation set correlates with production performance. The benchmark score doesn't.
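The harness for this can be small. A minimal sketch, where `call_model` is a hypothetical stand-in for whatever client you actually use and the examples are toy placeholders:

```python
# Minimal eval harness: score a model against a curated set of
# examples with known correct outputs.
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    expected: str

def call_model(model: str, prompt: str) -> str:
    # Placeholder: replace with your real model client.
    return prompt.split()[-1]

def evaluate(model: str, eval_set: list[Example]) -> float:
    """Return the fraction of examples the model gets right."""
    correct = sum(
        call_model(model, ex.prompt).strip() == ex.expected
        for ex in eval_set
    )
    return correct / len(eval_set)

eval_set = [
    Example("Echo the last word: hello world", "world"),
    Example("Echo the last word: foo bar", "bar"),
]
print(evaluate("candidate-v2", eval_set))  # 1.0
```

The point is not the harness but the data: the examples come from your production workload, not from a public test set.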

Shadow testing. New model versions run in shadow mode for two weeks before production traffic switches to them. Shadow mode: the old model serves production, the new model receives the same requests, outputs are compared but not served. Discrepancies are reviewed. Regressions are caught before they affect clients.
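The request path for shadow mode can be sketched in a few lines. The model functions here are hypothetical stand-ins; in a real service the shadow call would run asynchronously so it can't add latency to the served response:

```python
# Shadow-mode sketch: the incumbent's output is served, the
# candidate sees the same request, mismatches are logged for review.
discrepancies: list[dict] = []

def old_model(request: str) -> str:
    return request.upper()

def new_model(request: str) -> str:
    # Stand-in with a deliberate behavioral difference.
    return request.upper() if "ok" in request else request.lower()

def handle(request: str) -> str:
    served = old_model(request)   # production path
    shadow = new_model(request)   # shadow path, never served
    if shadow != served:
        discrepancies.append(
            {"request": request, "served": served, "shadow": shadow}
        )
    return served

print(handle("ok then"))   # OK THEN
print(handle("not fine"))  # NOT FINE (discrepancy logged, old output served)
print(len(discrepancies))  # 1
```

Clients never see the candidate's output during the shadow window; the discrepancy log is what gets reviewed.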

Error taxonomy. We categorize errors by type, not just count. A new model might have a lower error rate overall but a higher rate of a specific error type that matters for our use case. The aggregate number would hide this.

The meta-point

Benchmark scores are inputs to a model selection decision, not outputs. They tell you something about general capability. They don't tell you whether the model will work for your specific application.

The teams that run into problems with LLMs in production are often the ones that shipped based on benchmark performance without validating on their own data. The evaluation work is unglamorous. It's also the difference between a deployment that works and one that doesn't.

With gusto, Fatih.