
Speech-to-text in production: latency, accuracy, and the tradeoffs nobody documents

A year of running STT on founder pitch recordings. What the benchmark numbers don't tell you.

We've been running speech-to-text in production for over a year now. Zillion Pitches processes founder pitch recordings, which means we have a reasonably large corpus of real-world STT results to draw conclusions from. Here's what the benchmark numbers don't tell you.

The accuracy number is not your accuracy number

Google Cloud Speech reports word error rates on clean audio with standard American English. Our corpus is founders from forty countries recording on whatever microphone their laptop has, in whatever room they happen to be in. The WER on our corpus is roughly double what the benchmark suggests. Not because the API is bad. Because benchmarks use studio audio and founders use MacBook Pros in co-working spaces.

Accented English is the main driver. The STT models have clearly been trained on data that skews heavily toward native speakers. A founder from Eastern Europe or Southeast Asia with strong technical command of English but a regional accent will generate a transcript that requires significant correction. We've been tracking this and the gap between native and non-native speaker accuracy on Google Cloud Speech is large enough to matter for our product.

Amazon Transcribe launched in 2017 and we evaluated it. The accuracy on accented speech is slightly better than Google in some cases, slightly worse in others. Not meaningfully different. The developer experience is worse. The async API model doesn't fit our latency requirements.

We're still on Google.

Latency is the real constraint

Streaming STT (sending audio in chunks and receiving partial transcripts in real time) works well for short utterances. For a three- to five-minute pitch recording, streaming introduces complexity without much benefit. We use the async batch API. The p50 turnaround is around 12 seconds for a five-minute file. The p95 is around 35 seconds.
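
The p50/p95 numbers come from logging turnaround time per job and computing percentiles over a window. A minimal sketch of the nearest-rank percentile math, with illustrative timings rather than our actual logs:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(samples)
    # Nearest-rank method: ceil(pct/100 * n), clamped to a valid index.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative turnaround times in seconds, one per transcription job.
turnaround_seconds = [9.8, 11.2, 12.1, 13.5, 14.0, 18.7, 22.3, 34.9]
p50 = percentile(turnaround_seconds, 50)  # 13.5
p95 = percentile(turnaround_seconds, 95)  # 34.9
```

The point is less the arithmetic than the habit: track the p95, not the average, because the p95 is what your slowest-feeling users experience.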

35 seconds is long enough to change how you design the user experience. We moved to a jobs model early: submit your pitch, come back when it's done. This turns out to be fine for our use case. It would not be fine if we were building a real-time transcription product.
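
The jobs model reduces to something like the sketch below: submit returns a job ID immediately, and the client polls for status. `JobStore` and `transcribe_fn` are hypothetical names for illustration, not our actual service; in production the work would go onto a queue rather than run inline.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TranscriptionJob:
    job_id: str
    status: str = "pending"  # pending -> done (or failed)
    transcript: Optional[str] = None

class JobStore:
    """In-memory job store: submit a recording, poll for the result later."""

    def __init__(self, transcribe_fn: Callable[[str], str]):
        self._jobs = {}
        self._transcribe = transcribe_fn  # e.g. a call into the batch STT API

    def submit(self, audio_path: str) -> str:
        job_id = uuid.uuid4().hex
        job = TranscriptionJob(job_id=job_id)
        self._jobs[job_id] = job
        # Production version: enqueue and return immediately.
        # Here we run the work inline to keep the sketch self-contained.
        try:
            job.transcript = self._transcribe(audio_path)
            job.status = "done"
        except Exception:
            job.status = "failed"
        return job_id

    def poll(self, job_id: str) -> TranscriptionJob:
        return self._jobs[job_id]
```

The UX consequence is the whole point: "submit your pitch, come back when it's done" is a product decision forced by a p95 of 35 seconds.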

Punctuation and formatting are underrated

The raw STT output is unpunctuated. No commas, no periods, no paragraph breaks. For downstream NLP tasks you need to add them, either via the API's automatic punctuation feature or via a separate punctuation restoration model. Google's automatic punctuation is good enough for our purposes. It fails on technical vocabulary, product names, and anything that doesn't appear frequently in its training data.

We added a post-processing step that handles the most common failures in our domain. Startup names and technical terms get mangled. "SaaS" comes out as "sas". "API" sometimes comes out as "a p I". These are fixable with a domain-specific lookup table. Not glamorous. Load-bearing.
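
The lookup-table fix is as unglamorous as it sounds. A minimal sketch, where the table entries and the regex approach are illustrative, not our actual list:

```python
import re

# Domain-specific corrections: common STT mangling -> intended term.
# These entries are examples; a real table grows from observed failures.
CORRECTIONS = {
    "sas": "SaaS",
    "a p i": "API",
    "a p is": "APIs",
}

def apply_corrections(text: str) -> str:
    fixed = text
    # Longest keys first so "a p is" is fixed before "a p i" can match inside it.
    for wrong in sorted(CORRECTIONS, key=len, reverse=True):
        # Word-boundary match, case-insensitive, since STT casing is unreliable.
        pattern = r"\b" + re.escape(wrong) + r"\b"
        fixed = re.sub(pattern, CORRECTIONS[wrong], fixed, flags=re.IGNORECASE)
    return fixed
```

For example, `apply_corrections("our sas product exposes an a p i")` yields "our SaaS product exposes an API". The longest-match-first ordering matters once the table has overlapping entries.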

Speaker diarization is not ready for production

Speaker diarization separates audio into segments by speaker, which is useful when multiple people are talking. In founder pitches this doesn't come up much, but we tested it on panel recordings from Draper events. The accuracy on overlapping speech and similar voices is poor enough that we turned it off. The state of the art on diarization is about two years behind general STT.

What I'd do differently

Evaluate accuracy on your actual corpus before committing to a provider. The benchmark numbers are marketing. Run a hundred samples from your real data through each API and measure what you care about. We would have made a different early decision if we had done this upfront instead of after we were already integrated.
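
Measuring WER on your own corpus doesn't require a provider's tooling; it's word-level edit distance against a reference transcript. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Hand-transcribe a hundred real samples once, run them through each candidate API, and compare the numbers you computed yourself. That one afternoon of grunt work is worth more than every benchmark page combined.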

Build the post-processing layer on day one. It is going to exist. The only question is whether you build it before you ship or after your first set of user complaints.

With gusto, Fatih.