IBM Watson: the gap between the demo and the integration

Three months building on Watson for Zillion Pitches. The gap between the demo and the integration is real, and nobody talks about it.

The demo is always good. That's not an accusation. IBM has put serious effort into the demos. The speech-to-text one streams cleanly. The tone analyzer returns confident results with one curl call. The NLU pipeline pulls entities, keywords, and sentiment from a paragraph of text in under a second. You watch the demo and think: yes, this is the right tool. Then you go integrate it.

We're building Zillion Pitches. The core feature is pitch review: founders record a short video pitch, we run analysis, and they get feedback on what's working and what isn't. For that to work we need speech-to-text to get the transcript, tone analysis to understand delivery, and NLU to pull out the structural elements. Watson had all three. IBM was pushing it hard. We decided to use it.

Three months in, here's what I've learned.

The authentication situation is a mess.

IBM is partway through migrating from service credentials to IAM. Whether you get one or the other depends on when you created your Bluemix account, which region you're in, and which service you're provisioning. The docs describe both flows. The SDK handles both. What the docs don't tell you is that if you're using service credentials for Watson Speech and IAM for Watson NLU in the same app, you're going to spend a day figuring out why one request is returning 401 and the other isn't. Nothing is wrong with your code. The problem is you're using two different auth patterns without knowing it.
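If you're hitting the same wall, the practical difference between the two flows is just the shape of the Authorization header. A minimal sketch in Python (the IAM token exchange is omitted here; the SDKs do it for you, this just shows why the two services reject each other's credentials):

```python
import base64

def basic_auth_header(username: str, password: str) -> dict:
    # Legacy service-credential style: HTTP Basic with the per-service
    # username/password you copied out of the Bluemix console.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def iam_auth_header(bearer_token: str) -> dict:
    # IAM style: you first exchange an API key for a short-lived bearer
    # token at the IAM token endpoint, then send that token instead.
    # A Basic header sent to an IAM-only service gets you the 401.
    return {"Authorization": f"Bearer {bearer_token}"}
```

Once you see it laid out like this, the mixed 401s stop being mysterious: one service wants Basic, the other wants Bearer, and neither error message says so.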

The rate limits are underdocumented.

There are tier limits, service limits, and then there are what I can only describe as soft limits that aren't documented anywhere but start triggering at certain volumes. We hit them at around 200 concurrent sessions during a demo event. Support confirmed they exist. The documentation has since been updated. It wasn't when we ran into them.
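Our workaround was to treat any rate-limit failure as retryable with exponential backoff plus jitter. A simplified sketch of that pattern (the error type and delay constants are illustrative, nothing here is Watson-documented behavior):

```python
import random
import time

class RateLimitError(Exception):
    """Raised by the caller when a request comes back rate-limited."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    # Retry fn on rate-limit errors, doubling the delay each attempt
    # and adding jitter so a burst of clients doesn't retry in lockstep.
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the failure
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

It doesn't make the soft limits documented, but it turns them from hard failures into slower responses, which is survivable during a demo event.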

The Tone Analyzer isn't a sentiment analysis tool.

This is a category error I made and I should have known better. The Tone Analyzer returns seven scores: anger, fear, joy, sadness, analytical, confident, tentative. It's designed for written text. It was trained on written text. When you run it on a speech-to-text transcript, you're running a written-language model on what is essentially a transcript artifact. Filler words change the scores. Disfluencies change the scores. A founder who says "um, yeah, I think we're, uh, seeing good traction" reads as more tentative than they actually are. We built a correction layer on top of it. We probably should have used NLU sentiment from the start.
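The first step of any correction layer like that is stripping disfluencies before scoring. A minimal sketch of the preprocessing idea; the filler list here is illustrative, not exhaustive:

```python
import re

FILLERS = {"um", "uh", "er", "you know"}  # illustrative list only

def strip_disfluencies(transcript: str) -> str:
    # Remove filler words and phrases (longest first, so multi-word
    # fillers go before their fragments), then collapse whitespace.
    # The point: don't let a written-language tone model score
    # speech artifacts it was never trained on.
    text = transcript.lower()
    for phrase in sorted(FILLERS, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(phrase)}\b,?\s*", "", text)
    return re.sub(r"\s+", " ", text).strip()
```

Run the founder's line through it and the "tentative" signal that came from the transcript artifact, rather than the delivery, mostly disappears.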

Latency is higher than the demo suggests.

The demo runs on IBM's internal infrastructure with demo keys that presumably have priority queuing. Our production keys do not. Speech-to-text for a three-minute pitch takes, depending on load, between 8 and 22 seconds. That's fine. It's a background job. What's not fine is building a UX that implies near-realtime feedback and then having to redesign the whole thing when you realize your p95 latency is 22 seconds.
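If you want to catch this before the redesign, nearest-rank p95 over your own job durations is enough. A quick sketch (sample values would be whatever your logs show, not anything from the Watson docs):

```python
import math

def p95(latencies_s: list) -> float:
    # Nearest-rank percentile: sort the observed durations and take
    # the value at rank ceil(0.95 * n). Crude, but it's the number
    # your slowest-twentieth user actually experiences.
    ordered = sorted(latencies_s)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]
```

Design the UX around that number, not the median, and certainly not the demo.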

The NLU is actually good.

I want to be fair here. The Natural Language Understanding endpoint for entity extraction and keyword detection works well. For pitch content, it pulls out company names, product terms, market categories. The sentiment on document-level text is reliable enough to correlate with the feedback we get from human reviewers. This is the part of Watson I'd use again.
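For reference, the request body for the NLU /v1/analyze endpoint is nothing exotic; it's something like this (the feature limits shown are illustrative choices, not required):

```python
def build_nlu_request(text: str) -> dict:
    # JSON body for POST /v1/analyze: the text plus a "features"
    # object naming what you want back. Entities, keywords, and
    # sentiment can all be requested in a single call.
    return {
        "text": text,
        "features": {
            "entities": {"limit": 10},
            "keywords": {"limit": 15},
            "sentiment": {},
        },
    }
```

One call, three analyses, sub-second responses: this is the part of the integration that behaved like the demo.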

What I'd do differently.

Use Watson NLU for text analysis. Use a different provider for STT. We've looked at Google Cloud Speech and it's faster, the transcription accuracy is better on accented speech (which matters when your users are founders from 40 countries), and the auth is one IAM credential that works everywhere in the Google stack. The Watson STT accuracy on standard American English is fine. On non-native speakers, it starts to fall apart in ways the benchmarks don't show.

IBM has put a lot of money into making Watson the enterprise AI brand. The product does what it says. The developer experience doesn't match the marketing. The gap between the demo and the integration is real, and nobody talks about it because most people who run into it are enterprise customers with IBM account reps who fix the problems for them.

We don't have an IBM account rep. We have Bluemix console access, a support ticket queue, and the knowledge that if something breaks at 2am, we're on our own.

Worth knowing before you build on it.

With gusto, Fatih.