Azure Cognitive Services in 2017: An Honest Review From a Startup CTO

We used them in production. Here's what the docs don't prepare you for.

We've been running Trendoline on Azure for several months now, and I've had time to experiment with Cognitive Services properly. These are Microsoft's pre-built AI APIs: vision, face detection, text analytics, language understanding, speech. The idea is that you get machine learning capabilities without training your own models.

That's a real value proposition. Here's where it holds and where it doesn't.

What works well

Text Analytics is solid for what it does. Sentiment analysis and key phrase extraction work reliably enough to be useful in a social app context. The API is clean, the latency is acceptable, and the results are good enough that I don't feel the need to train something custom.
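To make the shape of that API concrete, here's a minimal sketch of how we talk to the Text Analytics v2.0 sentiment endpoint: build a documents payload, parse scores out of the response. The endpoint URL, key, and response values are illustrative placeholders, not our actual production config, and I've left out the HTTP call itself.

```python
def build_sentiment_request(texts):
    """Build the JSON body the sentiment endpoint expects: a list of
    documents, each with a string id, a language hint, and the text."""
    documents = [
        {"id": str(i), "language": "en", "text": t}
        for i, t in enumerate(texts, start=1)
    ]
    return {"documents": documents}

def parse_sentiment_response(response_body):
    """Map document ids to sentiment scores (0 = negative, 1 = positive)."""
    return {doc["id"]: doc["score"] for doc in response_body["documents"]}

# POST build_sentiment_request(...) to
# https://<region>.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment
# with your Ocp-Apim-Subscription-Key header. The response looks like:
sample_response = {
    "documents": [
        {"id": "1", "score": 0.93},
        {"id": "2", "score": 0.11},
    ],
    "errors": [],
}
scores = parse_sentiment_response(sample_response)
```

The per-document id is what lets you batch multiple posts in one request and match scores back afterwards.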

Face detection also works well on decent quality images. We experimented with it for profile photo validation and it does the job.
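For the profile photo experiment, the detect endpoint returns a JSON array of faces, each with a bounding rectangle. A simple validation rule on top of that might look like this; the "exactly one face, at least this big" thresholds are our own heuristics, not anything the API prescribes.

```python
def is_valid_profile_photo(detect_response, min_face_px=80):
    """Heuristic check over a Face API detect response: accept the photo
    only if exactly one face was found and its bounding box is reasonably
    large. The minimum-size threshold is our own guess."""
    if len(detect_response) != 1:
        return False
    rect = detect_response[0]["faceRectangle"]
    return min(rect["width"], rect["height"]) >= min_face_px

# Shape of a detect response with one face found (faceId shortened):
sample = [
    {
        "faceId": "EXAMPLE-FACE-ID",
        "faceRectangle": {"top": 50, "left": 60, "width": 120, "height": 120},
    }
]
```

An empty array (no face) or multiple faces both fail the check, which covered most of the bad uploads we saw.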

Where the gap shows up

The accuracy numbers in the documentation are measured on benchmark datasets. Production data doesn't look like benchmark datasets. Real user-generated content is noisier, more ambiguous, and more diverse than anything a clean evaluation set captures.

Sentiment analysis on casual social media text is harder than on product reviews or news articles: short posts, sarcasm, slang, mixed languages. The model handles some of it and misses a lot of it. That's not a criticism exactly; it's just the reality of what you're buying.

The LUIS experiment

We tried Language Understanding (LUIS) for a feature that required interpreting user intent from short text inputs. The tooling for building and training the model is good. The problem is that intent classification requires enough labeled examples to be useful, and getting those examples is real work. You don't escape the data problem by using a managed service.
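Once the LUIS model is trained, the consuming code is simple: you query the endpoint and read off the top-scoring intent. Here's a sketch of the response-handling side, with a confidence threshold for falling back to a default path; the threshold and intent name are illustrative, not from our actual app.

```python
def resolve_intent(luis_response, threshold=0.5):
    """Pick the top intent from a LUIS query response, returning None
    when confidence falls below the threshold so the caller can fall
    back to a default flow. The threshold value is our own choice."""
    top = luis_response.get("topScoringIntent", {})
    if top.get("score", 0.0) < threshold:
        return None
    return top["intent"]

# Shape of a LUIS query response (intent name is a made-up example):
sample = {
    "query": "remind me tomorrow morning",
    "topScoringIntent": {"intent": "SetReminder", "score": 0.87},
    "entities": [],
}
intent = resolve_intent(sample)
```

The code is the easy part. The score only means something if the training examples behind it resemble what real users type, which is exactly the labeling work you still have to do.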

The honest summary

Cognitive Services is a good option for features where you need machine learning capabilities quickly and the task is common enough that a general-purpose model performs adequately. It saves a lot of time compared to building and deploying your own models.

For anything that requires high precision on domain-specific data, you will need to either fine-tune or train something yourself. The API gets you to a working prototype fast. Getting from prototype to production-quality is still your problem.

Worth having in the toolkit. Not a replacement for understanding what the models are actually doing.

With gusto, Fatih.