Sentiment analysis in production: the things the papers don't mention

The paper says 92% accuracy on SST-2. The production system says something different.

I've been running sentiment analysis in a real product for about eight months. Watson NLU for most of it, with some custom preprocessing on top. Here's what the papers don't prepare you for.

Accuracy is the wrong metric to optimize for

On a balanced test set with clean text, 92% sounds useful. In production, the distribution is not balanced and the text is not clean. Founders who pitch optimistic products use optimistic language regardless of whether their pitch is good. The baseline positive rate in our corpus is around 73%. On a corpus that's 73% positive, a model that labels everything positive gets 73% accuracy and is completely useless. Precision and recall on the minority class are the numbers that matter. They're rarely what the paper is reporting.
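To make that concrete, here's a toy calculation with synthetic labels (not our actual corpus) showing how the degenerate "always positive" baseline looks fine on accuracy and useless on the minority class:

```python
# Toy illustration: on a 73%-positive corpus, predicting "pos" for
# everything matches the base rate on accuracy but scores zero on
# the minority (negative) class. Synthetic labels, not real data.
import random

random.seed(0)
labels = ["pos" if random.random() < 0.73 else "neg" for _ in range(10_000)]
preds = ["pos"] * len(labels)  # degenerate model: always positive

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Precision/recall on the minority ("neg") class
tp = sum(p == "neg" and y == "neg" for p, y in zip(preds, labels))
fp = sum(p == "neg" and y == "pos" for p, y in zip(preds, labels))
fn = sum(p == "pos" and y == "neg" for p, y in zip(preds, labels))
neg_precision = tp / (tp + fp) if tp + fp else 0.0
neg_recall = tp / (tp + fn) if tp + fn else 0.0

print(f"accuracy: {accuracy:.2%}")        # ~73%, looks respectable
print(f"neg precision: {neg_precision}")  # 0.0 -- the model never fires
print(f"neg recall: {neg_recall}")        # 0.0
```

Accuracy tracks the base rate; the minority-class numbers tell you whether the model is doing anything at all.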

Domain shift is real and constant

The Watson model was trained on general web text, reviews, news articles. Pitch transcripts are a specific register: informal speech, optimistic framing, technical vocabulary, compressed timelines. Words that are positive in general English are neutral in pitch language ("exciting" appears in nearly every pitch, which makes it a meaningless signal). Words that are neutral in general English are negative in pitch language ("concern" appearing in a pitch context usually means the founder is pre-empting an objection, which is actually a good sign).

We adapted. We built a term weighting layer that adjusts scores based on domain-specific priors. It took about three months of iteration to stop making the pitch-language errors. Domain shift is not a one-time calibration. As our user base expanded, the language shifted again. Product founders talk differently than deep tech founders. US founders talk differently than European ones. Each cohort requires recalibration.
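The shape of that weighting layer is simple. Here's a minimal sketch; the weights and function names are made up for illustration, not our production values:

```python
# Sketch of a domain-prior weighting layer: shift the upstream
# model's sentiment score using per-term priors learned from the
# pitch corpus. All weights here are illustrative.

DOMAIN_PRIORS = {
    "exciting": 0.0,    # appears in nearly every pitch -> no signal
    "concern": +0.2,    # founder pre-empting an objection -> mildly good
    "struggling": -0.5,
}

def adjust_score(base_score: float, tokens: list[str]) -> float:
    """Shift the base score by the domain priors of the terms
    present, clamped to [-1, 1]."""
    shift = sum(DOMAIN_PRIORS.get(t.lower(), 0.0) for t in tokens)
    return max(-1.0, min(1.0, base_score + shift))

# A score of +0.6 driven by "exciting" gets no domain boost:
adjust_score(0.6, ["an", "exciting", "opportunity"])  # -> 0.6
```

The point of the layer is that it's cheap to recalibrate per cohort: swap the prior table, keep the model.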

The latency-accuracy tradeoff is underdocumented

Watson NLU at the free tier has latency that makes real-time analysis impossible. At the standard tier, p50 is acceptable and p95 is not. We moved sentiment to a background job. That was the right call architecturally, but it changed the product. Features that felt natural when analysis was instant feel clunky when there's a 15-second delay. The latency doesn't just affect performance. It affects what you can build.

We looked at running a local model to get latency back. The options in mid-2018 are: VADER (rule-based, fast, not great on domain-specific text), TextBlob (similar), or fine-tuning a model ourselves on our corpus. We're fine-tuning now. The accuracy is better than Watson on our data. The infrastructure cost is higher. The latency is 200ms locally versus 800ms to Watson. For our use case, the local model wins.

Negation is still an unsolved problem

"We're not struggling to find customers" should score positive. Most models score it negative because "struggling" is a negative word. Watson handles simple negation reasonably well and fails on anything involving double negatives, sarcasm, or domain-specific constructions. "We don't have a problem with retention" is the kind of sentence that breaks things. Founders use constructions like this constantly because they're pre-empting concerns.

We added a negation detection pass. It's a set of rules, not a model. Rule-based negation handling outperformed every ML approach we tried on our corpus. This is embarrassing to admit in 2018 when everything is supposed to be end-to-end deep learning, but it's true. Sometimes the right answer is a list of patterns.
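A sketch of what a pass like that looks like, with an illustrative pattern list and window size (not our production rule set):

```python
# Rule-based negation pass: flip the polarity of sentiment-bearing
# terms that fall inside a negation window. Negator list and window
# size are illustrative.
import re

NEGATORS = {"not", "don't", "never", "no"}
WINDOW = 3  # tokens after a negator that count as negated

def negated_indices(text: str) -> set[int]:
    """Return indices of tokens inside a negation window."""
    tokens = re.findall(r"[\w']+", text.lower())
    negated = set()
    for i, tok in enumerate(tokens):
        if tok in NEGATORS:
            negated.update(range(i + 1, min(i + 1 + WINDOW, len(tokens))))
    return negated

def score_with_negation(text: str, term_scores: dict[str, float]) -> float:
    tokens = re.findall(r"[\w']+", text.lower())
    flipped = negated_indices(text)
    return sum(
        -term_scores[t] if i in flipped else term_scores[t]
        for i, t in enumerate(tokens) if t in term_scores
    )

# "struggling" is negative on its own, but negated here:
score_with_negation(
    "We're not struggling to find customers",
    {"struggling": -0.7},
)  # -> +0.7
```

Rules like this fail in predictable ways, which is exactly why they were maintainable: when a construction breaks, you add a pattern, not a retraining run.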

What I use now

For document-level sentiment on clean text: Watson NLU is fine. For sentence-level on speech transcripts: fine-tuned model locally. For anything requiring negation handling: the rules layer on top of either.
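That split can be written down as a small router. The backend names and helper functions here are hypothetical, just to show the shape of the dispatch:

```python
# Hypothetical routing of the three setups described above.
# Enum values and cue list are illustrative, not a real API.
from enum import Enum, auto

class Backend(Enum):
    WATSON_NLU = auto()       # document-level, clean text
    LOCAL_FINETUNED = auto()  # sentence-level speech transcripts

def pick_backend(is_transcript: bool) -> Backend:
    return Backend.LOCAL_FINETUNED if is_transcript else Backend.WATSON_NLU

def needs_negation_rules(text: str) -> bool:
    # The rules layer sits on top of either backend when negation
    # cues are present.
    cues = ("not ", "n't ", "never ", "no ")
    return any(c in text.lower() for c in cues)
```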

The thing I've learned about production NLP is that the architecture you end up with looks nothing like what you designed. It has three layers where you expected one. It has rule-based components, added to fix edge cases, that have become load-bearing. It has preprocessing steps that matter more than the model itself.

The papers show you the model. They don't show you the twelve other things you need to make the model work.

With gusto, Fatih.