BERT dropped in October. If you're doing NLP and haven't read the paper yet, read it this week. It's going to matter.
That said, most production NLP in 2018 looks nothing like the research frontier, and I think it's worth writing down what the field actually is versus what the papers suggest it is.
What most production NLP looks like
Regular expressions. Not as a joke. As a load-bearing part of the system.
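To make "load-bearing" concrete, here's a hedged sketch of the kind of regex I mean. The pattern and the normalization table are hypothetical, not from any real system: pulling dollar amounts out of free text, where a model could do it but a regex is exact, fast, and auditable.

```python
import re

# Hypothetical load-bearing pattern: match dollar amounts like
# "$1.2M" or "$500k" and normalize them to plain floats.
MONEY_RE = re.compile(
    r"\$\s?(\d+(?:\.\d+)?)\s?(k|m|b|thousand|million|billion)?",
    re.IGNORECASE,
)

MULTIPLIERS = {"k": 1e3, "thousand": 1e3, "m": 1e6, "million": 1e6,
               "b": 1e9, "billion": 1e9, None: 1.0}

def extract_amounts(text):
    """Return dollar amounts found in text, normalized to floats."""
    out = []
    for number, suffix in MONEY_RE.findall(text):
        key = suffix.lower() if suffix else None
        out.append(float(number) * MULTIPLIERS[key])
    return out
```

When this pattern misfires on some new phrasing, someone adds another alternation branch. That's how the rules accumulate.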
Information extraction pipelines that are mostly hand-built rules with a named entity recognition model somewhere in the middle. The model handles the general case. The rules handle the cases the model gets wrong that matter most to the product. The rules accumulate. Nobody deletes them.
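The structure of that pattern fits in a few lines. This is a toy sketch, with a stub standing in for the NER model and made-up override entries, but the shape — model proposes, rules veto — is the real one.

```python
# Minimal sketch of the model-plus-rules pattern. `model_extract` is a
# stand-in for a real NER model; the override entries are hypothetical.

def model_extract(text):
    """Stub NER model: naively tags capitalized tokens as ORG."""
    return [(tok, "ORG") for tok in text.split() if tok[0].isupper()]

# Hand-built overrides accumulate here. Nobody deletes them.
RULE_OVERRIDES = {
    "Q4": None,      # model tags it as ORG; it's a fiscal quarter
    "Monday": None,  # days of the week aren't companies
}

def extract_entities(text):
    """Model handles the general case; rules patch the cases that matter."""
    entities = []
    for token, label in model_extract(text):
        if token in RULE_OVERRIDES:
            label = RULE_OVERRIDES[token]  # rule wins over model
        if label is not None:
            entities.append((token, label))
    return entities
```

Every entry in that override dict encodes a product-specific failure someone hit in production.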
Sentiment analysis that was calibrated on review data, applied to a completely different text domain, with a correction layer on top that nobody fully understands.
Keyword extraction that is 30% TF-IDF, 40% domain-specific stop-word lists, and 30% someone's intuition about what matters, frozen into a config file two years ago.
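For the curious, the TF-IDF part of that recipe is only a few lines. A minimal sketch, assuming a made-up frozen stop-word list; the smoothing choice is mine, not from any particular system.

```python
import math
from collections import Counter

# Hypothetical frozen config: the domain stop-word list someone curated
# years ago and nobody has touched since.
DOMAIN_STOPWORDS = {"the", "a", "of", "and", "company", "product"}

def tfidf_keywords(doc, corpus, top_k=3):
    """Rank terms in `doc` by TF-IDF against `corpus` (a list of docs),
    after dropping the domain stop-words."""
    tokens = [t for t in doc.lower().split() if t not in DOMAIN_STOPWORDS]
    tf = Counter(tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d.lower().split())
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        scores[term] = (count / len(tokens)) * idf
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```

The other 60% of the recipe — the stop-word list and the intuition — lives in that config dict and in the weighting choices, which is exactly why it's hard to replace.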
I'm not criticizing any of this. It works. The heuristics encode real knowledge. The question is how long it works as the field moves.
What changed in 2018
The ELMo paper in February showed that contextual word representations work better than static ones. Word2Vec gives you one vector per word regardless of context. ELMo gives you representations that shift based on surrounding words. "Pitch" means something different in a baseball context than in a sales context. Models that know this outperform models that don't.
The ULMFiT paper showed that transfer learning, standard practice in computer vision since about 2014, works for NLP too. You pre-train a language model on a large general corpus, then fine-tune on your domain data with relatively few examples. The fine-tuned model beats a model trained from scratch on your domain data even when you have more of it. This is a significant shift. It means domain adaptation gets cheaper.
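The idea is easier to see in a toy. This is nothing like the actual ULMFiT recipe (an LSTM language model with discriminative fine-tuning and slanted triangular learning rates); it's a deliberately tiny bigram-count stand-in, just to show "continue from general-corpus knowledge" versus "start from zero." All names and corpora here are made up.

```python
from collections import Counter

def bigrams(text):
    toks = text.lower().split()
    return list(zip(toks, toks[1:]))

def train_counts(texts, counts=None):
    """Train from scratch (counts=None) or continue from existing counts,
    the count-model analogue of fine-tuning a pre-trained model."""
    counts = Counter() if counts is None else counts
    for t in texts:
        counts.update(bigrams(t))
    return counts

general = ["the deal is closed", "the team is strong", "the market is big"]
domain = ["the pitch is strong"]

pretrained = train_counts(general)
fine_tuned = train_counts(domain, counts=Counter(pretrained))
from_scratch = train_counts(domain)

# The fine-tuned counts already know bigrams like ("is", "strong") from
# the general corpus; the from-scratch model has seen each one only once.
```

In the real thing the "counts" are millions of LSTM weights and the general corpus is Wikipedia-scale, but the asymmetry is the same: the pre-trained starting point carries knowledge your small domain corpus never has to teach.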
Then BERT. Bidirectional transformers pre-trained on masked language modeling and next-sentence prediction, then fine-tuned on 11 NLP benchmark tasks, beating the previous state of the art on all of them. It's not incremental. It's a different class of model.
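Masked language modeling is worth pausing on, because it's the part that makes bidirectionality possible: hide some tokens, train the model to recover them from both sides of the context. A simplified sketch of the data side of that objective (the paper's actual recipe replaces only 80% of chosen tokens with [MASK], swaps 10% for a random token, and leaves 10% unchanged):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Build a masked-LM training pair: an input with some tokens hidden,
    plus labels the model must recover. Simplified relative to BERT,
    which uses an 80/10/10 mask/random/keep split on chosen tokens."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append("[MASK]")
            labels.append(tok)   # model is trained to predict this
        else:
            inputs.append(tok)
            labels.append(None)  # no loss on unmasked positions
    return inputs, labels
```

Because the model sees the full sentence minus the masked positions, it can use context from both directions — which a left-to-right language model, by construction, cannot.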
What this means for production systems in 2019
The rules aren't going away immediately. The BERT fine-tuning workflow requires labeled data and infrastructure that most small teams don't have yet. The practical production stack for a startup doing NLP in early 2019 will still have a lot of hand-built logic in it.
But the ceiling just moved. The argument that you can't do good NLP without massive training data just got weaker. The argument that domain-specific models require domain-specific pre-training got weaker. The number of cases where the right answer is "fine-tune BERT" rather than "build more rules" is going to increase quickly.
We're evaluating it for Zillion Pitches now. The Watson NLU endpoints work and we've tuned them. The question is whether BERT fine-tuned on our corpus of pitch transcripts beats what we have. My guess is yes, on entity extraction. Less clear on sentiment where our custom rules are doing specific work.
The gap between research and production
The gap is about 18 months in both directions. Production systems are running techniques from 18 months ago because it takes time to evaluate, integrate, and stabilize new approaches. Research papers are reporting results on benchmarks that are 18 months ahead of what you can reliably deploy.
This isn't a failure. It's the normal shape of applied ML work. The useful skill is knowing which research is directionally right and will land in production, versus which is interesting but won't make it past the integration cost.
ELMo and BERT are going to land in production. Probably by late 2019, definitely by 2020, there will be usable fine-tuning tooling that makes these accessible without a research team.
The heuristics will still be there underneath. They always are.
With gusto, Fatih.