OpenAI released the GPT-3 API in beta last week. I'd been in the access queue for a few weeks and got in early. Here's what it actually does.
The jump from GPT-2
GPT-2 was impressive and limited. The 1.5B parameter model generated coherent text for a few hundred tokens and then lost the thread. You could see the capability. You could also see the ceiling.
GPT-3 is 175 billion parameters. That's not a scaling increment. That's a different class of model. The outputs are qualitatively different in a way that's hard to describe without seeing it directly. The coherence extends across longer passages. The model picks up on patterns in context that GPT-2 couldn't. And the few-shot behavior is new: you can give it examples of a task in the prompt and it performs the task without any fine-tuning.
That last part is the thing I keep returning to.
What few-shot learning actually means
With GPT-2 or BERT, if you want the model to do a specific task, you fine-tune. Collect labeled examples, run the fine-tuning process, evaluate, iterate. For a new domain or a new task structure, you start over. The pipeline from "I want the model to do X" to "the model does X" involves meaningful work.
GPT-3 changes the economics. For many tasks, you can describe the task in the prompt with a few examples and the model performs it. Not perfectly. Not on hard tasks. But well enough that for a large class of applications, the bottleneck shifts from "train a model" to "write a good prompt."
I've been running experiments on pitch analysis tasks we built models for at Zillion Pitches. The specificity detection work took months of labeled data and iteration; GPT-3 does a reasonable version of it with five examples in the prompt. Not better than our trained model. Fast enough to build with.
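The mechanics are simple enough to sketch. This is roughly how a few-shot prompt for a task like specificity detection gets assembled; the instruction, labels, and example sentences here are illustrative stand-ins, not our production data:

```python
def build_few_shot_prompt(examples, query, instruction):
    """Assemble a few-shot prompt: an instruction, a handful of labeled
    examples, then the unlabeled query. The model continues the pattern."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Sentence: {text}")
        lines.append(f"Specificity: {label}")
        lines.append("")
    lines.append(f"Sentence: {query}")
    lines.append("Specificity:")  # the completion starts here
    return "\n".join(lines)

# Illustrative examples -- not real labeled data.
examples = [
    ("We grew revenue 40% quarter over quarter.", "specific"),
    ("We have huge traction in a massive market.", "vague"),
]

prompt = build_few_shot_prompt(
    examples,
    query="Our churn dropped from 6% to 2% after the pricing change.",
    instruction="Label each sentence as specific or vague.",
)
```

That string is the whole "training" step. You send it to the completion endpoint and read the next few tokens. Swapping tasks means swapping the instruction and examples, not retraining anything.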
What it doesn't do
Reason. The model generates plausible continuations of its input. It doesn't understand the content. You can see this with math problems, with causal questions, with anything that requires tracking state across a long context. It produces confident-sounding text that is wrong in ways a human would immediately catch.
This distinction matters for anyone building products on top of it. GPT-3 is not a reasoning engine. It's a very good text completion system with emergent abilities that look like reasoning in some cases and fail visibly in others. Knowing which cases are which requires testing on your specific domain.
The hallucination problem is real. Ask it a factual question and it generates a fluent, authoritative answer that may be completely wrong. For any application where accuracy matters, you need a verification layer. The model itself has no mechanism to express genuine uncertainty.
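A verification layer doesn't have to be sophisticated to be useful. A minimal sketch, assuming you have some trusted source of facts to check against; the lookup table and matching logic here are hypothetical placeholders for whatever your domain actually provides:

```python
# Stand-in for a real knowledge source (database, document index, etc.).
TRUSTED_FACTS = {
    "founding year of acme corp": "1987",
}

def verify(question, model_answer):
    """Pass the model's answer through only when a trusted source
    confirms it; otherwise flag it for review instead of trusting
    the model's fluent-but-possibly-wrong output."""
    expected = TRUSTED_FACTS.get(question.lower().rstrip("?"))
    if expected is not None and expected in model_answer:
        return model_answer, "verified"
    return model_answer, "unverified"

answer, status = verify(
    "Founding year of Acme Corp?",
    "Acme Corp was founded in 1987.",
)
```

The point isn't this particular check; it's that the trust decision lives outside the model, because the model itself can't tell you when it's guessing.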
What I think this means
The barrier to building NLP applications just dropped significantly. The types of applications that required ML expertise, labeled datasets, and training infrastructure a year ago can now be prototyped with a prompt and an API call. That's a real shift.
The second-order effect is that the work moves up the stack. If the model layer is accessible, the differentiation moves to problem definition, prompt engineering, product design, and domain knowledge. The people who build the most useful things with GPT-3 will be the ones who understand the domain problem best, not necessarily the ones who understand transformers best.
We're at the beginning of figuring out what that means. The API has been in beta for a week. The use cases we know about are a fraction of what people will build once access is broad.
This is different from GPT-2. The ceiling moved.
With gusto, Fatih.