
Prompt engineering is engineering

The dismissal is wrong. Here's why prompt engineering is a real engineering discipline.

The dismissal goes like this: prompt engineering isn't real engineering, it's just talking to a model, anyone can do it, it'll be automated away. I've heard versions of this from engineers who write code for a living and should know better.

Here's the thing. Every critique they're making of prompt engineering applies equally to software engineering if you squint.

The same failure modes

Prompts fail for the same reasons code fails: ambiguous requirements, edge cases not handled, implicit assumptions that aren't written down, interactions between components that weren't anticipated.

A prompt that works on the examples you tested it on and fails on the examples you didn't is a prompt with insufficient test coverage. Sound familiar? The fix is the same: expand your test set, find the failure cases, handle them explicitly.

A prompt that works today and breaks when the model updates is a prompt with an undocumented dependency on model behavior. Same as code that breaks when a library updates because it was relying on undocumented behavior.

A prompt that produces correct output on simple inputs and wrong output on complex ones is a prompt that doesn't scale. Same as an algorithm that works on small inputs and fails on large ones.

The same disciplines

You wouldn't ship code without tests. You shouldn't ship prompts without evaluation sets. We maintain a test suite for our production prompts the same way we maintain test suites for production code. New prompt changes go through CI. Regressions block deployment.
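A minimal sketch of what such an evaluation harness can look like. Everything here is hypothetical: `run_model` stands in for whatever API call your stack actually makes, and the eval set and threshold are placeholders.

```python
# Hypothetical stand-in for a real model call; in production this
# would hit your model provider's API.
def run_model(prompt: str) -> str:
    return "POSITIVE" if "great" in prompt.lower() else "NEGATIVE"

# Evaluation set: (input, predicate on the output) pairs, kept under
# version control alongside the prompt itself.
EVAL_SET = [
    ("This product is great!", lambda out: out == "POSITIVE"),
    ("Terrible experience.",   lambda out: out == "NEGATIVE"),
]

PROMPT_TEMPLATE = "Classify the sentiment as POSITIVE or NEGATIVE: {text}"

def evaluate(template: str, cases) -> float:
    """Return the pass rate of a prompt template over the eval set."""
    passed = sum(check(run_model(template.format(text=text)))
                 for text, check in cases)
    return passed / len(cases)

# The CI gate: a pass rate below threshold blocks deployment.
rate = evaluate(PROMPT_TEMPLATE, EVAL_SET)
assert rate >= 0.95, f"Prompt regression: pass rate {rate:.2%}"
```

The point isn't the specific checks; it's that prompt changes run against a fixed test set before they ship, exactly like code.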

You wouldn't commit code without version control. Same for prompts. Every production prompt is versioned. Every change has a diff. Every deployment is traceable.

You wouldn't hardcode values in code that need to change. Same for prompts. The parts of a prompt that are task-specific are parameterized. The parts that are fixed are constants. The structure is explicit, not implicit.
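One way to make that structure explicit, sketched with Python's stdlib `string.Template`. The rule text and field names are invented for illustration, not from any particular production prompt.

```python
from string import Template

# Fixed structure lives in constants; task-specific parts are parameters.
SYSTEM_RULES = 'Answer in valid JSON. If unsure, return {"answer": null}.'

PROMPT = Template(
    "$rules\n\n"
    "Task: $task\n"
    "Input: $payload\n"
)

def build_prompt(task: str, payload: str) -> str:
    # Every variable part is a named parameter; substitute() raises
    # KeyError if one is missing, rather than silently shipping a hole.
    return PROMPT.substitute(rules=SYSTEM_RULES, task=task, payload=payload)

prompt = build_prompt("Extract the invoice total",
                      "Invoice #123 ... total: $42.00")
```

Using `substitute` (not `safe_substitute`) makes a missing parameter a loud failure, the prompt-level equivalent of a compile error.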

Where prompt engineering is harder

The failure surface is larger and harder to enumerate. Broken code usually announces itself: it fails to compile, or it throws. A broken prompt produces output that merely looks right, and the space of plausible-but-wrong outputs is far larger than the space of obvious errors. You can have a prompt that produces correct output on 95% of inputs and wrong output on the 5% that happen to matter most. Finding that 5% requires adversarial testing that most engineering teams don't do.

The failure modes are probabilistic. Deterministic code, given the same input, produces the same output. A prompt sent to a sampled model produces output that varies run to run. Testing for this means running the same prompt many times and evaluating the distribution of outputs, not a single pass/fail.
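That distribution-level testing can be sketched like this. `flaky_model` is a seeded stand-in for a sampled model call (right ~90% of the time by construction); the prompt, the threshold, and the sample count are all assumptions for illustration.

```python
import random

def flaky_model(prompt: str, rng: random.Random) -> str:
    # Stand-in for a non-deterministic model: correct ~90% of the time.
    return "42" if rng.random() < 0.9 else "forty-two"

def pass_rate(prompt: str, n: int = 100, seed: int = 0) -> float:
    """Sample the same prompt n times and report the fraction that pass."""
    rng = random.Random(seed)
    hits = sum(flaky_model(prompt, rng) == "42" for _ in range(n))
    return hits / n

# Gate on the distribution, not on one lucky (or unlucky) sample.
rate = pass_rate("What is 6 * 7? Answer with digits only.")
assert rate >= 0.8, f"pass rate {rate:.0%} below threshold"
```

A single green run tells you almost nothing about a sampled model; the pass rate over many runs is the signal worth gating on.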

The model is a dependency you don't control. When OpenAI updates GPT-4, your prompts may behave differently. You find out in production. Software engineers complain when library maintainers break APIs without warning. Prompt engineers deal with this constantly.

The actual skill

Decomposing a task into parts the model can handle reliably. Specifying output format in a way that's both precise and robust to variation. Writing few-shot examples that illustrate the pattern without overfitting to the examples. Handling edge cases explicitly rather than hoping the model generalizes correctly.

These are design and specification skills. They're the same skills that make software engineers good at their work applied to a different medium.

The engineers who dismiss prompt engineering as non-engineering are usually the ones who haven't done it at production scale. At production scale, it looks exactly like engineering. It just doesn't compile.

With gusto, Fatih.