We got early API access to GPT-4 in late November. Six weeks in, here's the honest account.
What changed
Code review. We use GPT-4 to review pull requests on the backend services. Not to replace human review, but as a first pass that catches things humans miss when they're moving fast: type inconsistencies, obvious logic errors, functions that do too much. The output quality is high enough that it has caught real bugs before human review. We kept it.
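A minimal sketch of what that first pass looks like, assuming the OpenAI Python client; the helper names and prompt wording here are illustrative, not our exact pipeline:

```python
# Hypothetical sketch of a first-pass PR review step. REVIEW_FOCUS and
# build_review_prompt are illustrative names, not our production code.

REVIEW_FOCUS = (
    "type inconsistencies, obvious logic errors, "
    "and functions that do too much"
)

def build_review_prompt(diff: str, service: str) -> str:
    """Assemble the first-pass review prompt for one pull request diff."""
    return (
        f"You are reviewing a pull request against the {service} service.\n"
        f"Flag only: {REVIEW_FOCUS}.\n"
        "Skip style comments. Cite file and line for each finding.\n\n"
        f"Diff:\n{diff}"
    )

# The actual model call (needs an API key; shown for shape only):
# import openai
# response = openai.ChatCompletion.create(
#     model="gpt-4",
#     messages=[{"role": "user",
#                "content": build_review_prompt(diff, "billing")}],
#     temperature=0,
# )
```

Keeping the prompt narrow (a fixed list of defect classes, no style nits) is what made the output reviewable rather than noisy.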
Documentation. First drafts of internal documentation now start with GPT-4 output that a human edits. The model writes coherent technical documentation from context we provide: function signatures, a description of what the system does, example inputs and outputs. The first draft is about 70% usable. The editing time is less than writing from scratch. We kept this too.
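The context we hand the model can be sketched as a small structure; the field names and prompt here are assumptions for illustration:

```python
# Illustrative sketch of assembling context for a documentation first draft.
# DocContext and its fields are hypothetical, not our exact schema.
from dataclasses import dataclass, field

@dataclass
class DocContext:
    system_description: str
    signatures: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)  # "input -> output" pairs

def build_doc_prompt(ctx: DocContext) -> str:
    """Turn the collected context into a single drafting prompt."""
    sigs = "\n".join(ctx.signatures)
    examples = "\n".join(ctx.examples)
    return (
        "Write internal technical documentation for the system below.\n"
        f"System: {ctx.system_description}\n\n"
        f"Function signatures:\n{sigs}\n\n"
        f"Example inputs and outputs:\n{examples}\n\n"
        "A human will edit this draft; prefer accuracy over polish."
    )
```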
Incident summaries. When something breaks, we write a summary for the postmortem. GPT-4 drafts these from the incident timeline and chat logs. Same 70% usable figure. The summary writing part of a postmortem was always the part nobody wanted to do. This makes it fast enough that it actually happens.
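The drafting step is the same shape as the others; this sketch shows one detail that matters in practice, trimming the chat log so the prompt fits the context window. The character budget and wording are assumptions:

```python
# Minimal sketch of drafting a postmortem summary from a timeline and chat
# logs. MAX_LOG_CHARS is an assumed budget, not a measured limit.

MAX_LOG_CHARS = 12_000  # rough cap so the prompt fits the context window

def build_postmortem_prompt(timeline: list[str], chat_log: str) -> str:
    """Combine the incident timeline and (truncated) chat log into a prompt."""
    log = chat_log[-MAX_LOG_CHARS:]  # keep the most recent discussion
    events = "\n".join(f"- {e}" for e in timeline)
    return (
        "Draft an incident summary for a postmortem.\n"
        "Cover impact, detection, root cause, resolution. Neutral tone, no blame.\n\n"
        f"Timeline:\n{events}\n\n"
        f"Incident channel log (may be truncated):\n{log}"
    )
```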
What didn't change
Architecture decisions. We tried using GPT-4 as a sounding board for system design questions. It gives coherent, plausible answers that are often subtly wrong in ways that aren't obvious until you've thought through the problem yourself. It's good at describing general patterns. It's not reliable for the specific constraints of our system. We stopped using it here.
Debugging. The model explains errors confidently and incorrectly often enough that using it for debugging costs more time than it saves. The exception is standard library errors or framework-specific issues with clear documentation. For those, it's faster than searching. For anything involving our codebase specifically, it's not reliable.
Customer-facing content. We tested using it for client communications. The tone was consistently wrong: too formal, too hedged, occasionally presumptuous. The editing cost was comparable to writing from scratch, so we dropped it.
The pattern
GPT-4 is reliable where the task is well-defined, the output is verifiable, and the cost of a wrong answer is low (because the human is reviewing it anyway). It's unreliable where the task requires specific knowledge of your system, or where a plausible-sounding wrong answer is more dangerous than no answer.
The mistake I see other teams making: using it in the second category based on performance in the first.
On the model quality jump from GPT-3.5
It's real. The reasoning on complex prompts is noticeably better, instruction following is more consistent, and context handling has improved. The jump from GPT-3 to GPT-3.5 was incremental. The jump from GPT-3.5 to GPT-4 is larger.
Whether it justifies the price difference depends entirely on your use case. For the workflows we kept it in, yes.
With gusto, Fatih.