
AI-assisted code review: the system we built and what it actually catches

Six months of AI-assisted code review in production. The architecture, the results, and the things it still misses.


The system

Every pull request triggers a review pipeline. The pipeline pulls the diff, builds context from the surrounding code, and sends it to GPT-4 with a structured prompt. The output is a set of review comments posted to the PR alongside human reviewer comments. The AI comments are labeled as such. Human reviewers see them but aren't required to act on every one.
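The shape of that pipeline is simple enough to sketch. This is a minimal illustration, not our actual code: `build_context`, `call_model`, and `post_comment` are hypothetical stand-ins for the real diff-context builder, the GPT-4 call, and the PR comment API.

```python
from dataclasses import dataclass


@dataclass
class ReviewComment:
    path: str
    line: int
    body: str
    source: str  # "ai" or "human", so reviewers can tell them apart


def label_ai_comment(body: str) -> str:
    """Prefix AI-generated comments so they are visibly labeled on the PR."""
    return f"[AI review] {body}"


def run_review_pipeline(diff, build_context, call_model, post_comment):
    """Sketch: diff -> context -> model -> labeled comments posted to the PR."""
    context = build_context(diff)
    # call_model is assumed to return (path, line, body) tuples
    raw = call_model(diff, context)
    comments = [
        ReviewComment(path, line, label_ai_comment(body), source="ai")
        for path, line, body in raw
    ]
    for comment in comments:
        post_comment(comment)
    return comments
```

The one design decision that matters here is the explicit `source` field: humans see the AI comments inline, but nothing forces them to act on any of them.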

The context building is the part that required the most work. A raw diff without context produces poor reviews: the model doesn't know what the function is supposed to do, what invariants the surrounding code relies on, or what changed recently in adjacent files. We pull three layers of context: the changed functions with their signatures, the files that import the changed modules, and the recent git history for the changed files. The prompts get longer, but the review quality is substantially better.
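The first two layers can be built statically. Here's a rough sketch using Python's `ast` module; the third layer, recent git history, would come from shelling out to `git log -- <path>` and is omitted here. The function names and the in-memory `sources` dict are illustrative, not our real interface.

```python
import ast


def extract_signatures(source: str) -> list[str]:
    """Layer 1: signatures of the functions defined in a changed file."""
    tree = ast.parse(source)
    sigs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args})")
    return sigs


def find_importers(module_name: str, sources: dict[str, str]) -> list[str]:
    """Layer 2: files whose import statements mention the changed module."""
    importers = []
    for path, src in sources.items():
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.Import):
                if any(alias.name == module_name for alias in node.names):
                    importers.append(path)
            elif isinstance(node, ast.ImportFrom) and node.module == module_name:
                importers.append(path)
    return sorted(set(importers))
```

Walking the AST instead of grepping for the module name avoids false positives from strings and comments.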

The prompt structure is deliberately narrow. We don't ask for general feedback. We ask for specific categories: type errors the static analysis missed, functions with side effects not reflected in their names, error paths that aren't handled, and logic that diverges from the surrounding code's style. Narrow prompts produce reviewable output. General prompts produce walls of text that reviewers skip.
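A narrow prompt along those lines might be assembled like this. The wording and output format are a plausible sketch, not our production prompt:

```python
# The four categories from the post, pinned in code so the prompt
# can't drift toward "general feedback".
REVIEW_CATEGORIES = [
    "type errors the static analysis missed",
    "functions with side effects not reflected in their names",
    "error paths that aren't handled",
    "logic that diverges from the surrounding code's style",
]


def build_review_prompt(diff: str, context: str) -> str:
    """Build a narrow review prompt: fixed categories, structured output."""
    categories = "\n".join(f"- {c}" for c in REVIEW_CATEGORIES)
    return (
        "Review the following diff. Comment ONLY on these categories:\n"
        f"{categories}\n\n"
        f"Context:\n{context}\n\n"
        f"Diff:\n{diff}\n\n"
        "Return one comment per finding, formatted as <file>:<line>: <comment>. "
        "If there are no findings, return nothing."
    )
```

The "return nothing" escape hatch matters: without it the model pads empty reviews with generic advice, which is exactly the wall-of-text failure mode.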

What it catches

Type inconsistencies that mypy misses. This happens more than I expected. We use Python with type hints, and mypy catches most type errors. GPT-4 catches cases where the types are technically consistent but the usage is semantically wrong: passing a list where the downstream function expects an iterator and treats the two differently, so the call type-checks but behaves wrong.
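A minimal example of the list-versus-iterator divergence (constructed for illustration, not from our codebase): both calls below satisfy the `Iterable[int]` hint, so mypy is happy, but the function silently assumes it can iterate its argument twice.

```python
from typing import Iterable


def pair_with_max(values: Iterable[int]) -> list[tuple[int, int]]:
    """Pairs each value with the maximum. Iterates `values` twice,
    which type-checks for any Iterable but breaks for a one-shot iterator:
    max() exhausts it, so the second pass sees nothing."""
    m = max(values)
    return [(v, m) for v in values]


pair_with_max([1, 2, 3])        # [(1, 3), (2, 3), (3, 3)]
pair_with_max(iter([1, 2, 3]))  # [] -- the iterator was consumed by max()
```

Mypy sees two valid `Iterable[int]` arguments; the model sees that the function's body assumes list semantics.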

Missing error cases. The model is reliable at spotting code paths that don't handle failures. Not perfect. But it catches a category of bugs that human reviewers miss when they're moving fast because the happy path looks correct.
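The typical shape of that catch, reduced to a toy example (both functions are illustrative, not from our code): the happy path parses fine, so a fast human review sails past it, but the failure path leaks a raw traceback.

```python
import json


def load_config(raw: str) -> dict:
    # Happy path only: json.loads raises JSONDecodeError on malformed
    # input, and the caller gets a bare traceback with no indication
    # of which input was bad. This is what the review flags.
    return json.loads(raw)


def load_config_checked(raw: str) -> dict:
    # The kind of fix an AI review comment typically suggests:
    # catch the failure path and raise a useful, typed error.
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"config is not valid JSON: {exc}") from exc
```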

Functions doing too much. This is a code smell catch, not a bug catch. The model flags functions above a certain complexity threshold that could be split. We keep these comments but deprioritize them. Useful for refactoring sessions, not urgent.
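For comparison, a crude static version of that complexity check is easy to build; the point is that the model's version needs no threshold tuning. This sketch counts branching nodes as a rough proxy for complexity (the threshold of 8 is an arbitrary example, not our setting):

```python
import ast


def branch_count(source: str) -> int:
    """Rough complexity proxy: count branching constructs in the source."""
    branching = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)
    return sum(isinstance(node, branching) for node in ast.walk(ast.parse(source)))


def too_complex(source: str, threshold: int = 8) -> bool:
    """Flag functions that likely do too much and could be split."""
    return branch_count(source) > threshold
```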

What it misses

Business logic errors. If a function implements the wrong algorithm but implements it correctly, the model doesn't know. It has no access to the specification. It can only compare the code to what looks normal for similar patterns in its training data. A correctly implemented wrong thing is invisible to it.

Performance issues that require understanding data volume. The model doesn't know that a particular query runs against a table with 50 million rows, so it won't flag a join condition that only becomes slow at that scale. Context about scale requires context about the system, and we don't provide that in the prompt.

Security issues that require domain knowledge. Standard injection patterns it catches. Subtle authorization logic errors in our specific permission model it doesn't.

The numbers after six months

We track which AI review comments get addressed by the author versus dismissed. Addressed rate is around 40%. That's higher than I expected. The other 60% are dismissed, usually because the human reviewer disagrees or the issue is already known.
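The metric itself is trivial to compute once each comment's outcome is recorded. A sketch, assuming a per-comment outcome label of "addressed" or "dismissed" (the label names are illustrative):

```python
from collections import Counter


def addressed_rate(outcomes: list[str]) -> float:
    """Fraction of AI review comments the author addressed rather
    than dismissed. Returns 0.0 when there are no resolved comments."""
    counts = Counter(outcomes)
    resolved = counts["addressed"] + counts["dismissed"]
    return counts["addressed"] / resolved if resolved else 0.0
```

The interesting work isn't the arithmetic; it's attributing the outcome, i.e. deciding from PR activity whether a comment was actually acted on.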

Of the addressed comments, we've traced three bugs that the AI review caught before they would have made it to staging. Three bugs over six months isn't a dramatic number, but the cost of the system is low enough that three prevented bugs pay for it.

What we'd change

Tighter context selection. The current approach pulls context by file proximity. A better approach pulls context by semantic relevance: what does this change touch in terms of behavior, not just file structure. We haven't built this yet.

Feedback loop. We don't currently feed dismissed comments back into prompt refinement. The model keeps suggesting things that experienced reviewers consistently dismiss. That's wasted noise. Building a feedback loop that reduces it is on the roadmap.

With gusto, Fatih.