Not a hot take. A practical decision made after running both in production for the same tasks. Here's what drove it.
The context
Claude 2 launched last week with a 100K token context window. That's the specific change that made the switch worth revisiting. We'd evaluated Claude 1 earlier this year and found GPT-4 better on the tasks we were using it for. The context window is a different kind of improvement, and it changes the calculus for our use cases.
What changed with 100K context
The internal tooling tasks where we use LLMs are almost all long-context tasks: code review on large PRs, incident summarization from long chat logs, documentation generation from detailed specifications. GPT-4's 8K context window (or 32K with the limited-access 32K model) requires chunking these tasks. Chunking introduces complexity and loses cross-document coherence.
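To make the chunking overhead concrete, here's a minimal sketch of the split-and-merge workaround a small context window forces. Everything here is invented for illustration (the helper names and the rough 4-characters-per-token estimate are assumptions, not our actual pipeline code):

```python
# Sketch of the chunking workaround an 8K context window forces.
# The 4-chars-per-token figure is a crude rule of thumb for English prose,
# not a real tokenizer.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return len(text) // 4

def chunk_document(text: str, max_tokens: int = 6000) -> list[str]:
    """Split on paragraph boundaries so each chunk fits the context window.

    Every split point severs cross-paragraph references -- that's the
    coherence loss a single-pass 100K window avoids.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in text.split("\n\n"):
        para_tokens = estimate_tokens(para)
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk then gets its own model call, and a second pass merges the per-chunk results. That merge step is where cross-document coherence goes to die.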
100K tokens covers our largest PRs in a single pass. It covers full incident timelines without truncation. It covers the entire specification document rather than sections of it. The improvement to output quality from removing chunking is larger than I expected. The model reasoning about the whole is better than the model reasoning about parts and merging the results.
Where Claude 2 performs better
Instruction following on complex, structured prompts. We use prompts with multiple constraints, output format requirements, and conditional logic. Claude's instruction adherence on these is more consistent. GPT-4 occasionally drifts from the specified output format or misses a constraint that was clearly stated. The difference isn't large, but it's consistent enough to matter in automated pipelines, where inconsistent output format breaks downstream processing.
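In an automated pipeline, format drift gets caught (or doesn't) at a validation step like this one. A minimal sketch; the schema and field names are hypothetical, invented purely for illustration:

```python
import json

# Hypothetical schema for an incident-summarization pipeline.
REQUIRED_FIELDS = {"summary", "severity", "action_items"}

def validate_output(raw: str) -> dict:
    """Reject model output that drifts from the required JSON format.

    One drifted response (a markdown fence around the JSON, a missing
    key, a prose preamble) raises here instead of silently corrupting
    downstream processing.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    missing = REQUIRED_FIELDS - parsed.keys()
    if missing:
        raise ValueError(f"output missing required fields: {sorted(missing)}")
    return parsed
```

A model that drifts on 1 in 50 responses means this check fires daily at our volume; a model that drifts on 1 in 500 means it's a rare retry. That's the gap that "isn't large but is consistent."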
Writing quality on internal documentation. This is subjective but consistent across the team: the documentation drafts from Claude read more naturally. Less hedging, better sentence structure, less tendency to produce bullet lists when prose would communicate better.
Where GPT-4 still performs better
Mathematical reasoning in code. For tasks involving complex logic, algorithms, or numerical analysis in code review, GPT-4 is more reliable. Claude 2 makes errors on these that GPT-4 doesn't.
Coverage of specific technical domains. GPT-4's training data coverage of niche technical areas is broader. For the less common libraries and frameworks we use, Claude occasionally lacks context that GPT-4 has.
The practical decision
For our highest-volume internal tooling use cases: Claude 2. The context window advantage dominates for long-document tasks, and the instruction following is more reliable for automated pipelines.
For code review on complex algorithms and for tasks where domain-specific technical depth matters more than context length: GPT-4.
Running both isn't ideal. The switching cost is low enough that we're doing it for now. If I had to pick one for everything, I'd pick Claude 2 for our specific mix of tasks.
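The two-model split above amounts to a small routing rule. A sketch of how that decision looks in code; the task labels and token threshold are assumptions I'm inventing for illustration, and the model identifiers are the public API names as of mid-2023:

```python
def pick_model(task: str, estimated_tokens: int) -> str:
    """Route a request per the criteria in this post.

    Hypothetical task labels; the 8000-token cutoff reflects
    GPT-4's base context window.
    """
    # Algorithm-heavy or niche-domain review: GPT-4's reasoning and
    # training coverage win -- but only if the input actually fits.
    domain_tasks = {"algorithm_review", "niche_library_review"}
    if task in domain_tasks and estimated_tokens < 8000:
        return "gpt-4"
    # Everything else -- long-context work and automated pipelines --
    # goes to Claude 2 for the 100K window and steadier formatting.
    return "claude-2"
```

The fallthrough is deliberate: when a domain-heavy task is also too long to fit GPT-4's window, context length wins and it routes to Claude 2 anyway.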
What I'm not saying
This is not a general claim about which model is better. It's a specific claim about which model performs better for our specific tasks, evaluated on our specific data, in mid-2023. The rankings will change. Both models are improving. Evaluate on your own data for your own tasks. Anybody telling you one is definitively better without specifying the task is not giving you useful information.
With gusto, Fatih.