GPT-4V launched this month: a general-purpose language model that can also process images. I've been building computer vision products for three years. Here's what I think it means.
What GPT-4V actually does
It takes an image and text as input and produces text as output. You can ask it to describe an image, answer questions about an image, compare two images, or reason about visual content in the context of a text prompt. The visual reasoning quality is genuinely impressive on natural scenes, documents, charts, and standard object recognition tasks.
What it doesn't do is run fast. The inference latency is incompatible with real-time visual inspection in logistics. For our core use case, it doesn't replace what we built.
What it does change
The workflow around our core product. The anomaly explanation use case I've been thinking about for months: GPT-4V can look at the flagged image, compare it against a reference image of an authentic product, and generate a natural language explanation of why they differ. That's genuinely useful. It's a task that would have required building a specialized comparison model or writing complex post-processing logic. Now it's a prompt.
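To make that concrete, here is a minimal sketch of what such a two-image prompt could look like as an OpenAI chat completions request body. The model name and message shape follow OpenAI's vision API as documented at launch; the URLs, prompt wording, and function name are illustrative assumptions, not our production code.

```python
# Sketch: build a GPT-4V request that compares a flagged image against a
# reference image and asks for a plain-language explanation of the difference.
# The image URLs and prompt wording are placeholders.

def build_comparison_request(reference_url: str, flagged_url: str) -> dict:
    """Return the JSON body for a two-image comparison prompt."""
    return {
        "model": "gpt-4-vision-preview",
        "max_tokens": 300,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": ("The first image is an authentic reference "
                              "product. The second was flagged as anomalous. "
                              "Explain, in plain language, the visual "
                              "differences that could justify the flag.")},
                    {"type": "image_url", "image_url": {"url": reference_url}},
                    {"type": "image_url", "image_url": {"url": flagged_url}},
                ],
            }
        ],
    }

payload = build_comparison_request(
    "https://example.com/reference.jpg",   # placeholder URLs
    "https://example.com/flagged.jpg",
)
# POST this payload to https://api.openai.com/v1/chat/completions with an API
# key; the explanation comes back as the assistant message's text content.
```

The point is how little machinery is involved: the "specialized comparison model" collapses into a request body and a prompt.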
Quality control on training data. Reviewing labeled images to catch mislabeled examples is tedious manual work. GPT-4V can do a first pass that catches obvious labeling errors before human review. We've been testing this and the false positive rate on mislabeled examples is low enough to be useful.
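A first pass like this mostly comes down to a per-image prompt and a verdict parser. The sketch below assumes a strict MATCH/MISMATCH answer protocol; the protocol, helper names, and example replies are illustrative, not what we actually run.

```python
# Sketch of the training-data first pass: ask GPT-4V "does this label match
# the image?" per example, then parse its verdict and queue mismatches for
# human review. The MATCH/MISMATCH protocol is an assumed convention.

def qc_prompt(label: str) -> str:
    """Text half of the per-image prompt; the image rides along as image_url."""
    return (f"This image is labeled '{label}'. Answer with exactly MATCH if "
            f"the label fits the image, or MISMATCH plus a one-line reason "
            f"if it does not.")

def parse_verdict(model_reply: str) -> bool:
    """True if the model thinks the label is wrong (i.e. flag for review)."""
    return model_reply.strip().upper().startswith("MISMATCH")

def first_pass(replies: dict[str, str]) -> list[str]:
    """Map image_id -> model reply; return image_ids to send to a human."""
    return [image_id for image_id, reply in replies.items()
            if parse_verdict(reply)]

# Example replies shaped like what the model might return:
flagged = first_pass({
    "img_001": "MATCH",
    "img_002": "MISMATCH: labeled 'pallet' but the image shows a forklift.",
    "img_003": "match",
})
# flagged == ["img_002"]
```

Only the flagged subset goes to human review, which is where the time savings come from.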
Documentation generation from visual assets. Generating technical documentation from architectural diagrams, flow charts, hardware setup photos. Previously required someone to describe the image in text before the language model could work with it. Now it's a direct input.
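For local assets like diagrams, the documented way to pass a non-hosted image to the vision API is a base64 data URL. A sketch, with an illustrative file path and prompt:

```python
# Sketch: feed a local architecture diagram directly to GPT-4V and ask for a
# markdown documentation draft. The file path and prompt text are placeholders.
import base64

def image_to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL accepted in image_url fields."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

def build_docgen_request(diagram_data_url: str) -> dict:
    """Return the JSON body asking for a docs section from one diagram."""
    return {
        "model": "gpt-4-vision-preview",
        "max_tokens": 800,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Write a markdown 'Architecture' section describing "
                          "the components and data flow in this diagram.")},
                {"type": "image_url", "image_url": {"url": diagram_data_url}},
            ],
        }],
    }

# Usage with a local file (path is illustrative):
#   with open("diagram.png", "rb") as f:
#       request = build_docgen_request(image_to_data_url(f.read()))
```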
The harder question
Does GPT-4V change the competitive landscape for CV startups?
For commodity CV tasks, yes, eventually. Object detection, image classification, OCR, basic scene understanding: the multi-modal foundation models will get good enough that building custom models for these is hard to justify. If your product is doing something that GPT-4V does adequately, the product is under pressure.
For domain-specific CV at production scale, no, not yet. The accuracy on specialized domains (logistics, medical imaging, industrial inspection) is not competitive with domain-specific models fine-tuned on domain-specific data. The latency is not compatible with real-time requirements. The deployment architecture (edge hardware, on-premise for compliance) is not supported.
The timeline matters. If the gap between general-purpose multi-modal models and domain-specific models closes in two years, the defense for specialized CV products is weaker than it looks today. If the gap persists, the defense is durable.
What we're doing
Integrating GPT-4V into the non-real-time parts of our product where it adds value. Anomaly explanation and training data review are the immediate applications.
Building faster and deeper on the parts of the product that foundation models can't easily replicate: the domain-specific model tuned on our data, the edge deployment architecture, the logistics-specific integration layer.
The CV products that survive the next few years will be the ones that are differentiated on domain depth and deployment context, not on general visual recognition capability. Foundation models will commoditize the latter. They'll take longer on the former.
With gusto, Fatih.