
What Zillion Pitches taught me about building with AI

Two years, several hundred pitches analyzed. What we actually learned about building AI products.

Zillion Pitches is winding down. Two years of building AI for founders. Several hundred pitches analyzed. Time to say what we actually learned.

The product worked. The business was harder.

The pitch analysis product did what it was supposed to do. Structural detection, delivery scoring, transcript review. Founders found it useful, particularly for the structural feedback. The NPS was good. The retention was fine.

The business model was harder. Who pays for pitch coaching, how much, and how often are questions we didn't answer as well as we answered the technical questions. We optimized for building the ML pipeline and under-invested in understanding the customer's willingness to pay and the frequency of use. Pitch practice is a periodic need, not a daily one. The unit economics were difficult.

There's a lesson here about AI products specifically: technical differentiation is necessary but not sufficient. Building something technically interesting is easier than building a distribution strategy for something technically interesting.

What the ML got right and wrong

We got structural analysis right. The model for detecting whether a pitch contains clear problem, solution, market, and traction sections is accurate and useful. Founders who score low on structure have fixable problems. The feedback is actionable. This part worked.
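Roughly, the check reduces to coverage. Here's a minimal sketch of that logic, assuming an upstream sentence classifier has already tagged each transcript sentence with a section label; the label names and threshold are illustrative, not our production pipeline.

```python
from collections import Counter

SECTIONS = ["problem", "solution", "market", "traction"]

def structure_report(sentence_labels, min_sentences=2):
    """Given per-sentence section labels from some upstream classifier,
    report which canonical pitch sections are clearly present."""
    counts = Counter(label for label in sentence_labels if label in SECTIONS)
    present = {s: counts[s] >= min_sentences for s in SECTIONS}
    score = sum(present.values()) / len(SECTIONS)
    missing = [s for s, ok in present.items() if not ok]
    return {"score": score, "missing": missing}

# Illustrative pitch that never covers the market.
labels = ["problem", "problem", "solution", "solution", "traction", "traction", "other"]
print(structure_report(labels))
# {'score': 0.75, 'missing': ['market']}
```

The reason the feedback lands is visible in the output: a missing section is a concrete thing to go fix.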

We got tone analysis wrong for longer than we should have. I wrote about this in July: the Tone Analyzer was theater. We knew that relatively early and kept it in the product anyway because it looked good in demos. Removing it was the right call, and we should have made it sooner.

The specificity detector we were building when I wrote about it in May is the most interesting technical piece we built. Training a model to distinguish genuine specificity from specificity theater required labeled data, and the labeling criteria were hard to write down clearly. "We have forty paying customers" is specific. "We have strong traction" is vague. "We're seeing great momentum across three verticals" sounds specific and isn't. The boundary required judgment that was hard to operationalize. We built something that works reasonably well. It would need more training data to be production-reliable.
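To make the setup concrete, here's a toy sketch using the three claims above, assuming scikit-learn's TF-IDF and logistic regression as a stand-in rather than the model we actually trained. At this scale it proves nothing; it only shows the shape of the labeling problem.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# The three claims from the paragraph above, labeled the way we labeled them.
claims = [
    "We have forty paying customers",                       # genuinely specific -> 1
    "We have strong traction",                              # vague -> 0
    "We're seeing great momentum across three verticals",   # sounds specific, isn't -> 0
]
labels = [1, 0, 0]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
baseline.fit(claims, labels)

# Hypothetical unseen claim; at toy scale the probability is meaningless.
print(baseline.predict_proba(["We closed twelve pilots last quarter"]))
```

The pipeline is the trivial part. Writing down why the third claim gets a 0, and applying that judgment consistently across hundreds of labels, was the hard part.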

What I'd do differently

Talk to more customers before building the ML. We had a product hypothesis and we built the ML to test it. The right sequence was to validate that the problem was real and that founders would pay for a solution, and then build. We reversed the order because the ML was more interesting to build.

Instrument the product from day one to measure whether the AI output is actually changing user behavior. We added instrumentation later and learned things we should have known in month three.
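Concretely, what we lacked was a per-suggestion record of what the founder did next. A sketch of the kind of event I mean, with hypothetical names and whatever analytics sink you already run:

```python
import time
import uuid

def log_feedback_event(sink, user_id, pitch_id, suggestion_id, action):
    """Record what a founder did with one piece of AI feedback.

    `action` might be "viewed", "dismissed", "edited_pitch", or "re_recorded";
    `sink` stands in for whatever analytics backend is already in place."""
    sink.write({
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "pitch_id": pitch_id,
        "suggestion_id": suggestion_id,
        "action": action,
    })
```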

Don't ship AI features you can't explain to a non-technical user in one sentence. If the explanation requires the word "model" or "algorithm," the feature is probably not ready.

What I'm taking forward

Building NLP in production for two years taught me more than any paper could. The Watson integration, the fine-tuning experiments, the specificity detector, the gap between benchmark accuracy and production accuracy. These are things you only learn by running real data through real systems at real latency.

The more durable lesson is about the relationship between AI capability and product value. They're related but not the same. A more accurate model doesn't automatically produce a more valuable product. The value is downstream of the model, in what the output enables the user to do. Getting that right requires understanding the user at least as well as you understand the model.

We understood the model better than the user. That's fixable in the next thing.

With gusto, Fatih.