
Building pitch analysis with AI: what the algorithm can and can't tell you about a founder

Several hundred pitches analyzed. What NLP actually surfaces — and where the ceiling is.

We've processed several hundred pitches now. Enough to start seeing patterns. Not enough to publish a paper, but enough to have opinions about what NLP actually surfaces when you point it at a founder talking about their company.

The short version: the algorithm is good at structure and bad at truth.

What we built

The pipeline is roughly this. Founder records a pitch, three to five minutes. We run Watson STT to get a transcript. We chunk the transcript into segments and run NLU on each: entities, keywords, sentiment, categories. We run the Tone Analyzer on the full document. We time the segments to calculate pacing. We flag filler words. We check whether certain structural elements appear: problem statement, solution, market size, traction, ask.
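A minimal sketch of that pipeline, assuming the ibm-watson Python SDK with IAM authentication. The API keys and version strings are placeholders, the chunker here is a naive stand-in for ours, and error handling is omitted:

```python
# Sketch of the pipeline, assuming the ibm-watson Python SDK and IAM auth.
# API keys and version strings are placeholders.
from ibm_watson import SpeechToTextV1, NaturalLanguageUnderstandingV1, ToneAnalyzerV3
from ibm_watson.natural_language_understanding_v1 import (
    Features, EntitiesOptions, KeywordsOptions, SentimentOptions, CategoriesOptions,
)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

stt = SpeechToTextV1(authenticator=IAMAuthenticator("STT_API_KEY"))
nlu = NaturalLanguageUnderstandingV1(
    version="2021-08-01", authenticator=IAMAuthenticator("NLU_API_KEY")
)
tone = ToneAnalyzerV3(
    version="2017-09-21", authenticator=IAMAuthenticator("TONE_API_KEY")
)

def split_into_segments(transcript, size=60):
    # Naive chunker for illustration: fixed word windows. Ours segments on pauses.
    words = transcript.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def analyze_pitch(audio_path):
    # 1. Transcribe, keeping word timestamps so segments can be timed later.
    with open(audio_path, "rb") as audio:
        stt_result = stt.recognize(
            audio=audio, content_type="audio/mp3", timestamps=True
        ).get_result()
    transcript = " ".join(
        res["alternatives"][0]["transcript"] for res in stt_result["results"]
    )

    # 2. Run NLU per segment: entities, keywords, sentiment, categories.
    segment_analyses = [
        nlu.analyze(
            text=seg,
            features=Features(
                entities=EntitiesOptions(),
                keywords=KeywordsOptions(),
                sentiment=SentimentOptions(),
                categories=CategoriesOptions(),
            ),
        ).get_result()
        for seg in split_into_segments(transcript)
    ]

    # 3. Tone over the full document.
    tones = tone.tone(
        {"text": transcript}, content_type="application/json"
    ).get_result()
    return transcript, segment_analyses, tones
```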

The output is a report. Structure score, delivery score, clarity score, and a list of flagged items with timestamps.
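For concreteness, the report is shaped roughly like this. The field names are illustrative, not our exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class FlaggedItem:
    timestamp: float   # seconds into the pitch
    category: str      # e.g. "filler", "pacing", "missing_element"
    note: str

@dataclass
class PitchReport:
    structure_score: float  # were the expected elements present, and in order?
    delivery_score: float   # pacing and filler density
    clarity_score: float    # reviewer-calibrated readability of the pitch
    flags: list[FlaggedItem] = field(default_factory=list)
```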

What correlates with outcomes

We've been running a manual review layer alongside the automated one: human reviewers who have done investor relations work or been founders themselves. We compare their notes to the algorithm's output.

The algorithm is genuinely good at detecting whether the problem statement comes early. If a founder spends the first two minutes on the company history and doesn't get to the problem until minute three, we catch that. Every time. The structural sequencing detection works.
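The check itself is simple. A sketch, where the marker phrases and the two-minute threshold stand in for what we actually tune:

```python
# Flag pitches where the problem statement arrives late.
# Marker phrases and the two-minute threshold are illustrative stand-ins.
PROBLEM_MARKERS = ("the problem", "pain point", "struggle with")

def problem_statement_start(timed_segments, markers=PROBLEM_MARKERS):
    """timed_segments: list of (start_seconds, text) tuples from STT timestamps."""
    for start, text in timed_segments:
        if any(m in text.lower() for m in markers):
            return start
    return None

def flag_late_problem(timed_segments, threshold_seconds=120):
    start = problem_statement_start(timed_segments)
    if start is None:
        return "no problem statement detected"
    if start > threshold_seconds:
        return f"problem statement arrives at {start:.0f}s, too late"
    return None
```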

Pacing correlates weakly but consistently with reviewer scores. Founders who speak faster than 180 words per minute tend to get lower scores on clarity, independent of content. The algorithm catches this. The reviewers feel it without measuring it.
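Pacing falls straight out of the STT word timestamps. A sketch:

```python
def words_per_minute(timed_words):
    """timed_words: list of (word, start_s, end_s) tuples from STT timestamps."""
    if not timed_words:
        return 0.0
    duration_min = (timed_words[-1][2] - timed_words[0][1]) / 60.0
    return len(timed_words) / duration_min if duration_min > 0 else 0.0

# Flag anything over the 180 wpm line that tracks with lower clarity scores.
def pacing_flag(timed_words, threshold=180):
    wpm = words_per_minute(timed_words)
    return f"{wpm:.0f} wpm, consider slowing down" if wpm > threshold else None
```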

Filler word density is a reasonable proxy for preparation level. Founders who have practiced come in under 3% filler density. First-timers run 8-12%. This doesn't say anything about the quality of the company. It says something about whether they've rehearsed.
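Density is just a token count over the transcript. A sketch with an illustrative filler list; ours is longer and tuned to transcription quirks:

```python
# Filler density as a rough preparation proxy. The filler set is illustrative.
FILLERS = {"um", "uh", "uhm", "er", "basically", "actually"}

def filler_density(transcript):
    words = [w.strip(".,!?") for w in transcript.lower().split()]
    fillers = sum(1 for w in words if w in FILLERS)
    return fillers / len(words) if words else 0.0

# Under 0.03 usually means rehearsed; 0.08 to 0.12 is typical for first-timers.
```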

What doesn't correlate

Sentiment. The tone of the pitch, as measured by NLU, does not predict reviewer scores. Confident, enthusiastic, optimistic founders score no better than flat, dry, technical ones. The reviewers are reading something the algorithm isn't. They're reading whether the person knows what they're talking about, whether the numbers make sense, whether there's something specific behind the generic claims. The algorithm has no access to any of that.

The Watson Tone Analyzer's "confident" score, specifically, is almost useless for this use case. It picks up on linguistic markers of certainty: "will" versus "might," definitive statements versus hedged ones. Good founders hedge appropriately. They say "we think" when they don't know. The algorithm reads this as tentativeness. The reviewers read it as intellectual honesty.
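To make the failure mode concrete, here is a caricature of marker-based certainty scoring. This is not Watson's actual model, but it reproduces the behavior we see:

```python
# Caricature of marker-based certainty scoring, not Watson's actual model.
# Appropriate hedging drags the score down exactly as described above.
CERTAIN = {"will", "definitely", "clearly", "always"}
HEDGED = {"might", "may", "think", "believe", "probably"}

def certainty_score(text):
    words = text.lower().split()
    return sum(w in CERTAIN for w in words) - sum(w in HEDGED for w in words)

certainty_score("We will capture this market.")             #  1, reads "confident"
certainty_score("We think churn is the risk; it might be")  # -2, reads "tentative"
```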

Market size mentions are easy to detect and almost meaningless. Every pitch mentions TAM. Detecting that someone said a number followed by "billion" doesn't tell you whether they understand their market.
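The detection really is that shallow:

```python
import re

# Detecting a TAM mention is trivial, which is exactly why it's meaningless.
TAM_PATTERN = re.compile(r"\$?\d[\d,.]*\s*(billion|bn)\b", re.IGNORECASE)

def mentions_market_size(transcript):
    return bool(TAM_PATTERN.search(transcript))

mentions_market_size("It's a $40 billion market")  # True, and tells you nothing
```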

The thing that surprised me

Specificity is the strongest signal and we didn't build a model for it. Reviewers consistently rate pitches higher when the founder uses specific numbers, specific customer names, specific outcomes. "We have 40 paying customers, three of whom are in logistics" versus "we have strong early traction in logistics." The second one scores fine on sentiment and reasonably on structure, but the reviewers tank it. We're working on a specificity detector now. It's harder than it sounds, because you need to distinguish genuine specificity from specificity theater, where founders have learned to front-load numbers without those numbers meaning anything.
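The naive version is a token counter. This sketch is roughly the obvious starting point, and it's exactly the version specificity theater defeats:

```python
import re

# Naive specificity heuristic: count concrete tokens in a sentence.
# Specificity theater defeats this: front-loaded numbers score high
# whether or not the numbers mean anything.
NUMBER = re.compile(r"\b\d[\d,.]*%?\b")
PROPER_NOUN = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")

def specificity_score(sentence):
    numbers = len(NUMBER.findall(sentence))
    # Rough: discount one match for the sentence-initial capital.
    names = len(PROPER_NOUN.findall(sentence)) - 1
    return numbers + max(names, 0)

specificity_score("We have 40 paying customers, three of whom are in logistics.")  # 1
specificity_score("We have strong early traction in logistics.")                   # 0
```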

Where this leaves us

The algorithm is a useful filter for basics. Structure, pacing, preparation level. If you score low on structure, there's something concrete to fix. That's valuable feedback, especially for first-time founders who have never received systematic input on their pitch.

The algorithm cannot evaluate whether the company is good. It cannot tell you if the founder is the right person. It cannot tell you if the market analysis is credible. Those require a human who knows the domain, and probably a conversation.

We knew this going in. What I didn't expect is how visible the ceiling is. After about 200 pitches you develop an intuition for the cases where the algorithm gives a high score but you know immediately, from reading the transcript, that the company is weak. The gap is consistent and predictable. I'm not sure it's closeable with more NLP. I think it might require a different kind of model entirely, one trained on actual investor outcomes rather than linguistic features.

Nobody has that dataset. Maybe we'll build it.

With gusto, Fatih.