AI-assisted sprint planning: the actual workflow

Not a concept post. Here's exactly what we do, what the AI produces, and what we do with it.

The input

At the start of sprint planning, we feed the model three things: the product backlog items prioritized for this sprint, the sprint goal drafted by the product owner, and the last three sprint retrospective summaries. Total context: usually 4,000-8,000 tokens. We use Claude 2 for this because the context window handles larger backlogs without truncation.
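A minimal sketch of what assembling that context might look like. The names (`Item`, `build_context`) and the rough four-characters-per-token heuristic are my assumptions for illustration, not the team's actual pipeline:

```python
from dataclasses import dataclass

@dataclass
class Item:
    title: str
    description: str
    acceptance_criteria: str

def build_context(items: list[Item], sprint_goal: str, retros: list[str]) -> str:
    """Concatenate the three inputs into one context block for the model."""
    backlog = "\n\n".join(
        f"- {i.title}\n  {i.description}\n  Acceptance: {i.acceptance_criteria}"
        for i in items
    )
    retro_block = "\n\n".join(
        f"Retro {n}:\n{summary}" for n, summary in enumerate(retros, 1)
    )
    return (
        f"SPRINT GOAL\n{sprint_goal}\n\n"
        f"PRIORITIZED BACKLOG\n{backlog}\n\n"
        f"LAST THREE RETROSPECTIVES\n{retro_block}"
    )

def rough_token_count(text: str) -> int:
    # Crude sanity check against the 4,000-8,000 token budget:
    # roughly 4 characters per token for English prose.
    return len(text) // 4
```

The token count is only a pre-flight check; the real count comes from the model's tokenizer.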

The prompt

We ask for four things: a check on whether the sprint goal is achievable given the backlog items, identification of dependencies between items that might affect sequencing, a list of risks or unknowns that should be discussed before committing, and any backlog items that seem under-specified for estimation.
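The exact wording the team uses isn't published, so here is an illustrative version of a four-part prompt template along those lines:

```python
# Hypothetical prompt template covering the four asks; wording is assumed.
PLANNING_PROMPT = """You are reviewing a sprint plan. Given the sprint goal,
the prioritized backlog, and the last three retrospective summaries below,
answer four questions. Do NOT provide effort estimates.

1. Is the sprint goal achievable with these backlog items? Explain briefly.
2. Which items depend on each other (shared code areas, shared assumptions,
   ordering constraints) in ways that affect sequencing?
3. What risks or unknowns should the team discuss before committing?
4. Which items are too under-specified to estimate (missing metrics,
   ambiguous acceptance criteria, unaccounted external dependencies)?

{context}
"""

def build_prompt(context: str) -> str:
    return PLANNING_PROMPT.format(context=context)
```

Note the explicit "do NOT provide effort estimates" instruction; without it, most models volunteer estimates anyway.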

We don't ask for estimates. The model doesn't know the team's velocity, the codebase, or the specific technical constraints. Asking for estimates produces numbers that look authoritative and aren't. We ask for flags, not forecasts.

What comes back

Consistently useful: the dependency check. The model identifies when two backlog items touch the same code area or share an assumption about another item's completion. Human planners miss these in long backlogs. The model doesn't.

Consistently useful: the under-specification flags. Items that say "improve performance" without a metric, items with ambiguous acceptance criteria, items that reference external systems or dependencies without accounting for them. These are the items that blow up mid-sprint. Surfacing them in planning is worth the time.

Inconsistently useful: the risk identification. Sometimes it catches real risks from the retrospective patterns. Sometimes it produces generic risks that apply to any sprint ("ensure testing is thorough", "monitor for scope creep"). The specific risks are useful. The generic ones are noise. We've tuned the prompt to ask for risks with evidence from the retrospectives rather than general risks, which has improved the signal-to-noise ratio.
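As a sketch of that tuning, here is a hypothetical before/after of the risk question. The exact phrasing is my assumption; the point is requiring each risk to cite retrospective evidence:

```python
# Before: invites generic boilerplate risks.
GENERIC_RISK_ASK = "What risks should the team discuss before committing?"

# After: a risk only counts if it is backed by a retrospective quote.
EVIDENCED_RISK_ASK = (
    "List risks for this sprint, but only include a risk if you can quote "
    "the specific retrospective line that supports it. Format each as:\n"
    "  RISK: <one sentence>\n"
    "  EVIDENCE: <verbatim quote from a retrospective>\n"
    "If no retrospective evidence supports a risk, omit it."
)
```

The evidence requirement gives the team something checkable: a risk with a fabricated or irrelevant quote is easy to spot and discard.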

Not useful: anything about the sprint goal alignment. The model doesn't have enough context about the product strategy to evaluate whether the sprint goal is actually right. It evaluates whether the goal sounds coherent, not whether it's the right thing to be doing. We stopped asking for this.

The human layer

The model output is one input to the planning meeting. It goes out to the team before the meeting with a note that it's a discussion prompt, not a recommendation. The team uses it to structure discussion of risks and dependencies. They add the context the model doesn't have.

The meeting is shorter. The items we discuss are more focused. The surprises mid-sprint are fewer. That's the outcome we care about.

What it doesn't replace

The judgment call on priority. The model can flag that item A depends on item B. It can't tell you whether item A is the right thing to build at all. That requires understanding the product strategy, the customer context, and the competitive situation. That remains human.

The estimation. We tried asking for confidence ratings on estimates. The model produced numbers that the team second-guessed and then ignored. The estimation meeting is faster with the dependency and specification flags surfaced in advance, but the estimates themselves come from the people who will do the work.

The honest cost-benefit

Thirty minutes to set up and maintain the prompt and pipeline. Five minutes to run before each sprint. The time saved in the planning meeting and the reduction in mid-sprint surprises covers the overhead easily. This is one of the cleaner AI productivity wins we've found.

With gusto, Fatih.