Defining success for probabilistic features

Probabilistic features fail when PMs ship them with deterministic metrics. Strong interview candidates show a different toolkit.

Probabilistic features behave nothing like a checkout button. A search ranker or a fraud detector can produce different outputs for the same input over time. PM interviewers ask about this because most candidates default to deterministic thinking. Many name click-through rate (CTR) or conversion as the success metric and stop there.

Strong PM candidates take the next step. A good answer covers the error modes and connects model metrics to user outcomes.

Start with the cost of being wrong

Every probabilistic feature that makes a yes-or-no call has two failure modes. A false positive happens when the model says yes when the answer should have been no. The reverse case is a false negative: the model says no when the answer should have been yes. Each error type has a different cost.

Take a fraud detection system. A false positive blocks a legitimate purchase and frustrates a paying customer. The opposite mistake, a false negative, lets a stolen card through and costs the company money. Which one hurts more depends on the business. Marty Cagan writes that product teams must understand the risks before they build the solution (Cagan).

Before picking a metric, ask the interviewer two questions. What happens when we falsely flag something? What happens when we miss something? The answers drive every other decision.
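One way to make those two answers concrete is to put a price on each error type and add up the bill. The sketch below is illustrative only: the function name, the error counts, and the dollar figures are assumptions, not numbers from a real fraud system.

```python
def expected_error_cost(false_positives, false_negatives, cost_per_fp, cost_per_fn):
    """Total cost of a model's mistakes, given a price for each error type."""
    return false_positives * cost_per_fp + false_negatives * cost_per_fn


# Hypothetical week for a fraud detector:
# 40 legitimate purchases blocked, assumed to cost $15 each in lost goodwill,
# 5 stolen cards let through, assumed to cost $200 each in chargebacks.
weekly_cost = expected_error_cost(
    false_positives=40,
    false_negatives=5,
    cost_per_fp=15.0,
    cost_per_fn=200.0,
)
print(weekly_cost)  # 1600.0
```

With these assumed prices, five misses cost more than forty false alarms, which is the kind of asymmetry the two questions are meant to surface.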

Precision and recall, in plain terms

Precision answers a simple question. Of all the items the model flagged, how many were true positives? If a spam filter has 95 percent precision, 95 of every 100 emails it sent to spam really are spam.

Recall answers a different question. What share of real positives ended up in the model's flagged set? A spam filter with 60 percent recall catches 60 of every 100 real spam emails.
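Both definitions reduce to simple ratios over counts. Here is a minimal sketch using the spam numbers above. The function names are illustrative, and the two calls describe two separate hypothetical filters, as in the text.

```python
def precision(true_positives, false_positives):
    """Of everything the model flagged, the share that was really positive."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0


def recall(true_positives, false_negatives):
    """Of everything that was really positive, the share the model flagged."""
    actual_positives = true_positives + false_negatives
    return true_positives / actual_positives if actual_positives else 0.0


# Filter A sent 100 emails to spam and 95 of them were spam.
print(precision(true_positives=95, false_positives=5))  # 0.95

# Filter B faced 100 real spam emails and caught 60 of them.
print(recall(true_positives=60, false_negatives=40))  # 0.6
```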

Optimize for precision when false positives carry a high cost. Content moderation at scale leans this way. Wrongly removing a creator's post damages trust. Optimize for recall when false negatives carry a high cost. Cancer screening leans this way. Missing a true case has worse consequences than ordering an unnecessary follow-up test.

The trade-off is a real constraint. Pushing precision up, typically by raising the score threshold at which the model flags an item, usually drops recall. PMs who skip this nuance get sorted into the "shallow on AI" bucket.
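The trade-off is easiest to see by moving the decision threshold on a small scored dataset. The scores and labels below are invented for illustration, and the helper name is an assumption. A real evaluation would use held-out production data.

```python
def precision_recall_at(scores, labels, threshold):
    """Flag every item scoring at or above the threshold, then measure both metrics."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    tp = sum(flagged)              # flagged items that are truly positive
    fp = len(flagged) - tp         # flagged items that are truly negative
    fn = sum(labels) - tp          # true positives the model did not flag
    precision = tp / (tp + fp) if flagged else 0.0  # 0.0 by convention when nothing is flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: model scores and ground truth (1 = positive, 0 = negative).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

for threshold in (0.2, 0.5, 0.85):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
# threshold=0.20 precision=0.71 recall=1.00
# threshold=0.50 precision=0.80 recall=0.80
# threshold=0.85 precision=1.00 recall=0.40
```

On this toy data, each step up in the threshold buys precision and costs recall.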

When F1 helps and when it lies

F1 is the harmonic mean of precision and recall. It gives one number when you care about both metrics. Many PMs cite F1 because it has the look of rigor.

The metric carries a hidden assumption. Precision and recall get equal weight in the formula. That assumption rarely holds in real products. A better alternative is the F-beta score. F-beta lets you set the weight: a beta above 1 favors recall, and a beta below 1 favors precision. For a fraud system that cares twice as much about catching fraud as about avoiding blocked good purchases, F2, which treats recall as twice as important as precision, is more honest than F1.
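The standard formula is F-beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The sketch below applies it to two hypothetical fraud models with mirrored strengths. The model figures are made up for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


# Hypothetical model A: cautious, high precision, low recall.
# Hypothetical model B: aggressive, low precision, high recall.
model_a = {"precision": 0.9, "recall": 0.5}
model_b = {"precision": 0.5, "recall": 0.9}

for name, m in (("A", model_a), ("B", model_b)):
    f1 = f_beta(m["precision"], m["recall"], beta=1.0)
    f2 = f_beta(m["precision"], m["recall"], beta=2.0)
    print(f"model {name}: F1={f1:.3f} F2={f2:.3f}")
# model A: F1=0.643 F2=0.549
# model B: F1=0.643 F2=0.776
```

F1 cannot tell the two models apart. F2 prefers model B, which matches a business that would rather catch fraud than avoid every false alarm.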

Mention F1 first if the interviewer brings up metrics. Then propose F-beta with the reasoning. That move signals depth.

Task completion rate is the user-facing metric

Model metrics measure the model. Product metrics measure the user. Both matter, but the second one wins debates with leadership.

Task completion rate measures whether the user achieved the goal. For a search feature, did the user click a result and stay on the page? In a writing assistant, did the user accept the suggestion or rewrite the text? Lenny Rachitsky has written that the strongest product teams pick one metric that reflects user value and rally the team around it (Rachitsky). Task completion rate forces that discipline.
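For the search example, completion can be approximated from session logs. This is a sketch under stated assumptions: the field names `clicked_result` and `dwell_seconds` are hypothetical, and the 30-second dwell cutoff is an arbitrary stand-in for "stayed on the page" that a real team would need to validate.

```python
def task_completion_rate(sessions, min_dwell_seconds=30):
    """Share of sessions where the user clicked a result and stayed on it."""
    if not sessions:
        return 0.0
    completed = sum(
        1
        for s in sessions
        if s["clicked_result"] and s["dwell_seconds"] >= min_dwell_seconds
    )
    return completed / len(sessions)


# Invented sample of four search sessions.
sessions = [
    {"clicked_result": True, "dwell_seconds": 45},    # clicked and stayed
    {"clicked_result": True, "dwell_seconds": 5},     # clicked, bounced back
    {"clicked_result": False, "dwell_seconds": 0},    # never clicked
    {"clicked_result": True, "dwell_seconds": 120},   # clicked and stayed
]
print(task_completion_rate(sessions))  # 0.5
```

Three of the four sessions count as clicks, so CTR alone would report 75 percent while only half the users finished the task.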

A model can have great F1 and a terrible completion rate. That gap is where most AI products die in production.

In an interview, pair a model metric with a user metric. Precision plus completion rate. Recall plus retention. The pairing shows you can connect math to behavior.

Human-in-the-loop metrics matter more than people think

Most production AI systems include a human reviewer at some point. Reviewers approve flagged content. Customer service agents step in when the bot hits its limit. Four metrics matter for these systems.

Override rate captures how often humans reverse the model's decision. A high override rate means one of two things. Either the model is making too many wrong calls, or the human team has lost trust in the system.

Time saved per task captures whether the model cut review time or just shifted the work. If a moderator now reviews 200 items in the time it used to take to review 100, the system is doing its job.

Escalation rate captures the share of cases that need human help. Track it over time. A rising rate signals model regression or input drift.

Confidence calibration measures how well the model's stated probabilities match reality. When the model says it is 90 percent confident, about 90 percent of those predictions should turn out to be right.
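Three of these metrics can be computed from a basic review log. The sketch below is a rough illustration: the record fields (`model_decision`, `human_decision`) and all sample data are hypothetical, and confidences are grouped by rounding to one decimal place, which is a simplification of the binning a production calibration check would use.

```python
from collections import defaultdict


def override_rate(reviews):
    """Share of human-reviewed cases where the reviewer reversed the model."""
    if not reviews:
        return 0.0
    overridden = sum(1 for r in reviews if r["human_decision"] != r["model_decision"])
    return overridden / len(reviews)


def escalation_rate(total_cases, escalated_cases):
    """Share of all cases that were handed to a human."""
    return escalated_cases / total_cases if total_cases else 0.0


def calibration_table(predictions):
    """Map each stated confidence (rounded to 0.1) to the observed accuracy."""
    outcomes = defaultdict(list)
    for confidence, was_correct in predictions:
        outcomes[round(confidence, 1)].append(was_correct)
    return {bucket: sum(hits) / len(hits) for bucket, hits in outcomes.items()}


# Invented review log: the reviewer reversed one of four model decisions.
reviews = [
    {"model_decision": "remove", "human_decision": "remove"},
    {"model_decision": "remove", "human_decision": "keep"},
    {"model_decision": "keep", "human_decision": "keep"},
    {"model_decision": "remove", "human_decision": "remove"},
]
print(override_rate(reviews))       # 0.25
print(escalation_rate(1000, 80))    # 0.08

# Invented predictions as (stated confidence, was the prediction correct).
predictions = (
    [(0.9, True)] * 9 + [(0.9, False)]
    + [(0.6, True)] * 3 + [(0.6, False)] * 7
)
print(calibration_table(predictions))  # {0.9: 0.9, 0.6: 0.3}
```

In this sample the model is well calibrated at 90 percent confidence and overconfident at 60 percent, where only 30 percent of predictions were right.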

Jeff Gothelf points out that outcomes beat outputs, and the human side of the system is part of the outcome (Gothelf).

How to structure your interview answer

When you get a question like "How would you measure success for this AI feature?", use a four-part structure.

First, name the failure modes and the cost of each error type.

Second, pick the model metric that matches the cost asymmetry. F-beta with a stated weight is often more honest than F1.

Third, propose a user-facing metric like task completion rate or retention.

Fourth, add a human-in-the-loop metric if the system has reviewers.

That structure works for fraud detection, search, recommendations, chatbots, content moderation, and most generative features. Practice it on three different products before the interview.

Works cited

Cagan, Marty. Inspired: How to Create Tech Products Customers Love. Wiley, 2017.

Gothelf, Jeff, and Josh Seiden. Lean UX: Designing Great Products with Agile Teams. O'Reilly Media, 2021.

Rachitsky, Lenny. "Choosing Your North Star Metric." Lenny's Newsletter, 2022, www.lennysnewsletter.com.
