The Definitive Guide to Designing Product Metrics
Above: The results page of the experimentation platform Optimizely. It shows an experiment’s delta on different product and funnel metrics. Companies may have analogous internal tools for experimentation and metric analysis.
Background
Reading the news and watching movies like The Social Dilemma, there seems to be a broad desire from both the technology industry and society to move past using engagement metrics alone as the measure of success. That said, I’ve encountered surprisingly few approachable online resources that discuss how to design a good metric, which is striking since metric design is one of the core responsibilities of data science teams! This guide is meant to be a step forward in filling that gap.
How do I use this guide? Depending on your needs you may want to skip ahead to different parts of this doc:
- 💻 To prepare for interview questions related to metric design, see: Proposing a new metric: Interview Frameworks and Goodhart’s Law 📝
- 🤔 If you have a new metric sketched out but want thoughts on validation, see: Validating a new metric with Experiments and Analysis 📈
- 📊 If your team already has metrics, and you want to think more about reporting best practices, see: Monitoring and reporting a metric 🗓️📊
- 🌐 If you’d like a holistic understanding of product metrics ... just continue reading below 👇
Two buckets of metrics: Precision and Recall
Before we design new metrics, we should understand what existing metrics already measure. Personally, I have found it helpful to bucket metrics into two broad categories: precision and recall.
Analysts can classify existing metrics into these two buckets and find measurement gaps for new metrics to address. Or, this framework can help contextualize proposed new metrics among existing ones. The first recall metric might be more impactful than the tenth precision metric and vice versa.
Precision Metrics 🎯
Precision metrics measure usage and feedback on the current iteration of the product. These metrics are usually derived from product logging. Teams use them to measure growth and optimize features. In fact, I would say the vast majority of metrics that analysts work with and design are precision-like in nature.
Examples include:
- DAU, MAU and other usage metrics: Understanding the total usage and engagement of products and their features is the core deliverable of product analytics teams
- Read more: See the non-revenue metrics in the “Advertising” section of Y Combinator’s Key Metrics guide
- CSAT and other product-focused survey metrics: CSAT-like metrics (rating features or products on a scale of 1 to 5, asking why via free-form responses) focus on collecting feedback on the current state of the product (a minimal calculation sketch follows this list)
- Read more: This GetFeedback article outlines two ways to calculate CSAT and different industry benchmarks
- Latency metrics: Latency metrics measure product load times and infrastructure performance
- Read more: Latency metrics are not usually handled by product analyst teams, but very mature products may want to understand the relationship between latency and product growth and satisfaction. This article by Treynor Sloss, Nukala and Rau on how Google thinks about infrastructure metrics may offer some ideas in this area
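As a rough illustration of the CSAT bullet above, here is a minimal sketch assuming CSAT is reported as the share of respondents giving the top two ratings on a 1-5 scale, with the mean rating as an alternative view. The responses are invented for illustration.

```python
# Minimal CSAT sketch: percent of respondents answering 4 or 5 on a 1-5 scale.
# The survey responses below are made up for illustration.
responses = [5, 4, 3, 5, 2, 4, 4, 1, 5, 3]

satisfied = sum(1 for r in responses if r >= 4)
csat_pct = 100 * satisfied / len(responses)   # "top-two-box" CSAT
avg_score = sum(responses) / len(responses)   # alternative: report the mean rating

print(f"CSAT: {csat_pct:.0f}% satisfied, average rating {avg_score:.1f}/5")
```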
Recall Metrics 📚
Recall metrics track product performance against a ground truth. If precision metrics measure growth and optimize existing features, recall metrics help measure product quality and drive new feature development. Unlike precision metrics, recall metrics may require more than product logging to measure: user surveys or data labeling by teams of humans may serve as their data inputs. Sometimes, User Experience Research teams will own survey-based recall metrics instead of data science teams, which tend to be more focused on logging-based metrics.
Examples include:
- Net Promoter Score (NPS): Arguably the most famous recall metric is NPS, since customer loyalty to a product is a function of the possible alternatives
- Read more: This Qualtrics overview includes tips on NPS follow-up questions in addition to the metric’s calculation
- Recall for recommender or search systems: Recall metrics measure if the end user’s intent was actually fulfilled by a recommendation or search system's results
- Read more: As slide 21 of this Intro to ML lecture shows, you can use actual user likes/clicks as the ground truth for recall grading
- Note: This approach may inflate recall scores since it excludes potential likes that weren’t even available as options on the product. For example, a recall metric should ideally penalize an e-commerce product’s recommendations if a user searches for a product and doesn’t see it listed at all 🤯. Teams can produce higher-coverage recall metrics by having humans manually score samples of recommender system output based on user intent. Analyst input or guidance can help reduce bias in these manual labeling processes (see the sketch after this list)
- Competitive analysis metrics: For products with strong competitors, analysts can design product preference, quality, or task completion metrics to compare products over time
- Read more: This blog post describes how Loup Ventures asked Siri, Google Assistant, and Alexa the same 800 queries and graded each based on response correctness and query understanding. They repeated the exercise three months later to see how much each product had improved on the same set of queries
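As a concrete illustration of the recall bullet above, here is a minimal sketch that treats a user’s likes/clicks as the ground truth, with the coverage caveat from the note called out in the comments. All item IDs and data are made up.

```python
# Minimal recall sketch for a recommender, using items the user actually
# liked/clicked as ground truth. All IDs below are made up for illustration.
recommended = ["item_a", "item_b", "item_c", "item_d"]   # what the system showed
liked = {"item_b", "item_d", "item_e"}                   # the ground-truth set of liked items

hits = sum(1 for item in recommended if item in liked)
recall = hits / len(liked)   # 2 / 3 ≈ 0.67

# Caveat from the note above: if the liked set is built only from on-product
# clicks, it can never contain items the product failed to surface at all,
# which inflates recall. Human-labeled samples of user intent can serve as a
# higher-coverage ground truth.
print(f"recall = {recall:.2f}")
```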
The Metric Lifecycle
Now that we’ve established what kind of metrics exist, let’s dive into the process of creating, reporting, and possibly sunsetting a metric.
Where do new metrics come from? 👶
Broadly, shifts in product direction and market, product, or customer maturity can drive the need for new metrics. More specifically, new metrics can come from:
- Establishing the user funnel or evolving the existing one: New products will derive their initial set of metrics from their user acquisition and engagement funnels (a toy conversion-rate sketch follows this list)
- Read more: The best tutorial I’ve seen for designing a user engagement funnel and its associated metrics is Lesson 3 from Udacity’s course on A/B testing
- Current metrics don’t respond to feature launches: New features may stop showing statistical and practical improvements on a product’s funnel metrics. This might happen when a product saturates a market, or when the product becomes sufficiently complex. Analyst teams in this situation may need to design new, more actionable metrics
- Read more: In the Causal Proximity section of Designing and Evaluating Metrics, Sean Taylor describes how product and engineering teams should be able to impact the drivers of a metric through feature launches. Metrics lacking this property aren’t actionable
- User complaints or notable losses: Repeated negative user or customer feedback as well as press and market analyst commentary may drive the creation of quality-focused metrics
- Read more: As described in this New America case study (see the text around footnotes 82 - 89), starting around 2016 YouTube began measuring video satisfaction via user surveys to better optimize its recommendations around user happiness and satisfaction instead of watch time
- Directives from Leadership + Annual Planning: On a more practical note, analyst teams often invest in new metrics during annual planning to align metrics with the product or company strategy for the following year
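To make the funnel bullet above concrete, here is a toy sketch of deriving metrics from an acquisition and engagement funnel: step-to-step and overall conversion rates computed from stage counts. The stage names and counts are hypothetical.

```python
# Toy sketch of funnel metrics: step-to-step conversion rates from user counts.
# Stage names and counts are hypothetical.
funnel = [
    ("visited_homepage", 10_000),
    ("created_account", 2_500),
    ("completed_onboarding", 1_500),
    ("first_purchase", 600),
]

for (prev_stage, prev_n), (stage, n) in zip(funnel, funnel[1:]):
    print(f"{prev_stage} -> {stage}: {100 * n / prev_n:.1f}% conversion")

overall = 100 * funnel[-1][1] / funnel[0][1]
print(f"overall funnel conversion: {overall:.1f}%")
```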
Proposing a new metric: Interview Frameworks and Goodhart’s Law 📝
Once analyst teams establish the need for a new metric, analysts begin metric design work. Data science teams expect these analysts to reason through the potential second-order effects of using a proposed metric in experiments and performance tracking.
Data scientist interviews often ask case study questions around this reasoning process. In my experience, the interviewer will either ask the candidate to design a metric or will propose a metric and ask the candidate to evaluate it. Here is a framework to approach these questions:
- 🤔 Ask clarifying questions to understand data inputs: Make sure you and the interviewer align on what user actions or other data inputs could inform the metric. For example, on a social media product, scrolling, likes, link sharing, status posting, messages sent, etc. could all inform a success metric focused on engagement
- Tip: Since data inputs will be product specific, I would recommend studying the product your interviewer will ask about and creating a cheatsheet about the metrics you could envision teams at the company using
- For example, annotating "What are the most important ride sharing metrics for a company like Lyft or Uber?" and its related questions could help familiarize yourself with ride-sharing metrics and their inputs
- Actually trying out a product and understanding its mechanics and features also helps -- a surprising number of candidates don't do this!
- 🤝 Align on what behaviors or properties the metric should measure: Repeat back the question to the interviewer and ask about edge cases. For example, if an interviewer at an e-commerce platform asks to define a metric classifying successful customer accounts, it may be worth asking if the team expects this metric to classify an account with high transaction volume but low and declining NPS as successful
- 🧑‍🎓 Propose a metric: My advice is to err towards simplicity and let follow-on questions help decide if you need to make your metric more complex
- Tip: If it’s a top-line metric, interviewers may follow up by asking what time granularity your metric should use. For example, should active users be counted on a daily, weekly, or monthly time scale? (A sketch after this list shows how each granularity could be computed)
- ⬇️ Discuss what behaviors the metric will discourage: What happens if users only end up doing what the metric measures and stop other behaviors on the product? Answering this hypothetical question teases out undesired second-order effects of a metric
- Example: You propose scrolling as the success metric for a feed-based social media product. If users ended up only scrolling on the product, this would zero out usage on actions that reduce scrolling like posting, commenting, and clicking on links. Is the team alright with incentivizing this outcome?
- 🚧 Discuss how product teams could artificially increase (“hack”) the metric: Goodhart’s Law states that when people know their performance is judged by a metric, they adjust their behavior to optimize that metric. In the context of technology products, product managers, designers, and engineers may start altering the product to increase a success metric despite negative trade-offs
- Example: You propose scrolling as the success metric for a feed-based social media product. The metric incentivizes designers to make content blocks long or default text very large to force users to scroll more. It also incentivizes product managers and engineers to build features and algorithms that help content farms launch and propagate their content so that users have more items to scroll through. Together these changes decrease the quality of the product while increasing the amount of scrolling users do
- 🏆🙅‍♂️ Discuss which relevant properties the metric doesn’t measure: Unfortunately, simple and clear metrics will not measure every relevant property or user behavior at once. You should outline to your interviewer which relevant properties your metric does not measure, and argue either that your proposed metric correlates with those properties or that a separate metric is needed for them
- Example: If an interviewer asked you to choose a single success metric for a payments platform, you could argue that transactions completed works best and is correlated with other important metrics like CSAT and total payments processed, even though it does not directly measure those properties
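As promised in the “Propose a metric” tip above, here is a minimal sketch of counting active users at daily, weekly, and monthly granularity from an event log. The events and user IDs are made up for illustration.

```python
# Sketch: counting active users at different time granularities from an
# event log. The events below are made up for illustration.
from datetime import date

events = [  # (user_id, activity_date)
    ("u1", date(2023, 3, 1)), ("u2", date(2023, 3, 1)),
    ("u1", date(2023, 3, 2)), ("u3", date(2023, 3, 8)),
    ("u1", date(2023, 3, 20)),
]

def active_users(events, keyfunc):
    """Distinct users per period, where keyfunc maps a date to a period key."""
    periods = {}
    for user, day in events:
        periods.setdefault(keyfunc(day), set()).add(user)
    return {period: len(users) for period, users in sorted(periods.items())}

dau = active_users(events, lambda d: d)                     # daily
wau = active_users(events, lambda d: d.isocalendar()[:2])   # ISO year + week
mau = active_users(events, lambda d: (d.year, d.month))     # monthly

print(dau, wau, mau, sep="\n")
```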
This framework encompasses what I have experienced in metric design interviews. In the following sections, I discuss more about empirically validating a proposed metric when on an analyst team and obtaining stakeholder buy-in.
Validating a new metric with Experiments and Analysis 📈
After proposing a metric, the next step is to complete data and experimental analyses demonstrating that the metric behaves as expected and is actionable. A few steps help validate a metric:
- (If relevant) Show the metric’s distribution at different threshold values: If a metric is threshold based, show how different possible threshold values change the metric’s distribution
- Example: What percent of users would a churn metric classify as churned if we set the churn threshold at 7, 14, 21, or 28 days? Show the distribution at each value as part of your explanation for choosing a particular one (see the sketch after this list)
- Correlation analysis with relevant existing metrics: Demonstrating correlations is especially useful for new metrics that are quality-focused or refinements of existing metrics
- Example: On a ride-sharing app, I would expect user satisfaction to decrease or level off as the number of stops added during a ride increases. If the new satisfaction metric instead increases as the stops increase, then that might suggest a logging issue or warrant a separate investigation into understanding this user behavior since it is counterintuitive
- Precision/Recall to ground truth: Analysts can make metrics that describe certain actions as “good”, “bad”, or “quality” more meaningful if they use surveys or user studies as ground truth to validate those labels
- Example: Based on a data analysis of logs, a product analyst on a business application might propose defining a “good workflow completion” as one with 3 or fewer clicks. To convincingly make the case that 3 or fewer clicks is “good”, the analyst could also collect user survey responses on satisfaction for different workflow lengths and measure the proposed metric’s precision and recall relative to the survey responses
- Sensitivity analysis through experiments: If new features that product teams believe will move a metric consistently do not change it, then the metric may not be actionable as designed. The “Validation” and “Experiment” bullet points of the Lifecycle of a Metric section of Designing and Evaluating Metrics describe how analysts can use saved historical experiment data to show whether a new metric has a practical and statistically significant effect that an experiment can measure
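Here is the threshold-distribution sketch referenced in the first bullet above: the share of users a churn metric would classify as churned at thresholds of 7, 14, 21, and 28 days of inactivity. The DataFrame values are hypothetical; in practice they would come from product logs.

```python
# Sketch: share of users a churn metric would classify as churned at
# different inactivity thresholds. The days_since_last_active values
# are hypothetical stand-ins for values derived from product logs.
import pandas as pd

users = pd.DataFrame({
    "user_id": range(8),
    "days_since_last_active": [1, 3, 6, 10, 15, 22, 30, 45],
})

for threshold in (7, 14, 21, 28):
    churned_pct = (users["days_since_last_active"] > threshold).mean() * 100
    print(f"threshold = {threshold:>2} days -> {churned_pct:.0f}% classified as churned")
```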
Getting stakeholder approval on a new metric 🤹
At some point in the validation process, the data science team will need to present the new metric and its behavior to engineering and product teams for buy-in and feedback before it is adopted as a product success metric.
Metric validation is a form of data analysis, so I would keep in mind Roger Peng’s advice on the topic: a data analysis is successful if the audience to which it is presented accepts the results. Make sure the proposed metric and its supporting analysis pass the Product and Engineering teams’ gut checks.
Monitoring and reporting a metric 🗓️📊
After stakeholder approval, your metric should be ready for logging! Communicating, visualizing, and telling stories with data is a topic that could fill a whole book. That said, here are some tips which might be helpful:
Metric Formatting and Communication
- Use a 7-day rolling average to smooth the weekday/weekend variation that affects daily metrics for most products (see the sketch after this list)
- Teams often recommend reporting metrics based on surveys or sampled logs with 95% confidence intervals
- Align with your team on how to communicate changes in percentage-based metrics (e.g. percentage-point changes versus relative changes); it always gets confusing, quickly
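Here is the sketch referenced in the first bullet above: a 7-day rolling average for a daily metric, plus a normal-approximation 95% confidence interval for a survey-based proportion. All numbers are invented for illustration.

```python
# Sketch: a 7-day rolling average to smooth weekday/weekend swings in a daily
# metric, plus a normal-approximation 95% confidence interval for a
# survey-based proportion. The numbers are hypothetical.
import math
import pandas as pd

daily = pd.Series(
    [100, 105, 98, 110, 120, 60, 55] * 4,  # weekday peaks, weekend dips
    index=pd.date_range("2023-03-01", periods=28, freq="D"),
)
smoothed = daily.rolling(window=7).mean()  # first 6 days are NaN by design
print(smoothed.tail())

# 95% CI for, say, a CSAT proportion measured on a sample of survey responses
p, n = 0.72, 400
margin = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"{p:.0%} ± {margin:.1%}")
```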
Dashboards
- Both engineers and analysts can easily launch dashboards, so it can become difficult for stakeholders to know which ones are best suited to answer their questions. At large technology companies, data science teams will create their own “trusted” set of dashboards on a specific website/application that Product Leadership treats as the source of truth
- It’s simple advice, but I like what Eric Mayefsky said in this blog post — actually look at the dashboards you create! Use them to inspire deeper data investigations. “Soak in the data. Don't think of the dashboards and reports you build as products for someone else—spend time on a regular basis, ideally daily, just messing around”
Quarterly/Scheduled Metric Reviews
- Besides experiment reporting, scheduled metric reviews and trends analysis with Engineering and Product leadership can drive metric impact at larger organizations
- Common questions to answer include attributing metric shifts to new features, new customers, or external effects; comparing metric values to forecasts; deep dives into why metrics are trending downwards or missing goals; cohort analyses; etc.
Sunsetting a metric 🌅
In my experience, analysts rarely deprecate metrics. Instead, data engineering teams and code stewardship drive metric deprecation.
Data teams try to optimize the time it takes to calculate core metrics, often on a daily cadence. The more metrics their daily jobs need to compute, the longer those jobs take to run and the greater the probability of a job failing. These engineering teams have an incentive to stop calculating metrics that analysts and product teams neither monitor nor find useful.
Unimportant or unadopted metrics may be left without owners after their creators leave a team, and their logging may go unclaimed after a long period without updates. Data engineering teams have leeway to deprecate these ownerless metrics.
Summing it up: the metric design checklist 📋
So there you have it! We’ve discussed:
- 🎯📚 Two buckets of metrics: precision and recall
- 👶 Why and where new metrics come from
- 📝 How to think through a metric’s 2nd order effects (and how that question appears in interviews)
- 📈 Validating a metric with correlation and sensitivity analysis
- 🗓️📊 Tips for metric monitoring and reporting
- 🌅 Why and how metric sunsetting occurs
As always, I consider this to be a living document and I am open to feedback — feel free to shoot me a note at zthomas.nc@gmail or on LinkedIn if you have any thoughts! Thanks.