Analytics, Measurement, and Control
By the end of this module you will be able to:
- Design a digital measurement framework using OKRs linked to a North Star metric, and explain the governance role of each
- Distinguish between KPIs and KRIs, and explain when each is the appropriate operational instrument
- Describe funnel analysis, cohort analysis, and A/B experimentation, and match each to the question it is designed to answer

Real-world programme · Google · OKR implementation, 1999
One sheet of paper. Every team from 5 to 50,000 people. The same measurement framework.
In 1999, venture capitalist John Doerr walked into Google's office with a single sheet of paper and a presentation describing OKRs (Objectives and Key Results). Google had 40 employees. Doerr's first OKR for the company included the objective “achieve 10x user growth” with the key result “launch beta product with 1 million users by Q4 1999.” That sheet of paper became the measurement operating system for one of the world's largest technology organisations.
By 2018, every Google team from a five-person engineering squad to the 50,000-person advertising division used the same OKR framework to align their quarterly objectives with company-level goals. Without that measurement discipline, it would have been impossible to know whether any given team was contributing to what mattered, or simply generating impressive activity that produced no strategic value.
The previous module covered how data moves through pipelines into the gold layer. This module covers what you do with it once it arrives: how to measure what matters, how to distinguish performance from risk signals, and how to run experiments that turn product decisions into testable hypotheses rather than opinions.
If you cannot measure it, can you manage it? And if you can measure it, are you measuring the right thing?
With the learning outcomes established, this module begins by examining measurement frameworks in depth.
8.1 Measurement frameworks
Three measurement frameworks dominate digital programme governance: OKRs (Objectives and Key Results), the Balanced Scorecard, and DORA metrics. Each serves a different level and type of decision.
OKRs are appropriate for strategic alignment across product, engineering, and business teams. An OKR defines a qualitative Objective (a direction) and measurable Key Results (the evidence of progress). OKRs cascade from company level to team level, creating a measurement hierarchy that links individual sprint work to organisational strategy without requiring central approval of every decision.
The Balanced Scorecard, developed by Kaplan and Norton in 1992, measures performance across four perspectives: financial, customer, internal processes, and learning and growth. It is better suited to mature organisations with stable business models than to fast-moving digital product teams. Many UK public sector organisations use the Balanced Scorecard as the governance measurement layer above OKRs at the programme level.
DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Restore) measure engineering team delivery performance. The DORA 2023 State of DevOps Report identified that elite engineering performers deploy multiple times per day, with lead times under one hour, change failure rates below 5%, and recovery times under one hour. DORA metrics are the standard instrument for benchmarking engineering delivery capability against the industry.
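The four elite thresholds translate directly into a benchmark check. A minimal sketch, assuming invented metric values for a single team (the field names here are illustrative, not a DORA data format):

```python
# A minimal sketch of checking one team against the DORA 2023 elite
# thresholds cited above. The sample values are invented for illustration.
team = {
    "deploys_per_day": 3.0,          # Deployment Frequency
    "lead_time_hours": 0.8,          # Lead Time for Changes
    "change_failure_rate": 0.04,     # 4% of changes cause a failure
    "time_to_restore_hours": 0.5,    # Mean Time to Restore
}

elite = (
    team["deploys_per_day"] > 1               # multiple deploys per day
    and team["lead_time_hours"] < 1           # lead time under one hour
    and team["change_failure_rate"] < 0.05    # failure rate below 5%
    and team["time_to_restore_hours"] < 1     # recovery under one hour
)
print("elite performer" if elite else "below one or more elite benchmarks")
```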
“An Objective is a significant, concrete, action-oriented and (ideally) inspirational goal. Key Results are a set of metrics that measure your progress towards the Objective.”
Doerr, J., Measure What Matters - Penguin Portfolio, 2018
The distinction between objectives and key results matters for governance. Objectives describe direction; key results provide evidence. A team that rewrites its key results mid-quarter when they become inconvenient is not using OKRs: it is using theatre. OKRs only create accountability when the key results are fixed at the start of the period and the outcome is reported honestly against them.
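That discipline is simple enough to encode. A minimal sketch of an OKR as a data structure, using the 0.0 to 1.0 end-of-quarter scoring convention Doerr describes; the class names and the capped-mean scoring rule are illustrative assumptions, not a standard:

```python
# A minimal sketch of an OKR as data, scored 0.0-1.0 at quarter end.
# Field names and the scoring rule are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class KeyResult:
    description: str
    target: float
    actual: float

    def score(self) -> float:
        return min(self.actual / self.target, 1.0)  # progress, capped at 1.0

@dataclass
class Objective:
    description: str
    key_results: list[KeyResult]

    def score(self) -> float:
        # An objective scores as the mean of its fixed key results
        return sum(kr.score() for kr in self.key_results) / len(self.key_results)

growth = Objective(
    "Achieve 10x user growth",
    [KeyResult("Beta users by Q4", target=1_000_000, actual=700_000)],
)
print(f"Objective score: {growth.score():.2f}")  # 0.70
```

The key results are fixed at construction; honest reporting means updating only the `actual` values, never the targets.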
OKRs, the Balanced Scorecard, and DORA metrics each serve a different governance level. Section 8.2 narrows the focus to a distinction that trips up many practitioners: the difference between KPIs (lagging indicators of what happened) and KRIs (leading indicators of what may happen).
8.2 KPIs vs KRIs
KPIs (Key Performance Indicators) measure outcomes: what the service delivered over a defined period. KPIs are lagging indicators. They tell you what happened. A monthly transaction completion rate of 97.8% is a KPI. It reflects the past.
KRIs (Key Risk Indicators) measure exposure: the current level of risk or the probability that a risk event will occur. KRIs are leading indicators. They change before the risk materialises in a KPI. A rising third-party API error rate, an increasing payment processing latency at the 95th percentile, or a declining database connection pool availability are KRIs signalling degradation that will appear in the completion rate KPI only after customers have already experienced failures.
The distinction determines the operational response cadence. KPIs are reviewed in weekly business reviews, informing strategy and investment. KRIs are monitored continuously in real-time dashboards and trigger automated alerts to on-call engineers. A KRI breach at 3am requires an operational response now; it does not wait for the weekly review.
Good governance programmes maintain both layers. KPIs without KRIs produce reactive organisations that discover problems after customers are already affected. KRIs without KPIs produce organisations that optimise for leading indicators that never translate to outcomes. Both are needed.
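The cadence split is easy to see in code. A minimal sketch of both layers, reusing the completion-rate example above; the metric values and the 1% alert threshold are illustrative assumptions:

```python
# A minimal sketch of the two cadences: a KPI computed for the weekly
# review, and a KRI checked continuously against an alert threshold.
def weekly_kpi(completed: int, attempted: int) -> float:
    """Lagging indicator: what the service delivered last period."""
    return completed / attempted

def kri_breached(api_error_rate: float, threshold: float = 0.01) -> bool:
    """Leading indicator: fires before the KPI reflects the damage."""
    return api_error_rate > threshold

print(f"KPI: {weekly_kpi(97_800, 100_000):.1%}")  # 97.8%; looks healthy
if kri_breached(0.021):                           # error rate rising to 2.1%
    print("ALERT: page on-call now; do not wait for the weekly review")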
Common misconception
“KPIs and KRIs are the same measurement at different frequencies.”
They measure fundamentally different things. A KPI tells you what the service achieved last week. A KRI tells you how likely the service is to fail this week. A declining KRI requires an operational response today, even if the KPI looks fine, because the KPI has not yet reflected the deterioration that the KRI is already signalling. Treating them as the same metric at different intervals misses the entire governance value of leading indicators.

Tracking KPIs and KRIs gives you a complete picture. But with dozens of metrics available, teams need a single number that everyone pulls towards. Section 8.3 covers North Star metrics - the one measure that best captures the value delivered to users.
8.3 North Star metrics
A North Star metric is the single number that best captures the value a product delivers to its customers. Every other metric is subordinate to it. Spotify's North Star is time spent listening. Airbnb's is nights booked. The NHS App's North Star in 2022 was prescriptions managed digitally, reflecting the programme goal of reducing GP surgery administration load.
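Operationally, a North Star reduces to a query over product events. A minimal sketch, assuming an invented event schema and the metric "payments completed per active user per month" (the fintech example revisited in the scenario questions later in this module):

```python
# A minimal sketch of computing a North Star metric from event data.
# The event schema and values are illustrative assumptions.
events = [
    {"user": "u1", "month": "2024-01", "type": "payment_completed"},
    {"user": "u1", "month": "2024-01", "type": "payment_completed"},
    {"user": "u2", "month": "2024-01", "type": "login"},
    {"user": "u2", "month": "2024-01", "type": "payment_completed"},
]

month = "2024-01"
active = {e["user"] for e in events if e["month"] == month}
payments = sum(1 for e in events
               if e["month"] == month and e["type"] == "payment_completed")
print(f"{month}: {payments / len(active):.2f} payments per active user")
```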
Choosing the right North Star is the hardest part of the framework. It must be a leading indicator of long-term value rather than a lagging measure of past revenue. Airbnb chose nights booked rather than revenue because nights booked reflects value delivered to both guests and hosts; revenue is a downstream consequence. It must be actionable: product teams must be able to influence it directly through their feature decisions.
The governance value of a North Star is that it provides a decision rule for trade-offs. When two product teams disagree about which feature to prioritise, the North Star provides an external arbiter: which option is more likely to move this number? That shared reference eliminates the politics of competing KPIs and focuses debate on evidence rather than advocacy.
“The North Star Metric is the single metric that best captures the core value that your product delivers to customers.”
Amplitude, North Star Playbook - 2019, amplitude.com
The practical test for a North Star candidate is whether it survives a feature trade-off. If a new feature increases daily active users but reduces the North Star, and the team chooses to ship it anyway, the metric is not actually the North Star. The metric only has governance value if it is the non-negotiable decision rule, not one input among many.
Frameworks tell you what to measure. Tooling determines how you measure it and who can access the data. Section 8.4 maps the analytics tooling landscape and explains how different tools cover different parts of the measurement stack.
8.4 Analytics tooling landscape
Three categories of analytics tooling serve different measurement purposes. Each category has distinct data models, query patterns, and organisational users.
Product analytics tools (Google Analytics 4, Amplitude, Mixpanel) track user behaviour within digital products: page views, feature interactions, conversion funnels, retention cohorts. GA4, released in 2020, introduced an event-based data model replacing the session-based model of Universal Analytics. Amplitude and Mixpanel are purpose-built for product teams, offering funnel analysis, A/B test result views, and cohort retention charts without requiring SQL. All three support North Star metric tracking and OKR key result dashboards.
Operational analytics tools (Grafana, Datadog, New Relic) monitor infrastructure and application performance. They track KRIs: request error rates, latency percentiles, database connection pool utilisation, queue depths. These tools power on-call alert responses. Grafana is open-source and dominant in UK public sector and NHS Digital deployments; Datadog is the commercial leader in enterprise environments.
Business intelligence tools (Power BI, Tableau, Looker) connect to data warehouse gold layer tables and serve the weekly KPI reporting cadence: revenue, transactions, compliance metrics, operational throughput. BI dashboards are built for consumption by non-technical stakeholders. Their data is stale by hours rather than seconds; they are not appropriate for operational KRI monitoring.
With the right tooling in place, teams can perform more sophisticated analysis. Section 8.5 covers two core analytical techniques: funnel analysis for identifying where users drop off, and cohort analysis for understanding how behaviour changes over time.
8.5 Funnel analysis and cohort analysis
Conversion funnel analysis maps the sequence of steps users take towards a goal and shows the percentage progressing from each step to the next. It answers the question: at which step are we losing people, and by how much? The steps in an AARRR funnel (Acquisition, Activation, Retention, Referral, Revenue) represent progressively deeper engagement. Each transition is a measurement point.
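The arithmetic behind a funnel report is a chain of ratios. A minimal sketch with invented AARRR stage counts:

```python
# A minimal funnel-conversion sketch. Stage counts are invented.
steps = [
    ("Acquisition", 10_000),
    ("Activation", 4_200),
    ("Retention", 2_100),
    ("Referral", 300),
    ("Revenue", 180),
]

top = steps[0][1]
for (prev, prev_n), (stage, n) in zip(steps, steps[1:]):
    print(f"{prev} -> {stage}: {n / prev_n:.1%} step conversion "
          f"({n / top:.1%} of top of funnel)")
```

The step with the lowest step conversion is where UX effort pays off first, which is exactly how the HMRC case below was diagnosed.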
When HMRC launched Making Tax Digital for VAT in 2019, funnel analysis of the onboarding journey revealed that 38% of registered businesses were abandoning at the accounting software authorisation step. Session recordings identified that the OAuth authorisation screen looked untrustworthy to users who did not recognise the pattern. The funnel data directed UX improvement effort to the specific step causing the drop-off, rather than spreading effort across the entire journey.
Cohort analysis groups users by a shared characteristic (sign-up week, acquisition channel, first product version used) and tracks their behaviour over time. Cohort analysis surfaces patterns that aggregate metrics obscure. A stable overall retention rate can mask a deteriorating trend: if the most recent cohorts are retaining at 30% and older cohorts are retaining at 55%, the aggregate is being held up by older users, and the product has a structural problem that will only appear in aggregate data several quarters later. Cohort analysis shows the problem now.
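A minimal cohort-retention sketch, assuming invented activity rows keyed by sign-up week; the output is one retention row per cohort, which makes the deterioration pattern described above visible immediately:

```python
# A minimal cohort-retention sketch in plain Python. The sample rows,
# field layout, and week boundaries are illustrative assumptions.
from collections import defaultdict

# (user_id, signup_week, activity_week): one row per week a user was active
activity = [
    ("u1", 0, 0), ("u1", 0, 1), ("u1", 0, 2),
    ("u2", 0, 0), ("u2", 0, 1),
    ("u3", 1, 1), ("u3", 1, 2),
    ("u4", 1, 1),
]

cohort_users = defaultdict(set)
retained = defaultdict(set)
for user, signup_week, activity_week in activity:
    cohort_users[signup_week].add(user)
    offset = activity_week - signup_week  # weeks since signup
    retained[(signup_week, offset)].add(user)

for cohort in sorted(cohort_users):
    size = len(cohort_users[cohort])
    row = [len(retained.get((cohort, offset), set())) / size
           for offset in range(3)]
    print(f"cohort week {cohort}: " + "  ".join(f"{r:.0%}" for r in row))
```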
Funnel analysis asks “where are we losing users?” Cohort analysis asks “are the users we acquire today behaving differently from the users we acquired six months ago?” Both questions are essential; neither can be answered by aggregate metrics alone.
Funnel and cohort analysis tell you what users are doing. Experimentation tells you why. Section 8.6 covers A/B testing and the statistical concepts that prevent teams from drawing false conclusions from noisy data.
8.6 Experimentation
Experimentation culture treats product decisions as hypotheses to be tested rather than opinions to be debated. An A/B test presents two variants to randomly selected user groups, measures the primary metric for each, and determines whether the difference is statistically significant or within the range of chance variation.
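The significance check itself is a two-proportion z-test. A minimal standard-library sketch with invented traffic numbers; the platforms named below run this calculation for you:

```python
# A minimal two-proportion z-test sketch for an A/B result, using only
# the standard library. Conversion counts are invented for illustration.
import math

def ab_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference in conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

p = ab_p_value(conv_a=520, n_a=5_000, conv_b=585, n_b=5_000)
print(f"p = {p:.3f}")  # below 0.05: significant at the 95% level
```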
Booking.com runs more than 25,000 A/B tests a year, with over 1,000 running concurrently at any given time. Monzo uses feature flags to control exposure of new features to percentage cohorts before full release, combining A/B testing with progressive rollout. Optimizely and LaunchDarkly are the two dominant feature flag and experimentation platforms; both handle statistical significance calculation, making it possible for product managers without statistics backgrounds to run valid experiments.
The prerequisite for valid A/B testing is sufficient traffic volume. Statistical significance requires enough users in each variant to distinguish a real effect from noise at the chosen confidence level (typically 95%). A service with 200 daily users cannot run a 2-week experiment with a realistic effect size and achieve significance. Below this threshold, qualitative methods (usability testing with 5 to 8 participants, contextual interviews, session recording analysis) are more informative per unit of investment.
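The traffic threshold is not a matter of opinion; it falls out of the sample-size formula. A minimal sketch using the conventional z-values for 95% confidence and 80% power; the baseline and target conversion rates are assumptions:

```python
# A minimal sample-size sketch for a two-variant test. z-values assume
# 95% confidence (1.96, two-sided) and 80% power (0.84); rates invented.
import math

def min_sample_per_variant(p1: float, p2: float,
                           z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a lift from 10% to 12% needs roughly 3,800 users per variant.
# A 200-user-per-day service split 50/50 supplies only ~1,400 per variant
# in two weeks, well short of the threshold described above.
print(min_sample_per_variant(0.10, 0.12))  # 3834
```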
Common misconception
“More metrics equals better measurement.”
Metric overload causes analysis paralysis. When a dashboard contains 80 KPIs, no one reviews them consistently, anomalies are missed, and measurement becomes a reporting exercise rather than a decision tool. The North Star framework exists precisely to counter this pattern: one primary metric with supporting KPIs, not an exhaustive catalogue. If a metric cannot trigger a specific decision, it should not be on a dashboard.

Scenario questions
A UK fintech product team has a North Star metric of 'payments completed per active user per month'. A new feature increases card storage rate from 38% to 64% within three weeks of launch. Three months later, the North Star metric shows no change. Which analytical method would most directly test whether card storage was the correct feature to prioritise?
A payment processing service monitors its weekly transaction completion rate (KPI: 98.1% this week). A real-time Grafana dashboard shows the third-party card network API error rate has risen from 0.3% to 2.1% over the past 90 minutes. The product owner says the KPI looks healthy so there is no immediate action required. What is the most accurate response?
A digital public service team ships a redesigned benefit application form that improves overall submission rate from 61% to 74%. Six months later, the team notices that appeal rates among users acquired through the redesigned form are 18 percentage points higher than among users acquired previously. Which analytical approach most efficiently diagnoses whether the redesign is causing this pattern, and what might the underlying explanation be?
Key takeaways
- OKRs cascade from company to team level, linking qualitative Objectives to measurable Key Results. They create measurement alignment without requiring central approval of every decision, provided the key results are fixed at the start of the period.
- DORA metrics (Deployment Frequency, Lead Time, Change Failure Rate, MTTR) are the industry standard for benchmarking engineering delivery performance. Elite performers in 2023 deploy multiple times daily with sub-hour lead times.
- KPIs measure past performance; KRIs measure current risk exposure. Both are needed at different operational cadences: KPIs in weekly reviews, KRIs on real-time dashboards with automated alerting.
- A North Star metric provides a single decision rule for product trade-offs. Its governance value depends entirely on teams treating it as a non-negotiable arbiter rather than one input among many.
- Funnel analysis identifies where users fail to progress towards a goal. Cohort analysis reveals whether different user segments behave differently over time. Aggregate metrics cannot substitute for either.
- A/B testing requires sufficient traffic volume for statistical significance. Below a service-specific threshold, qualitative research (usability testing, contextual interviews) is more informative per unit of investment.
Standards and sources cited in this module
Doerr, J., Measure What Matters
Penguin Portfolio, 2018
OKR framework definition, structure, and Google implementation case study. Quoted in Section 8.1 and cited as the primary OKR methodology source throughout the module.
DORA State of DevOps Report 2023
dora.dev
Elite performer benchmarks for Deployment Frequency, Lead Time, Change Failure Rate, and MTTR. Referenced in Section 8.1.
Amplitude, North Star Playbook
2019, amplitude.com/north-star
North Star metric framework definition, selection criteria, and real-world examples including Spotify and Airbnb. Quoted in Section 8.3.
McClure, D., AARRR Startup Metrics for Pirates
500 Startups, 2007
AARRR framework defining Acquisition, Activation, Retention, Referral, and Revenue as the standard funnel model for digital products. Referenced in Section 8.5.
HMRC Making Tax Digital, GOV.UK
gov.uk/making-tax-digital
Real-world funnel analysis case study: 38% drop-off at OAuth authorisation step cited in Section 8.5 as an example of funnel analysis directing UX improvement effort.
GOV.UK Performance Platform, GDS
2013 to 2019
UK public sector OKR and KPI transparency example. Demonstrated that publishing digital KPIs publicly drives cross-team improvement and stakeholder accountability.
Measurement tells you whether digitalisation is working. The next module examines the integration patterns that connect digital systems to each other: REST APIs, event-driven architectures, and the design decisions that determine whether those connections are maintainable at scale.
Module 8 of 15 in Applied