This is the first of 8 Practice & Strategy modules. You have built, evaluated, and deployed individual models across the Foundations and Applied stages. Now the question shifts from "how do I build one model?" to "how do I build a system that runs thousands of models reliably?" The architectural patterns in this module underpin every production ML platform you will encounter.

Production ML at scale · 2023
Spotify does not run one recommendation model. It runs thousands. There are models for Discover Weekly, Release Radar, home screen recommendations, podcast suggestions, and search ranking. Each model consumes different features, trains on different schedules, and serves predictions at different latencies. A single listening session can trigger dozens of model inferences before the user hears a note.
Coordinating this requires more than good data science. It requires an ML platform: shared feature stores so models do not each re-derive the same user embeddings, experiment tracking so the team can compare hundreds of A/B tests simultaneously, orchestration pipelines that retrain models on schedule without human intervention, and a model registry that tracks which version of which model is currently serving which percentage of traffic.
The lesson from Spotify is that ML system design is not an afterthought bolted on after the data scientist finishes a notebook. It is the discipline that determines whether a promising prototype ever reaches a single user.
How do you coordinate thousands of models that must each produce a recommendation within 200 milliseconds?
Spotify's challenge is not unique. Every organisation that moves from one experimental notebook to multiple models in production faces the same architectural questions. This module walks through each layer of an ML system, from data ingestion to monitoring, so you can recognise the patterns and avoid rebuilding infrastructure that already exists as mature tooling.
If you have already designed ML pipelines in production, use the knowledge checks to confirm your understanding and skip to Module 18: Scaling and cost.
The module begins by examining the end-to-end ML pipeline in depth.
An ML pipeline is not a single script. It is a directed acyclic graph (DAG) of dependent steps, each of which must complete before the next can start. A typical pipeline includes: data ingestion (pulling raw data from sources), data validation (checking schema and statistical properties), feature engineering (transforming raw data into model inputs), training (fitting model parameters), evaluation (measuring performance against held-out data), and serving (exposing the trained model to production traffic).
Each step can fail independently. Data sources change schema. Feature distributions drift. Training jobs run out of memory. Evaluation thresholds are breached. A strong pipeline handles each failure mode with retries, alerts, and fallbacks rather than silent corruption.
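To make this concrete, here is a minimal sketch in plain Python of a runner that retries transient step failures and surfaces persistent ones, rather than letting errors corrupt downstream steps silently. The step bodies are illustrative stand-ins; in practice this logic is delegated to an orchestrator, covered later in this module.

```python
import time

def ingest():
    return [1.0, 2.0, 3.0]             # stand-in for pulling raw data

def validate(data):
    assert all(x >= 0 for x in data)   # stand-in schema/statistics check
    return data

def train(data):
    return sum(data) / len(data)       # stand-in for fitting a model

def run_with_retries(step, *args, max_attempts=3, backoff_s=1.0):
    """Run one pipeline step; retry transient failures, then alert."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step(*args)
        except Exception:
            if attempt == max_attempts:
                raise                  # surface to alerting, never continue silently
            time.sleep(backoff_s * attempt)

# Steps execute in dependency order: each consumes the previous step's output.
data = run_with_retries(ingest)
data = run_with_retries(validate, data)
model = run_with_retries(train, data)
```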
The critical insight is that the training code, the part most data scientists focus on, is a small fraction of a production ML system. Google's widely cited 2015 paper on technical debt in ML systems estimated that model code accounts for roughly 5% of total system code. The remaining 95% is data collection, feature extraction, configuration, monitoring, serving infrastructure, and testing.
“Only a small fraction of real-world ML systems is composed of the ML code. The required surrounding infrastructure is vast and complex.”
Sculley, D. et al., 'Hidden Technical Debt in Machine Learning Systems', NeurIPS (2015) - Section 1: Introduction
This paper introduced the concept of ML-specific technical debt and the now-famous diagram showing model code as a tiny rectangle surrounded by massive infrastructure blocks. It reshaped how the industry thinks about ML engineering.
With the anatomy of the pipeline established, the next layer is the feature store: the shared data layer on which every model draws.
A feature store is a centralised repository for storing, versioning, and serving the engineered features that models consume. Without one, every team re-derives the same features independently: one team writes SQL to compute "average session length over 7 days," another team writes slightly different SQL for the same concept, and a third team hardcodes the logic in a Python script. The result is duplication, inconsistency, and training-serving skew, where the features used during training differ from those used during inference.
Feature stores solve this by providing a single source of truth. Features are defined once, computed on a schedule (batch) or in real time (streaming), and served to both training pipelines and inference endpoints through the same API. Feast (open source) and Tecton (managed) are the most widely adopted options. Spotify built its own internal feature store to serve the embeddings that power its recommendation models.
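As an illustration, here is roughly what a feature definition looks like in Feast's Python API (a sketch only: the exact classes vary between Feast releases, and the entity, feature, and path names are invented for the example). The point is that one definition backs both `get_historical_features` for training and `get_online_features` for serving, which is what closes the training-serving gap.

```python
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

# Defined once, consumed by both training and inference paths.
user = Entity(name="user", join_keys=["user_id"])

session_source = FileSource(
    path="data/session_stats.parquet",     # illustrative local source
    timestamp_field="event_timestamp",
)

session_stats = FeatureView(
    name="session_stats",
    entities=[user],
    schema=[Field(name="avg_session_length_7d", dtype=Float32)],
    source=session_source,
)
```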
The organisational benefit is equally important. When a new team wants to build a model, they can browse the feature catalogue, reuse existing features, and focus on model logic rather than data engineering. Feature stores turn ML from artisanal one-off work into an engineering discipline with shared, governed components.
Common misconception
“Feature stores are only needed at big-tech scale.”
Training-serving skew is the most common source of silent production failures in ML systems of any size. Even a team of three data scientists benefits from a shared feature definition that guarantees consistency between training and inference. Feast can run on a single machine with a local file store. The investment is proportional to team size; the benefit is universal.
Shared features solve consistency across models; the next challenge is keeping track of what you tried. That is the job of experiment tracking.
ML development is inherently experimental. A typical project involves hundreds of training runs with different hyperparameters, feature sets, data subsets, and model architectures. Without systematic tracking, the team cannot answer basic questions: which run produced the best F1 score? What hyperparameters were used? Which version of the training data was active?
MLflow (open source, now part of the Linux Foundation) provides experiment tracking, model packaging, and a model registry. Each run logs parameters, metrics, and artefacts (model files, plots, configuration). Runs are grouped into experiments and can be compared in a tabular UI.
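A minimal tracking run looks something like the sketch below (the experiment name, parameters, and model are illustrative; scikit-learn is assumed to be available):

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("demo-classifier")

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 6}
    mlflow.log_params(params)                         # hyperparameters
    model = RandomForestClassifier(**params, random_state=42)
    model.fit(X_train, y_train)
    mlflow.log_metric("f1", f1_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")          # artefact: the model itself
```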
Weights & Biases (W&B) provides similar capabilities with a hosted dashboard, real-time training visualisation, hyperparameter sweep orchestration, and collaborative annotation. W&B is popular in research labs and has become the de facto standard for deep learning experimentation.
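The equivalent W&B sketch, assuming `wandb login` has been run (the project name and logged values are illustrative):

```python
import wandb

run = wandb.init(project="demo-classifier", config={"lr": 3e-4, "epochs": 5})

for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1)          # stand-in for a real training loss
    wandb.log({"epoch": epoch, "train_loss": train_loss})

run.finish()
```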
The common thread is reproducibility. If you cannot reconstruct the exact conditions that produced a result, you cannot trust that result. Experiment tracking makes the implicit explicit: every decision, every parameter, every dataset version is recorded automatically.
Experiment tracking records what ran; orchestration decides when it runs, in what order, and what happens on failure.
An ML pipeline is a DAG of tasks with dependencies. Orchestration tools schedule these tasks, manage retries on failure, parallelise independent steps, and provide visibility into what ran, when, and whether it succeeded.
Apache Airflow is the most widely deployed orchestrator. Originally built at Airbnb, it defines DAGs in Python and runs tasks on a schedule or in response to triggers. Airflow excels at batch data pipelines and is well understood by data engineering teams. Its limitation is that it was designed for data engineering, not ML specifically: it does not natively understand GPU allocation, distributed training, or model artefact management.
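A daily retraining DAG in Airflow might look like the following sketch (task bodies are stand-ins; note that the `schedule` argument is called `schedule_interval` in Airflow versions before 2.4):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("pulling raw data")       # stand-in for a real extract step

def validate():
    print("checking schema")        # stand-in for data validation

def train():
    print("fitting model")          # stand-in for a training job

with DAG(
    dag_id="daily_retrain",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)
    train_task = PythonOperator(task_id="train", python_callable=train)

    # The >> operator declares dependencies, forming the DAG.
    ingest_task >> validate_task >> train_task
```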
Kubeflow Pipelines runs on Kubernetes and was designed for ML from the start. Each pipeline step runs in a container, making it straightforward to allocate GPUs, scale horizontally, and reproduce environments exactly. The trade-off is operational complexity: you need a Kubernetes cluster and the expertise to manage it.
Prefect offers a middle ground. It provides Airflow-like scheduling with a more modern Python API, better error handling, and a managed cloud option that removes infrastructure burden. Prefect is gaining adoption among smaller ML teams that want orchestration without running Kubernetes.
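In Prefect, the same pipeline is plain decorated Python, with retries declared per task (again a sketch with stand-in task bodies):

```python
from prefect import flow, task

@task(retries=2, retry_delay_seconds=60)
def ingest() -> list[float]:
    return [0.2, 0.5, 0.9]               # stand-in for a real extract step

@task
def train(data: list[float]) -> float:
    return sum(data) / len(data)         # stand-in for model fitting

@flow(log_prints=True)
def daily_retrain():
    data = ingest()
    score = train(data)
    print(f"model score: {score:.2f}")

if __name__ == "__main__":
    daily_retrain()                      # runs locally; schedule via Prefect Cloud/server
```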
Once orchestrated pipelines produce trained models on a schedule, those models need a governed home: the model registry, version control for models.
A model registry is a versioned catalogue of trained models with metadata about each version: who trained it, what data it was trained on, its evaluation metrics, and its deployment status (staging, production, archived). It serves the same purpose for ML models that a container registry serves for Docker images or that a package registry serves for software libraries.
MLflow includes a model registry. So does Vertex AI (Google Cloud), SageMaker (AWS), and Azure ML. The key operations are: register a new model version after a successful training run, promote a version from staging to production after evaluation passes, and roll back to a previous version if the new one degrades in production.
Without a registry, deployment becomes a manual, error-prone process. Someone copies a model file to a server, hopes it is the right version, and has no systematic way to roll back. With a registry, deployment is a state transition: change the model's stage from "staging" to "production" and the serving infrastructure picks up the new version automatically.
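With MLflow's registry, for example, that state transition is a small API call. The sketch below assumes a completed training run whose ID is in `run_id`; note that recent MLflow releases favour model version aliases over the older stage mechanism shown here:

```python
import mlflow
from mlflow import MlflowClient

run_id = "abc123"  # illustrative: the ID of a successful training run

# Register the model logged by that run as a new version of "recommender";
# the registered model entry is created on first use.
version = mlflow.register_model(f"runs:/{run_id}/model", "recommender")

# Promote: serving infrastructure watching the registry picks up the change.
client = MlflowClient()
client.transition_model_version_stage(
    name="recommender", version=version.version, stage="Production"
)
```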
“Model management is not optional at scale. Without versioning, lineage tracking, and promotion workflows, teams lose the ability to reason about what is running in production.”
Amershi, S. et al., 'Software Engineering for Machine Learning: A Case Study', ICSE-SEIP (2019) - Section 4: Model Management
This Microsoft Research paper studied ML practices across multiple product teams and identified model management as a critical gap. Teams that lacked a registry spent disproportionate time debugging deployment issues that version control would have prevented.
With versioning and promotion handled by the registry, one layer remains: monitoring and observability, which keep deployed models healthy in production.
A model that passes evaluation does not stay good forever. The world changes, and the data the model sees in production drifts away from the data it was trained on. User preferences shift. Seasonal patterns emerge. Upstream data sources change schema. These changes produce two distinct failure modes: data drift (the input distribution changes) and concept drift (the relationship between inputs and outputs changes).
Production monitoring tracks both model-level metrics (prediction latency, error rates, throughput) and ML-specific metrics (feature distribution statistics, prediction distribution shifts, model confidence calibration). Tools like Evidently AI, WhyLabs, and Arize AI specialise in ML observability. At a minimum, you should monitor the distribution of each input feature and the distribution of model predictions. When either shifts significantly from the training distribution, it is time to investigate and potentially retrain.
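As a minimal version of that per-feature check, a two-sample Kolmogorov-Smirnov test flags when the live distribution of a feature has moved away from its training distribution (a sketch with synthetic data; tools such as Evidently wrap tests like this with dashboards and alerting):

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, live_values, alpha=0.01):
    """Flag drift when live values differ significantly from training values."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha

rng = np.random.default_rng(0)
train_dist = rng.normal(0.0, 1.0, 5_000)   # feature as seen during training
live_dist = rng.normal(0.4, 1.0, 5_000)    # production values: mean has shifted

print(feature_drifted(train_dist, live_dist))  # True -> investigate, maybe retrain
```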
Spotify monitors recommendation quality through online metrics (click-through rate, streaming time) and offline metrics (nDCG, MRR) computed on daily evaluation sets. When online metrics drop, the system can automatically roll back to the previous model version while the team investigates.
Common misconception
“Once a model is deployed, the ML work is done.”
Deployment is the beginning of a continuous feedback loop, not the end. Models degrade as the world changes beneath them. Without monitoring for data drift, concept drift, and performance degradation, a model that was excellent at deployment can become actively harmful within weeks. Production ML is an ongoing operational responsibility, not a one-time project.
A team discovers that their recommendation model performs well in offline evaluation but poorly in production. Investigation reveals that the features computed during training use batch-aggregated statistics, while the serving path computes features in real time with slightly different logic. What is this problem called?
Your company runs 50 ML models in production. A junior engineer proposes deploying new model versions by copying files to the production server via SCP. What is the primary risk of this approach?
An ML platform team is choosing between Apache Airflow and Kubeflow Pipelines for orchestrating their training workflows. The team has strong Python skills but no Kubernetes expertise, and they run batch training jobs on a small GPU cluster. Which recommendation is most appropriate?
Sculley, D. et al., 'Hidden Technical Debt in Machine Learning Systems', NeurIPS (2015)
Full paper
Introduced the concept of ML-specific technical debt and the widely cited diagram showing that model code is a tiny fraction of a production ML system. Foundational reference for ML system design.
Amershi, S. et al., 'Software Engineering for Machine Learning: A Case Study', ICSE-SEIP (2019)
Section 4: Model Management
Microsoft Research study of ML engineering practices across product teams. Identified model management, data management, and testing as the three most critical engineering gaps in production ML.
Zaharia, M. et al., 'Accelerating the Machine Learning Lifecycle with MLflow', IEEE Data Engineering Bulletin (2018)
Sections 2-4
Describes the design of MLflow, the most widely adopted open-source experiment tracking and model registry platform. Explains the tracking, projects, and models abstractions.
Hermann, J. and Del Balso, M., 'Meet Michelangelo: Uber's Machine Learning Platform', Uber Engineering Blog (2017)
Full article
One of the first detailed public descriptions of an end-to-end ML platform at scale. Describes how Uber built feature stores, training pipelines, and model serving to support thousands of models.
Polyzotis, N. et al., 'Data Lifecycle Challenges in Production Machine Learning', ACM SIGMOD Record (2018)
Sections 3-5
Analyses data management challenges specific to ML pipelines, including data validation, feature engineering, and the concept of data as a first-class engineering artefact. Used for the data validation and feature store sections.
You now understand the architecture of a production ML system: pipelines, feature stores, experiment tracking, orchestration, model registries, and monitoring. The next question is economic: how do you run these systems without the infrastructure bill consuming the entire budget? Module 18 covers GPU optimisation, knowledge distillation, quantisation, and the techniques that make large models affordable to serve.
Module 17 of 24 · AI Practice & Strategy