This is the fifth of 8 Applied modules (24 modules total). You have built, fine-tuned, and evaluated models in previous modules. Now the question shifts from "does my model work?" to "does my model keep working after deployment?" The operational discipline covered here separates prototypes that impress in a notebook from systems that deliver value in production.

Real-world failure · November 2021
Zillow Offers used a machine learning model to predict home prices and buy properties algorithmically. In controlled evaluation, the model appeared accurate. In production, the housing market shifted during the pandemic: supply constraints, remote work migration patterns, and bidding wars broke the statistical relationships the model had learned.
The model continued to buy homes at prices it predicted would rise. They did not. Zillow wrote down $569 million in losses, laid off 25% of its workforce, and shut down the iBuying division entirely in November 2021.
The failure was not in model training. The failure was in deployment: no monitoring detected that input distributions had shifted, no circuit breaker paused purchasing when predictions diverged from market reality, and no A/B test validated that the model's real-world decisions were producing the expected outcomes.
Your model scores well in testing. What could possibly go wrong in production?
Zillow's story illustrates the central lesson of MLOps: a model that works in evaluation can fail catastrophically in production if you do not monitor it, version it, and test it continuously. This module covers the infrastructure and practices that prevent that outcome.
With the stakes established, this module begins by examining batch versus real-time serving in depth.
Batch serving runs inference on a schedule. You process all new data at a fixed interval (hourly, daily, nightly) and store the predictions for later retrieval. Recommendation engines on e-commerce sites commonly use batch serving: they precompute recommendations for every user overnight and serve them from a cache the next day. Latency does not matter because the prediction is available before the user requests it.
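As a rough sketch, a nightly batch job might look like the following, assuming a scikit-learn classifier persisted with joblib and a flat file of user features; the file, column, and function names are illustrative, not part of any specific platform.

```python
# Nightly batch scoring: score every user and store predictions for later retrieval.
# Assumes a trained scikit-learn classifier persisted with joblib and user features
# in a CSV; file and column names are illustrative.
import joblib
import pandas as pd

def run_nightly_batch(model_path: str, features_path: str, output_path: str) -> None:
    model = joblib.load(model_path)                      # load the currently approved model
    users = pd.read_csv(features_path)                   # all users needing fresh predictions
    feature_cols = [c for c in users.columns if c != "user_id"]
    users["score"] = model.predict_proba(users[feature_cols])[:, 1]
    users[["user_id", "score"]].to_csv(output_path, index=False)   # serving layer reads this cache

if __name__ == "__main__":
    run_nightly_batch("model.joblib", "user_features.csv", "scores.csv")
```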
Real-time serving runs inference at request time. When a user sends a query, the model computes and returns a prediction immediately. Fraud detection must operate in real time: the model sees a transaction, makes a decision in milliseconds, and either approves or blocks the charge before the merchant processes it. Latency is critical. Every additional millisecond in the inference path risks customer abandonment or regulatory non-compliance.
The choice is not always binary. Many production systems use a hybrid approach: batch-computed features (like a user's average spend over the past 30 days) are combined with real-time signals (like the current transaction amount and location) at inference time. This pattern gives you the cost efficiency of batch computation for stable features and the responsiveness of real-time inference for volatile ones.
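A minimal sketch of that hybrid pattern, with a plain dictionary standing in for the online feature cache (in practice Redis or a feature store's online table); the user IDs, feature names, and derived signals are illustrative.

```python
# Hybrid serving: stable, precomputed batch features are joined with
# volatile real-time signals at request time. The dict stands in for a
# feature cache; all names are illustrative.
from typing import Dict

# Refreshed nightly by the batch pipeline.
BATCH_FEATURES: Dict[str, Dict[str, float]] = {
    "user_42": {"avg_spend_30d": 82.50, "txn_count_30d": 17.0},
}

def build_feature_vector(user_id: str, txn_amount: float, txn_hour: int) -> Dict[str, float]:
    stable = BATCH_FEATURES.get(user_id, {"avg_spend_30d": 0.0, "txn_count_30d": 0.0})
    return {
        **stable,                                    # batch-computed, hours old, cheap to serve
        "txn_amount": txn_amount,                    # real-time signal from the request
        "amount_vs_avg": txn_amount / max(stable["avg_spend_30d"], 1.0),
        "is_night": float(txn_hour < 6),
    }

print(build_feature_vector("user_42", txn_amount=930.0, txn_hour=2))
```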
“The hard part of machine learning is not building models. It is building production systems around models.”
Sculley, D. et al., 'Hidden Technical Debt in Machine Learning Systems', NeurIPS (2015) - Section 1: Introduction
This paper, authored by Google engineers, established that ML systems in production accumulate technical debt at an alarming rate. The model itself is a small fraction of the total system; serving infrastructure, monitoring, data pipelines, and configuration management dominate the engineering effort.
With an understanding of batch versus real-time serving in place, the discussion can now turn to data drift and concept drift, which build directly on these foundations.
Data drift (also called covariate shift) occurs when the distribution of input features changes over time. The model was trained on data that looked one way; production data now looks different. A spam filter trained on 2020 emails will see data drift by 2024 because the vocabulary, structure, and tactics of spam evolve continuously.
Concept drift occurs when the relationship between inputs and outputs changes. The features might look the same, but what they predict has shifted. Zillow's model experienced concept drift: the features (location, square footage, recent sales) still existed, but their relationship to price changed as pandemic dynamics rewrote the housing market. The model's learned function was correct for the old world and wrong for the new one.
Detecting drift requires monitoring. Statistical tests (Population Stability Index, Kolmogorov-Smirnov test, Jensen-Shannon divergence) compare production input distributions against training distributions. Performance monitoring tracks prediction accuracy against ground-truth labels as they become available. The combination of input monitoring and output monitoring catches both types of drift. Neither alone is sufficient.
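To make the input-monitoring half concrete, here is a minimal sketch of two common checks on a single numeric feature: the two-sample Kolmogorov-Smirnov test from scipy and a hand-rolled Population Stability Index. The alert thresholds are illustrative rules of thumb, not universal constants.

```python
# Drift checks on one numeric feature: KS test (scipy) and Population Stability Index.
import numpy as np
from scipy.stats import ks_2samp

def psi(train: np.ndarray, prod: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(train, np.linspace(0, 1, bins + 1))   # bin by training quantiles
    prod = np.clip(prod, edges[0], edges[-1])                  # keep production values inside the bins
    train_frac = np.histogram(train, edges)[0] / len(train)
    prod_frac = np.histogram(prod, edges)[0] / len(prod)
    train_frac = np.clip(train_frac, 1e-6, None)               # avoid log(0) for empty bins
    prod_frac = np.clip(prod_frac, 1e-6, None)
    return float(np.sum((prod_frac - train_frac) * np.log(prod_frac / train_frac)))

rng = np.random.default_rng(0)
train_values = rng.normal(100, 15, 50_000)     # feature distribution at training time
prod_values = rng.normal(115, 20, 5_000)       # same feature in production, shifted

stat, p_value = ks_2samp(train_values, prod_values)
score = psi(train_values, prod_values)
print(f"KS p-value: {p_value:.2e}   PSI: {score:.3f}")
if score > 0.2 or p_value < 0.01:              # illustrative alert thresholds
    print("Drift alert: investigate before trusting the next batch of predictions.")
```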
Common misconception
“If my model's accuracy was high in testing, it will stay high in production.”
Test accuracy is a snapshot. Production is a movie. Input distributions shift, user behaviour changes, the world evolves, and competitors adapt. Without continuous monitoring for data drift and concept drift, you will not know your model is failing until the business impact is undeniable. Zillow's model tested well. It lost $569 million in production because nobody was watching the drift.
With an understanding of data drift and concept drift in place, the discussion can now turn to A/B testing for models, which builds directly on these foundations.
Offline evaluation tells you how a model performs on historical data. A/B testing tells you how it performs on live users. Traffic is split between the current production model (control) and a candidate model (treatment). Business metrics (click-through rate, conversion, revenue, error rate) are compared with statistical significance tests.
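The significance check can be as simple as a two-proportion z-test on conversion counts from the two arms. A minimal sketch; the counts below are made up, and in practice they would come from the experiment's logging pipeline.

```python
# Comparing conversion rates between the control model (A) and the candidate
# model (B) with a two-sided two-proportion z-test. Counts are illustrative.
import math
from scipy.stats import norm

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                   # pooled rate under H0: no difference
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))                                 # two-sided p-value

p = two_proportion_p_value(conv_a=1_180, n_a=50_000, conv_b=1_260, n_b=50_000)
print(f"p-value: {p:.3f}")   # promote the candidate only if the lift is significant
                             # AND the guardrail metrics have not regressed
```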
A/B testing catches problems that offline evaluation cannot. A recommendation model might score well on historical click data but reduce engagement when deployed because it surfaces content users have already seen. Only a live experiment reveals this because the model's recommendations change the very data it will be evaluated on (a feedback loop).
Common pitfalls include running tests for too short a period (novelty effects inflate early results), contaminating the control group (users in the treatment group influence users in the control group through social sharing), and ignoring guardrail metrics (the candidate model improves click-through rate but increases customer support tickets).
With an understanding of A/B testing for models in place, the discussion can now turn to feature stores, which build directly on these foundations.
A feature store is a centralised repository for storing, versioning, and serving the engineered features that models consume. Without one, every team recomputes the same features independently, often with subtle inconsistencies between the training pipeline and the serving pipeline (training-serving skew).
Training-serving skew is one of the most insidious bugs in ML systems. The feature "average transaction amount over 30 days" might be computed slightly differently in the training SQL query and the serving Java code. The model trained on one definition and serves with another. Performance degrades silently because the feature values are close enough to avoid obvious errors but different enough to shift predictions.
Feature stores solve this by providing a single computation definition used in both training and serving. They also enable feature reuse across teams: if the fraud team already computes "user account age in days," the recommendations team can use the same feature without re-implementing it. Major implementations include Feast (open source), Tecton, and Hopsworks.
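The principle is easiest to see without any particular framework: a minimal sketch in which one Python function is the single definition of "average spend over 30 days," imported by both the training job and the online service. The function name and data shapes are illustrative, not Feast's or Tecton's API.

```python
# The core idea behind a feature store: one definition of the feature, shared
# by the training pipeline and the serving path, so the computation cannot
# silently diverge. Names and types are illustrative.
from datetime import datetime, timedelta
from typing import List, Tuple

Transaction = Tuple[datetime, float]   # (timestamp, amount)

def avg_spend_30d(transactions: List[Transaction], as_of: datetime) -> float:
    """Average transaction amount over the 30 days strictly before `as_of`.

    The single source of truth: the training job and the online service both
    call this function instead of re-implementing it in SQL and in Java.
    """
    window_start = as_of - timedelta(days=30)
    amounts = [amt for ts, amt in transactions if window_start <= ts < as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

txns = [(datetime(2024, 5, 10), 40.0), (datetime(2024, 5, 20), 60.0)]
print(avg_spend_30d(txns, as_of=datetime(2024, 6, 1)))   # 50.0
```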
With an understanding of feature stores in place, the discussion can now turn to model registries, which build directly on these foundations.
A model registry is to ML models what a container registry is to Docker images. It stores versioned model artefacts along with metadata: the training data hash, hyperparameters, evaluation metrics, who trained it, and when. When a production incident occurs, the registry enables instant rollback to the previous model version. Without one, rolling back means scrambling to find the right checkpoint file on someone's laptop.
Model registries also enforce governance. Before a model can be promoted from "staging" to "production," it must pass automated checks: evaluation metrics exceed a threshold, bias audits are clean, and a human reviewer has approved it. This promotion workflow is the ML equivalent of a code review and merge process.
MLflow Model Registry and Weights & Biases are widely used. Cloud providers offer integrated registries: Amazon SageMaker Model Registry, Azure ML Model Registry, and Google Vertex AI Model Registry. The choice depends on your existing infrastructure, but the principle is the same: never deploy a model you cannot trace back to its training data and configuration.
“Only a tiny fraction of real-world ML systems is composed of the ML code. The required surrounding infrastructure is vast and complex.”
Sculley, D. et al., 'Hidden Technical Debt in Machine Learning Systems', NeurIPS (2015) - Figure 1
The famous 'ML code as a small rectangle surrounded by vast infrastructure' diagram from this paper. Feature stores, model registries, monitoring, and serving infrastructure are not optional extras; they are the majority of the system.
Common misconception
“MLOps is just DevOps with a different name.”
DevOps manages code artefacts. MLOps manages code, data, and model artefacts, all of which can change independently and all of which affect system behaviour. A code deployment is deterministic: the same code produces the same behaviour. A model deployment depends on the data it was trained on, the features it receives, and the distribution of production inputs. This additional complexity is why MLOps requires specialised tooling: feature stores, model registries, drift monitors, and experiment trackers that have no equivalent in traditional software engineering.
A fraud detection system must decide whether to approve or block a credit card transaction in under 100 milliseconds. Which serving strategy is appropriate?
Zillow's iBuying model was trained on pre-pandemic housing data and deployed during the pandemic. The features (location, square footage) were still available, but the price predictions were wrong. What type of drift occurred?
A team trains their model using a SQL query that computes 'average_spend_30d' by including the current day's transactions. The serving code computes the same feature but excludes the current day. What problem does this create?
Sculley, D. et al., 'Hidden Technical Debt in Machine Learning Systems', NeurIPS (2015)
Full paper
Foundational paper establishing that ML systems accumulate technical debt through data dependencies, configuration, and monitoring gaps. Introduced the concept that ML code is a small fraction of a production system.
Parker, K., 'Zillow's Home-Flipping Debacle Shows Limits of AI', Bloomberg (2021)
Full article
Primary reporting on Zillow's $569 million iBuying loss. Documents how the pricing model failed to adapt to pandemic-era market dynamics and lacked adequate monitoring infrastructure.
Gama, J. et al., 'A Survey on Concept Drift Adaptation', ACM Computing Surveys (2014)
Sections 2-4
Thorough taxonomy of drift types (sudden, gradual, incremental, recurring) and detection methods. Establishes the theoretical framework for understanding why deployed models degrade over time.
Section III: MLOps Architecture
Systematic review of MLOps principles, roles, and architecture patterns. Defines the canonical pipeline from experimentation through deployment to monitoring.
Sections on CT/CD/CM
Industry reference defining continuous training, continuous delivery, and continuous monitoring as the three pillars of MLOps maturity. Informed by Google's decade of production ML experience.
You now understand how to deploy models reliably and monitor them in production. But production systems face more than drift. They face adversaries. The next module examines the security threats that target ML systems specifically: prompt injection, data poisoning, adversarial examples, and model extraction attacks.