This is the fifth of 8 Applied modules (24 modules total). You have built, fine-tuned, and evaluated models in previous modules. Now the question shifts from "does my model work?" to "does my model keep working after deployment?" The operational discipline covered here separates prototypes that impress in a notebook from systems that deliver value in production.

Real-world failure · November 2021
Zillow Offers used a machine learning model to predict home prices and buy properties algorithmically. In controlled evaluation, the model appeared accurate. In production, the housing market shifted during the pandemic: supply constraints, remote work migration patterns, and bidding wars broke the statistical relationships the model had learned.
The model continued to buy homes at prices it predicted would rise. They did not. Zillow wrote down $569 million in losses, laid off 25% of its workforce, and shut down the iBuying division entirely in November 2021.
The failure was not in model training. The failure was in deployment: no monitoring detected that input distributions had shifted, no circuit breaker paused purchasing when predictions diverged from market reality, and no A/B test validated that the model's real-world decisions were producing the expected outcomes.
Your model scores well in testing. What could possibly go wrong in production?
Zillow's story illustrates the central lesson of MLOps: a model that works in evaluation can fail catastrophically in production if you do not monitor it, version it, and test it continuously. This module covers the infrastructure and practices that prevent that outcome.
With the stakes established, this module begins by examining batch versus real-time serving in depth.
Batch serving runs inference on a schedule. You process all new data at a fixed interval (hourly, daily, nightly) and store the predictions for later retrieval. Recommendation engines on e-commerce sites commonly use batch serving: they precompute recommendations for every user overnight and serve them from a cache the next day. Latency does not matter because the prediction is available before the user requests it.
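As a rough sketch, a nightly batch job might look like the following, assuming a scikit-learn classifier persisted with joblib and a flat file of user features; the file, column, and function names are illustrative, not part of any specific platform.

```python
# Nightly batch scoring: score every user and store predictions for later retrieval.
# Assumes a trained scikit-learn classifier persisted with joblib and user features
# in a CSV; file and column names are illustrative.
import joblib
import pandas as pd

def run_nightly_batch(model_path: str, features_path: str, output_path: str) -> None:
    model = joblib.load(model_path)                      # load the currently approved model
    users = pd.read_csv(features_path)                   # all users needing fresh predictions
    feature_cols = [c for c in users.columns if c != "user_id"]
    users["score"] = model.predict_proba(users[feature_cols])[:, 1]
    users[["user_id", "score"]].to_csv(output_path, index=False)   # serving layer reads this cache

if __name__ == "__main__":
    run_nightly_batch("model.joblib", "user_features.csv", "scores.csv")
```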
Real-time serving runs inference at request time. When a user sends a query, the model computes and returns a prediction immediately. Fraud detection must operate in real time: the model sees a transaction, makes a decision in milliseconds, and either approves or blocks the charge before the merchant processes it. Latency is critical. Every additional millisecond in the inference path risks customer abandonment or regulatory non-compliance.
The choice is not always binary. Many production systems use a hybrid approach: batch-computed features (like a user's average spend over the past 30 days) are combined with real-time signals (like the current transaction amount and location) at inference time. This pattern gives you the cost efficiency of batch computation for stable features and the responsiveness of real-time inference for volatile ones.
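A minimal sketch of that hybrid pattern, with a plain dictionary standing in for the online feature cache (in practice Redis or a feature store's online table); the user IDs, feature names, and derived signals are illustrative.

```python
# Hybrid serving: stable, precomputed batch features are joined with
# volatile real-time signals at request time. The dict stands in for a
# feature cache; all names are illustrative.
from typing import Dict

# Refreshed nightly by the batch pipeline.
BATCH_FEATURES: Dict[str, Dict[str, float]] = {
    "user_42": {"avg_spend_30d": 82.50, "txn_count_30d": 17.0},
}

def build_feature_vector(user_id: str, txn_amount: float, txn_hour: int) -> Dict[str, float]:
    stable = BATCH_FEATURES.get(user_id, {"avg_spend_30d": 0.0, "txn_count_30d": 0.0})
    return {
        **stable,                                    # batch-computed, hours old, cheap to serve
        "txn_amount": txn_amount,                    # real-time signal from the request
        "amount_vs_avg": txn_amount / max(stable["avg_spend_30d"], 1.0),
        "is_night": float(txn_hour < 6),
    }

print(build_feature_vector("user_42", txn_amount=930.0, txn_hour=2))
```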
“The hard part of machine learning is not building models. It is building production systems around models.”
Sculley, D. et al., 'Hidden Technical Debt in Machine Learning Systems', NeurIPS (2015) - Section 1: Introduction
This paper, authored by Google engineers, established that ML systems in production accumulate technical debt at an alarming rate. The model itself is a small fraction of the total system; serving infrastructure, monitoring, data pipelines, and configuration management dominate the engineering effort.
With an understanding of batch versus real-time serving in place, the discussion can now turn to data drift and concept drift, which build directly on these foundations.
Data drift (also called covariate shift) occurs when the distribution of input features changes over time. The model was trained on data that looked one way; production data now looks different. A spam filter trained on 2020 emails will see data drift by 2024 because the vocabulary, structure, and tactics of spam evolve continuously.
Concept drift occurs when the relationship between inputs and outputs changes. The features might look the same, but what they predict has shifted. Zillow's model experienced concept drift: the features (location, square footage, recent sales) still existed, but their relationship to price changed as pandemic dynamics rewrote the housing market. The model's learned function was correct for the old world and wrong for the new one.
Detecting drift requires monitoring. Statistical tests (Population Stability Index, Kolmogorov-Smirnov test, Jensen-Shannon divergence) compare production input distributions against training distributions. Performance monitoring tracks prediction accuracy against ground-truth labels as they become available. The combination of input monitoring and output monitoring catches both types of drift. Neither alone is sufficient.
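To make the input-monitoring half concrete, here is a minimal sketch of two common checks on a single numeric feature: the two-sample Kolmogorov-Smirnov test from scipy and a hand-rolled Population Stability Index. The alert thresholds are illustrative rules of thumb, not universal constants.

```python
# Drift checks on one numeric feature: KS test (scipy) and Population Stability Index.
import numpy as np
from scipy.stats import ks_2samp

def psi(train: np.ndarray, prod: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(train, np.linspace(0, 1, bins + 1))   # bin by training quantiles
    prod = np.clip(prod, edges[0], edges[-1])                  # keep production values inside the bins
    train_frac = np.histogram(train, edges)[0] / len(train)
    prod_frac = np.histogram(prod, edges)[0] / len(prod)
    train_frac = np.clip(train_frac, 1e-6, None)               # avoid log(0) for empty bins
    prod_frac = np.clip(prod_frac, 1e-6, None)
    return float(np.sum((prod_frac - train_frac) * np.log(prod_frac / train_frac)))

rng = np.random.default_rng(0)
train_values = rng.normal(100, 15, 50_000)     # feature distribution at training time
prod_values = rng.normal(115, 20, 5_000)       # same feature in production, shifted

stat, p_value = ks_2samp(train_values, prod_values)
score = psi(train_values, prod_values)
print(f"KS p-value: {p_value:.2e}   PSI: {score:.3f}")
if score > 0.2 or p_value < 0.01:              # illustrative alert thresholds
    print("Drift alert: investigate before trusting the next batch of predictions.")
```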
Common misconception
“If my model's accuracy was high in testing, it will stay high in production.”
Test accuracy is a snapshot. Production is a movie. Input distributions shift, user behaviour changes, the world evolves, and competitors adapt. Without continuous monitoring for data drift and concept drift, you will not know your model is failing until the business impact is undeniable. Zillow's model tested well. It lost $569 million in production because nobody was watching the drift.
With an understanding of data drift and concept drift in place, the discussion can now turn to A/B testing for models, which builds directly on these foundations.
Offline evaluation tells you how a model performs on historical data. A/B testing tells you how it performs on live users. Traffic is split between the current production model (control) and a candidate model (treatment). Business metrics (click-through rate, conversion, revenue, error rate) are compared with statistical significance tests.
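The significance check can be as simple as a two-proportion z-test on conversion counts from the two arms. A minimal sketch; the counts below are made up, and in practice they would come from the experiment's logging pipeline.

```python
# Comparing conversion rates between the control model (A) and the candidate
# model (B) with a two-sided two-proportion z-test. Counts are illustrative.
import math
from scipy.stats import norm

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                   # pooled rate under H0: no difference
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))                                 # two-sided p-value

p = two_proportion_p_value(conv_a=1_180, n_a=50_000, conv_b=1_260, n_b=50_000)
print(f"p-value: {p:.3f}")   # promote the candidate only if the lift is significant
                             # AND the guardrail metrics have not regressed
```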
A/B testing catches problems that offline evaluation cannot. A recommendation model might score well on historical click data but reduce engagement when deployed because it surfaces content users have already seen. Only a live experiment reveals this because the model's recommendations change the very data it will be evaluated on (a feedback loop).
Common pitfalls include running tests for too short a period (novelty effects inflate early results), contaminating the control group (users in the treatment group influence users in the control group through social sharing), and ignoring guardrail metrics (the candidate model improves click-through rate but increases customer support tickets).
With an understanding of A/B testing for models in place, the discussion can now turn to feature stores, which build directly on these foundations.
A feature store is a centralised repository for storing, versioning, and serving the engineered features that models consume. Without one, every team recomputes the same features independently, often with subtle inconsistencies between the training pipeline and the serving pipeline (training-serving skew).
Training-serving skew is one of the most insidious bugs in ML systems. The feature "average transaction amount over 30 days" might be computed slightly differently in the training SQL query and the serving Java code. The model trained on one definition and serves with another. Performance degrades silently because the feature values are close enough to avoid obvious errors but different enough to shift predictions.
Feature stores solve this by providing a single computation definition used in both training and serving. They also enable feature reuse across teams: if the fraud team already computes "user account age in days," the recommendations team can use the same feature without re-implementing it. Major implementations include Feast (open source), Tecton, and Hopsworks.
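The principle is easiest to see without any particular framework: a minimal sketch in which one Python function is the single definition of "average spend over 30 days," imported by both the training job and the online service. The function name and data shapes are illustrative, not Feast's or Tecton's API.

```python
# The core idea behind a feature store: one definition of the feature, shared
# by the training pipeline and the serving path, so the computation cannot
# silently diverge. Names and types are illustrative.
from datetime import datetime, timedelta
from typing import List, Tuple

Transaction = Tuple[datetime, float]   # (timestamp, amount)

def avg_spend_30d(transactions: List[Transaction], as_of: datetime) -> float:
    """Average transaction amount over the 30 days strictly before `as_of`.

    The single source of truth: the training job and the online service both
    call this function instead of re-implementing it in SQL and in Java.
    """
    window_start = as_of - timedelta(days=30)
    amounts = [amt for ts, amt in transactions if window_start <= ts < as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

txns = [(datetime(2024, 5, 10), 40.0), (datetime(2024, 5, 20), 60.0)]
print(avg_spend_30d(txns, as_of=datetime(2024, 6, 1)))   # 50.0
```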
With an understanding of feature stores in place, the discussion can now turn to model registries, which build directly on these foundations.
A model registry is to ML models what a container registry is to Docker images. It stores versioned model artefacts along with metadata: the training data hash, hyperparameters, evaluation metrics, who trained it, and when. When a production incident occurs, the registry enables instant rollback to the previous model version. Without one, rolling back means scrambling to find the right checkpoint file on someone's laptop.
Model registries also enforce governance. Before a model can be promoted from "staging" to "production," it must pass automated checks: evaluation metrics exceed a threshold, bias audits are clean, and a human reviewer has approved it. This promotion workflow is the ML equivalent of a code review and merge process.
MLflow Model Registry and Weights & Biases are widely used. Cloud providers offer integrated registries: Amazon SageMaker Model Registry, Azure ML Model Registry, and Google Vertex AI Model Registry. The choice depends on your existing infrastructure, but the principle is the same: never deploy a model you cannot trace back to its training data and configuration.
“Only a tiny fraction of real-world ML systems is composed of the ML code. The required surrounding infrastructure is vast and complex.”
Sculley, D. et al., 'Hidden Technical Debt in Machine Learning Systems', NeurIPS (2015) - Figure 1
The famous 'ML code as a small rectangle surrounded by vast infrastructure' diagram from this paper. Feature stores, model registries, monitoring, and serving infrastructure are not optional extras; they are the majority of the system.
Common misconception
“MLOps is just DevOps with a different name.”
DevOps manages code artefacts. MLOps manages code, data, and model artefacts, all of which can change independently and all of which affect system behaviour. A code deployment is deterministic: the same code produces the same behaviour. A model deployment depends on the data it was trained on, the features it receives, and the distribution of production inputs. This additional complexity is why MLOps requires specialised tooling: feature stores, model registries, drift monitors, and experiment trackers that have no equivalent in traditional software engineering.
A fraud detection system must decide whether to approve or block a credit card transaction in under 100 milliseconds. Which serving strategy is appropriate?
Zillow's iBuying model was trained on pre-pandemic housing data and deployed during the pandemic. The features (location, square footage) were still available, but the price predictions were wrong. What type of drift occurred?
A team trains their model using a SQL query that computes 'average_spend_30d' by including the current day's transactions. The serving code computes the same feature but excludes the current day. What problem does this create?
Sculley, D. et al., 'Hidden Technical Debt in Machine Learning Systems', NeurIPS (2015)
Full paper
Foundational paper establishing that ML systems accumulate technical debt through data dependencies, configuration, and monitoring gaps. Introduced the concept that ML code is a small fraction of a production system.
Parker, K., 'Zillow's Home-Flipping Debacle Shows Limits of AI', Bloomberg (2021)
Full article
Primary reporting on Zillow's $569 million iBuying loss. Documents how the pricing model failed to adapt to pandemic-era market dynamics and lacked adequate monitoring infrastructure.
Gama, J. et al., 'A Survey on Concept Drift Adaptation', ACM Computing Surveys (2014)
Sections 2-4
Thorough taxonomy of drift types (sudden, gradual, incremental, recurring) and detection methods. Establishes the theoretical framework for understanding why deployed models degrade over time.
Section III: MLOps Architecture
Systematic review of MLOps principles, roles, and architecture patterns. Defines the canonical pipeline from experimentation through deployment to monitoring.
Sections on CT/CD/CM
Industry reference defining continuous training, continuous delivery, and continuous monitoring as the three pillars of MLOps maturity. Informed by Google's decade of production ML experience.
You now understand how to deploy models reliably and monitor them in production. But production systems face more than drift. They face adversaries. The next module examines the security threats that target ML systems specifically: prompt injection, data poisoning, adversarial examples, and model extraction attacks.