Practice and strategy · Module 2

Scaling, cost and reliability in AI systems

Scaling is not a single knob.

1.7h 4 outcomes AI Advanced

Previously

AI systems and model architectures

A model is a component that maps inputs to outputs.

This module

Scaling, cost and reliability in AI systems

Scaling is not a single knob.

Next

Evaluation, monitoring and governance in production AI

Evaluation in production is not a single score.

Progress

Mark this module complete when you can explain it without rereading every paragraph.


What you will be able to do

  • Explain scaling, cost and reliability in AI systems in your own words and apply it to a realistic scenario.
  • Explain how scaling increases capability and cost, and why reliability comes from limits and budgets.
  • Check the assumption "Budgets are set" and explain what changes if it is false.
  • Check the assumption "Failure is expected" and explain what changes if it is false.

Before you begin

  • Comfort with earlier modules in this track
  • Ability to explain trade-offs and risks without jargon

Common ways people get this wrong

  • Unbounded usage. Without budgets and limits, the system becomes expensive and unstable.
  • Capacity surprises. Spikes happen. If you cannot absorb spikes safely, you will fail under success.

Main idea at a glance

Scaling and failure in AI systems

Assume load and failure. Design the cache and fallback paths early.

Stage 1

Traffic increases

User or system load increases beyond what the current capacity handles.

I think most teams do not plan for this early enough.

Scale planning should include failure and fallback behaviour from day one.

Scaling is not a single knob. It is a set of constraints you discover when traffic hits. The first surprise is that models are only one part of the system. Data fetches, feature generation, retrieval calls, and post processing often become the bottleneck before the model does.

When people say scale, they usually mean capacity. Vertical scaling is making a single instance bigger, like more CPU, more GPU, or more memory. Horizontal scaling is adding more instances of the same service. In AI systems, vertical scaling buys you headroom per request. Horizontal scaling buys you throughput and fault tolerance. You usually need both, but you should start by making the service stateless so horizontal scaling is possible.
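As a back-of-envelope illustration of sizing for horizontal scaling, here is a minimal sketch; the throughput numbers are hypothetical, so measure your own per-instance capacity first:

```python
import math

# Back-of-envelope sizing for horizontal scaling. All numbers are
# hypothetical: measure your own per-instance throughput first.

def instances_needed(peak_rps: float, per_instance_rps: float,
                     headroom: float = 0.7) -> int:
    """Instances required if each runs at `headroom` of its measured max."""
    usable = per_instance_rps * headroom
    return math.ceil(peak_rps / usable)

# One instance sustains 20 req/s; plan for a 300 req/s peak.
print(instances_needed(300, 20))  # 22
```

The headroom factor matters: running instances at 100% of measured throughput leaves nothing to absorb a spike or a lost instance.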

Interactive lab


This module includes an interactive practice component. Open the deeper tool or workspace step when you want to test the idea rather than only read it.

Stateless inference services are easier to deploy and easier to recover. You can restart a bad instance without losing user state. You can spread load across instances without sticky sessions. If you need state, push it to a data store with clear ownership and clear timeouts.

Two levers help before you add more compute: caching and batching. Caching is serving repeated requests from memory or a fast store. Batching is grouping multiple requests into one model call so you use hardware more efficiently. Both reduce cost, but both can change behaviour. Caching can serve stale answers. Batching can increase latency for the first request in the batch. These are product decisions, not just infrastructure choices.
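A minimal sketch of both levers, assuming a hypothetical `model_fn` that accepts a list of requests. The TTL cache makes the staleness trade-off explicit, and the batcher makes the grouping visible:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Serve repeated requests from memory, with a staleness limit (TTL)."""

    def __init__(self, ttl_seconds: float, max_items: int = 1024):
        self.ttl = ttl_seconds
        self.max_items = max_items
        self._store = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        expires_at, value = item
        if time.monotonic() > expires_at:  # too stale: evict and miss
            del self._store[key]
            return None
        return value

    def put(self, key, value):
        if len(self._store) >= self.max_items:
            self._store.popitem(last=False)  # evict the oldest entry
        self._store[key] = (time.monotonic() + self.ttl, value)

def batched_call(model_fn, requests, max_batch: int = 8):
    """Group requests into one model call per batch of up to max_batch."""
    results = []
    for i in range(0, len(requests), max_batch):
        results.extend(model_fn(requests[i:i + max_batch]))
    return results
```

Note how the product decisions show up as parameters: `ttl_seconds` bounds how stale a cached answer may be, and `max_batch` bounds how long the first request in a batch can wait.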

Scaling data can hurt more than scaling models. Retrieval adds network calls, index lookups, and permission checks. Feature pipelines add joins, transformations, and schema drift. If your data path is slow, scaling the model will not fix it. You are just making the wrong part faster.

Cost shows up as cost per request, and the hidden multiplier is how many steps your system takes per request. One model call is expensive. A model call plus retrieval, plus reranking, plus embeddings, plus retries can turn a single user action into a small workflow. The user sees one answer. You pay for the whole graph.

Embeddings and retrieval have their own cost curve. Embeddings can be expensive to compute and expensive to store at scale. Retrieval can be cheap per query and still expensive overall because it runs for every request. The easiest way to lose money is to add a pipeline step that runs always, even when it only helps sometimes.
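To see the multiplier, it can help to price the whole graph rather than the model call alone. The per-step prices below are invented purely for illustration:

```python
# Rough per-request cost model: the user sees one answer, you pay for the
# whole graph. All prices here are made up for illustration.

STEP_COST_USD = {
    "embedding":  0.0001,
    "retrieval":  0.0002,
    "rerank":     0.0004,
    "model_call": 0.0030,
}

def request_cost(steps, retries=0):
    base = sum(STEP_COST_USD[s] for s in steps)
    # Simplification: each retry repeats only the model call.
    return base + retries * STEP_COST_USD["model_call"]

plain = request_cost(["model_call"])
full = request_cost(["embedding", "retrieval", "rerank", "model_call"], retries=1)
print(f"plain: ${plain:.4f}, full pipeline: ${full:.4f}")
```

Even with invented numbers, the shape of the result holds: steps that run on every request dominate the bill, which is why an always-on pipeline step that only helps sometimes is the easiest way to lose money.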

Overprovisioning is the second way to lose money. Keeping GPUs warm for peak traffic feels safe, but it can make costs explode if the peak is rare. Autoscaling helps, but cold starts can hurt latency. This is why you trade off accuracy, latency, and cost as a trio. A slower model might be more accurate. A faster model might be cheaper. A cached answer might be good enough most of the time.
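The overprovisioning risk is easy to quantify with a rough sketch. The GPU counts below are hypothetical, and the comparison ignores cold starts and scaling lag:

```python
# Hypothetical monthly GPU-hours: keeping peak capacity warm versus
# autoscaling toward average load (ignores cold starts and scaling lag).

HOURS_PER_MONTH = 730

def monthly_gpu_hours(peak_gpus: int, avg_gpus: float):
    warm = peak_gpus * HOURS_PER_MONTH  # provision for the peak, always on
    auto = avg_gpus * HOURS_PER_MONTH   # follow the average load instead
    return warm, auto

warm, auto = monthly_gpu_hours(peak_gpus=20, avg_gpus=4)
print(warm, auto, warm / auto)  # 14600 2920 5.0
```

A fivefold gap like this is what makes the accuracy, latency, and cost trio a real decision rather than a slogan: the cheap option only stays cheap if its cold-start latency is acceptable.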

Reliability is where the real lessons are. Models fail in boring ways. They time out. They return empty output. They return something plausible and wrong. They get slower under load. They hit rate limits from an upstream provider. Partial failures are normal in distributed systems, so assume they will happen.

Retries are useful, but they are dangerous under load. If every request times out and you retry immediately, you double traffic into a system that is already struggling. This is why you need a retry policy.


A good retry policy uses backoff and caps. It retries only when it makes sense, like transient network issues, not when a model is overloaded. It also needs per request budgets, so retries do not turn a slow failure into an outage.
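A minimal sketch of such a policy, assuming a hypothetical `TransientError` that callers raise only for retryable failures:

```python
import random
import time

class TransientError(Exception):
    """A retryable failure, e.g. a flaky network hop.
    An overloaded model should NOT raise this."""

def call_with_retries(call, max_retries=3, base_delay=0.1, budget_s=2.0):
    """Retry `call` on TransientError with exponential backoff, a retry cap,
    and a per-request time budget so retries cannot run away."""
    deadline = time.monotonic() + budget_s
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_retries or time.monotonic() >= deadline:
                raise  # cap or budget exhausted: fail fast, do not pile on
            # Exponential backoff with jitter avoids synchronised retry storms.
            delay = max(0.0, min(base_delay * 2 ** attempt,
                                 deadline - time.monotonic()))
            time.sleep(delay * random.uniform(0.5, 1.0))
```

The jitter is not optional decoration: without it, clients that failed at the same moment all retry at the same moment, recreating the spike that caused the failure.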

The practical goal is graceful degradation.


That can mean switching to a smaller model, skipping retrieval, returning a cached answer, or showing a clear fallback message. The key is that failure is expected, not exceptional. Your architecture should make the fallback path explicit and testable.
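One way to make the fallback path explicit is to write it as an ordered chain. In this sketch, `primary`, `smaller_model`, and `cache_lookup` are hypothetical callables, and the first success wins:

```python
# An explicit, testable fallback chain. The callables are placeholders;
# the degraded flag tells the caller (and your metrics) which path ran.

def answer(query, primary, smaller_model, cache_lookup):
    steps = ((primary, False), (smaller_model, True), (cache_lookup, True))
    for step, degraded in steps:
        try:
            result = step(query)
            if result is not None:
                return result, degraded
        except Exception:
            continue  # fall through to the next, cheaper path
    # Last resort: a clear message, never a silent or cryptic failure.
    return "Sorry, we cannot answer right now. Please try again.", True
```

Because each step is just a callable, the degraded paths can be exercised directly in tests by injecting failures, which is exactly what "explicit and testable" demands.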

What I audit first when someone says “we are production ready”

I do not start with the model card. I start with the boring bits: rate limiting, authentication, logging, and rollbacks. If those are weak, the rest is theatre.

Here is the quick list I use: Can we turn it off. Can we fall back. Can we explain what happened. Can we prove what data it used. Can we prove who accessed what. If the answer is no, it is not production ready. It might be impressive, but it is not safe.

My production-readiness audit sequence

  1. Can we turn it off safely

    A kill switch is mandatory when user harm or system instability appears.

  2. Can we fall back safely

    Define the reduced-mode experience before incidents happen.

  3. Can we explain what happened

    Keep enough logs and trace identifiers to reconstruct decisions.

  4. Can we prove data use and access controls

    Capture data provenance and permission evidence for audits and incident response.

Mental model

Scale is a trade-off

Scaling increases capability and cost. Reliability comes from limits and budgets.

  1. Scale up

  2. Cost

  3. Latency

  4. Reliability

Assumptions to keep in mind

  • Budgets are set. Budgets for latency and cost stop ‘just one more token’ from becoming a system habit.
  • Failure is expected. Services fail. A reliable system degrades gracefully instead of collapsing.

Failure modes to notice

  • Unbounded usage. Without budgets and limits, the system becomes expensive and unstable.
  • Capacity surprises. Spikes happen. If you cannot absorb spikes safely, you will fail under success.

Key terms

horizontal scaling
Horizontal scaling is adding more instances of a service to handle more load.

Check yourself

Check your understanding of scaling and reliability


What is horizontal scaling

Adding more service instances to handle more load.

Why do stateless services scale more easily

Because any instance can handle any request and instances can restart without losing user state.

Name two levers that can reduce cost before adding more compute

Caching and batching.

Why can scaling data hurt more than scaling models

Because retrieval and feature pipelines can add slow network and transformation work that becomes the real bottleneck.

What is a common hidden cost driver in AI systems

Extra per request steps like retrieval, embeddings, reranking, and retries.

Scenario. An upstream provider rate limits you and latency spikes. What is a sensible reliability response

Gracefully degrade: use a cached answer, a smaller model, or a safe message, and back off retries. Do not hammer the dependency.

Why can retries be dangerous under load

They add traffic to a system that is already struggling and can turn slowdowns into outages.

What should a retry policy define

When to retry, how many times, and how long to wait between attempts.

What is graceful degradation

Providing a safe, simpler experience when parts fail, such as a fallback model or cached answer.

Why should failure be treated as expected in architecture

Because distributed dependencies and model services fail in normal ways, so fallback paths must be explicit and testable.

Artefact and reflection

Artefact

A concise design or governance brief that can be reviewed by a team

Reflection

Where in your work would explaining scaling, cost and reliability in AI systems, and applying it to a realistic scenario, change a decision, and what evidence would make you trust that change?

Optional practice

Step through the noise-to-image process that powers generative AI image models and see how denoising works.