Practice and strategy · Module 2
Scaling, cost and reliability in AI systems
Scaling is not a single knob.
Previously
AI systems and model architectures
A model is a component that maps inputs to outputs.
This module
Scaling, cost and reliability in AI systems
Scaling is not a single knob.
Next
Evaluation, monitoring and governance in production AI
Evaluation in production is not a single score.
Progress
Mark this module complete when you can explain it without rereading every paragraph.
What you will be able to do
- 1 Explain scaling, cost and reliability in AI systems in your own words and apply it to a realistic scenario.
- 2 Explain why scaling increases capability and cost, and why reliability comes from limits and budgets.
- 3 Check the assumption "Budgets are set" and explain what changes if it is false.
- 4 Check the assumption "Failure is expected" and explain what changes if it is false.
Before you begin
- Comfort with earlier modules in this track
- Ability to explain trade-offs and risks without jargon
Common ways people get this wrong
- Unbounded usage. Without budgets and limits, the system becomes expensive and unstable.
- Capacity surprises. Spikes happen. If you cannot absorb spikes safely, you will fail under success.
Main idea at a glance
Scaling and failure in AI systems
Assume load and failure. Design the cache and fallback paths early.
Stage 1
Traffic increases
User or system load increases beyond what the current capacity handles.
I think most teams do not plan for this early enough.
Scale planning should include failure and fallback behaviour from day one.
Scaling is not a single knob. It is a set of constraints you discover when traffic hits. The first surprise is that models are only one part of the system. Data fetches, feature generation, retrieval calls, and post processing often become the bottleneck before the model does.
When people say scale, they usually mean capacity, and there are two ways to add it. Vertical scaling is making a single instance bigger, like more CPU, more GPU, or more memory. Horizontal scaling is adding more instances of a service to handle more load. In AI systems, vertical scaling buys you headroom per request. Horizontal scaling buys you throughput and fault tolerance. You usually need both, but you should start by making the service stateless so scaling is possible.
Interactive lab
This module includes an interactive practice component. Open the deeper tool or workspace step when you want to test the idea rather than only read it.
Stateless inference services are easier to deploy and easier to recover. You can restart a bad instance without losing user state. You can spread load across instances without sticky sessions. If you need state, push it to a data store with clear ownership and clear timeouts.
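A minimal sketch of what this looks like in code, assuming a hypothetical `StateStore` standing in for an external store such as Redis or a database (the names and the in-memory dict are illustrative, not a real API):

```python
# Hypothetical external state store. In production this would be Redis
# or a database with a per-call timeout; a dict stands in here.
class StateStore:
    def __init__(self, timeout_s=0.1):
        self.timeout_s = timeout_s  # illustrative: the store owns its timeout
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value

def handle_request(store, user_id, message):
    """Stateless handler: all user state lives in the store, so any
    instance (including a freshly restarted one) can serve the request."""
    history = store.get(user_id, [])
    history = history + [message]
    store.put(user_id, history)
    return f"{len(history)} message(s) stored for {user_id}"

store = StateStore()
print(handle_request(store, "u1", "hello"))  # -> 1 message(s) stored for u1
print(handle_request(store, "u1", "again"))  # -> 2 message(s) stored for u1
```

Because the handler keeps nothing between calls, any instance can be restarted or replaced without losing user state; only the store has to be durable.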
Two levers help before you add more compute: caching and batching. Caching is serving repeated requests from memory or a fast store. Batching is grouping multiple requests into one model call so you use hardware more efficiently. Both reduce cost, but both can change behaviour. Caching can serve stale answers. Batching can increase latency for the first request in the batch. These are product decisions, not just infrastructure choices.
Scaling data can hurt more than scaling models. Retrieval adds network calls, index lookups, and permission checks. Feature pipelines add joins, transformations, and schema drift. If your data path is slow, scaling the model will not fix it. You are just making the wrong part faster.
Cost shows up as cost per request, and the hidden multiplier is how many steps your system takes per request. One model call is expensive. A model call plus retrieval, plus reranking, plus embeddings, plus retries can turn a single user action into a small workflow. The user sees one answer. You pay for the whole graph.
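Back-of-the-envelope arithmetic makes the multiplier visible. All prices below are made-up assumptions for the sketch, not real provider rates:

```python
# Illustrative per-request cost of a multi-step pipeline (assumed prices).
steps = {
    "embedding":  0.0001,
    "retrieval":  0.0002,
    "rerank":     0.0005,
    "model_call": 0.0030,
}
retry_rate = 0.10  # assume 10% of requests retry the model call once

cost_per_request = sum(steps.values()) + retry_rate * steps["model_call"]
monthly = cost_per_request * 1_000_000  # at one million requests/month
print(round(cost_per_request, 5))  # -> 0.0041
print(round(monthly, 2))           # -> 4100.0
```

The model call dominates, but the "small" steps plus retries add roughly a third on top, and they run on every request.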
Embeddings and retrieval have their own cost curve. Embeddings can be expensive to compute and expensive to store at scale. Retrieval can be cheap per query and still expensive overall because it runs for every request. The easiest way to lose money is to add a pipeline step that runs always, even when it only helps sometimes.
Overprovisioning is the second way to lose money. Keeping GPUs warm for peak traffic feels safe, but it can make costs explode if the peak is rare. Autoscaling helps, but cold starts can hurt latency. This is why you trade off accuracy, latency, and cost as a trio. A slower model might be more accurate. A faster model might be cheaper. A cached answer might be good enough most of the time.
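The overprovisioning point is easy to quantify. A rough comparison under assumed numbers (GPU price, fleet sizes, and peak duration are all illustrative):

```python
# Rough monthly cost: GPUs kept warm for a rare peak vs autoscaling.
gpu_hour = 2.50           # $ per GPU-hour (assumed)
hours_per_month = 730
peak_gpus, base_gpus = 20, 4
peak_hours = 30           # assume the peak occurs ~30 hours/month

always_warm = peak_gpus * gpu_hour * hours_per_month
autoscaled = (base_gpus * gpu_hour * hours_per_month
              + (peak_gpus - base_gpus) * gpu_hour * peak_hours)

print(round(always_warm))  # -> 36500: provisioned for peak all month
print(round(autoscaled))   # -> 8500: pay for peak only when it happens
```

The gap is the price of avoiding cold starts; whether it is worth paying depends on how much peak latency matters for the product.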
Reliability is where the real lessons are. Models fail in boring ways. They time out. They return empty output. They return something plausible and wrong. They get slower under load. They hit rate limits from an upstream provider. Partial failures are normal in distributed systems, so assume they will happen.
Retries are useful, but they are dangerous under load. If every request times out and you retry immediately, you double traffic into a system that is already struggling. This is why you need a retry policy.
A good retry policy uses backoff and caps. It retries only when it makes sense, like transient network issues, not when a model is overloaded. It also needs per request budgets, so retries do not turn a slow failure into an outage.
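A sketch of such a policy, assuming which exception types count as transient (the names and defaults are illustrative choices, not a prescription):

```python
import random
import time

RETRYABLE = (TimeoutError, ConnectionError)  # transient-looking failures only

def call_with_retries(fn, max_attempts=3, base_delay_s=0.1,
                      max_delay_s=2.0, budget_s=5.0):
    """Retry policy sketch: exponential backoff with jitter, an attempt
    cap, and a per-request time budget so retries cannot pile up."""
    deadline = time.monotonic() + budget_s
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == max_attempts or time.monotonic() >= deadline:
                raise  # out of attempts or out of budget: fail, don't hammer
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))  # jitter avoids thundering herds

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("transient")
    return "ok"

print(call_with_retries(flaky))  # -> ok (after two transient failures)
```

Note that only listed exception types are retried; an overloaded model should raise something outside `RETRYABLE` so the caller fails fast instead of adding load.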
The practical goal is graceful degradation.
That can mean switching to a smaller model, skipping retrieval, returning a cached answer, or showing a clear fallback message. The key is that failure is expected, not exceptional. Your architecture should make the fallback path explicit and testable.
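A fallback chain can be made explicit in code. This is a minimal sketch, assuming the primary model, smaller model, and cache are injected as callables and a dict (all names are illustrative):

```python
def answer_with_fallbacks(prompt, primary, smaller, cache):
    """Graceful degradation sketch: try the primary model, then a smaller
    one, then a cached answer, then an explicit safe message. Each
    fallback is a named, testable path rather than an accident."""
    for step in (primary, smaller):
        try:
            return step(prompt)
        except Exception:
            continue  # failure is expected; move down the chain
    if prompt in cache:
        return cache[prompt]
    return "Sorry, this feature is degraded right now. Please try again."

def down(prompt):  # stands in for an overloaded model endpoint
    raise TimeoutError("overloaded")

cache = {"hello": "cached greeting"}
print(answer_with_fallbacks("hello", down, down, cache))  # -> cached greeting
print(answer_with_fallbacks("other", down, down, cache))  # -> safe message
```

Because the chain is an ordinary function, each degraded path can be unit-tested by injecting failing callables, which is exactly what "explicit and testable" means here.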
What I audit first when someone says “we are production ready”
I do not start with the model card. I start with the boring bits: rate limiting, authentication, logging, and rollbacks. If those are weak, the rest is theatre.
Here is the quick list I use: Can we turn it off? Can we fall back? Can we explain what happened? Can we prove what data it used? Can we prove who accessed what? If the answer to any of these is no, it is not production ready. It might be impressive, but it is not safe.
My production-readiness audit sequence
- Can we turn it off safely? A kill switch is mandatory when user harm or system instability appears.
- Can we fall back safely? Define the reduced-mode experience before incidents happen.
- Can we explain what happened? Keep enough logs and trace identifiers to reconstruct decisions.
- Can we prove data use and access controls? Capture data provenance and permission evidence for audits and incident response.
Mental model
Scale is a trade-off
Scaling increases capability and cost. Reliability comes from limits and budgets.
- 1 Scale up
- 2 Cost
- 3 Latency
- 4 Reliability
Assumptions to keep in mind
- Budgets are set. Budgets for latency and cost stop ‘just one more token’ from becoming a system habit.
- Failure is expected. Services fail. A reliable system degrades gracefully instead of collapsing.
Failure modes to notice
- Unbounded usage. Without budgets and limits, the system becomes expensive and unstable.
- Capacity surprises. Spikes happen. If you cannot absorb spikes safely, you will fail under success.
Key terms
- horizontal scaling
- Horizontal scaling is adding more instances of a service to handle more load.
Check yourself
Check your understanding of scaling and reliability
What is horizontal scaling?
Adding more service instances to handle more load.
Why do stateless services scale more easily?
Because any instance can handle any request and instances can restart without losing user state.
Name two levers that can reduce cost before adding more compute.
Caching and batching.
Why can scaling data hurt more than scaling models?
Because retrieval and feature pipelines can add slow network and transformation work that becomes the real bottleneck.
What is a common hidden cost driver in AI systems?
Extra per-request steps like retrieval, embeddings, reranking, and retries.
Scenario: an upstream provider rate limits you and latency spikes. What is a sensible reliability response?
Gracefully degrade: use a cached answer, a smaller model, or a safe message, and back off retries. Do not hammer the dependency.
Why can retries be dangerous under load?
They add traffic to a system that is already struggling and can turn slowdowns into outages.
What should a retry policy define?
When to retry, how many times, and how long to wait between attempts.
What is graceful degradation?
Providing a safe, simpler experience when parts fail, such as a fallback model or cached answer.
Why should failure be treated as expected in architecture?
Because distributed dependencies and model services fail in normal ways, so fallback paths must be explicit and testable.
Artefact and reflection
Artefact
A concise design or governance brief that can be reviewed by a team
Reflection
Where in your work would explaining scaling, cost and reliability in AI systems, applied to a realistic scenario, change a decision, and what evidence would make you trust that change?
Optional practice
Step through the noise-to-image process that powers generative AI image models and see how denoising works.