
Production incident · 2024
In 2024, an AI startup operating a production agent system experienced a significant cascading failure caused by a monitoring gap that had gone undetected through months of gradual traffic growth. The system routed customer support queries through a multi-step agent that accumulated conversation history in its context window. Under normal load, conversations stayed well within the model's context limit.
As daily query volume grew, a subset of complex queries began generating unusually long tool-call sequences. Each tool response added several hundred tokens to the context. After five to seven tool calls, the context window approached its limit. The model began truncating earlier parts of the conversation silently. Without the original instructions in context, agents began producing off-format and incomplete responses. The error rate climbed gradually, staying below the alert threshold for hours.
By the time the error rate crossed the alerting threshold, dozens of customer cases had received incomplete responses. Investigation revealed that nobody had instrumented context window utilisation as a metric. The team had latency metrics, error counts, and queue depth. They had no visibility into a resource that grows with every tool call and silently degrades quality before it causes an explicit failure. The fix required adding context utilisation as a monitored signal and implementing sliding-window context management.
If your monitoring cannot detect that an agent's context window is silently filling to its limit, what else might it be missing?
Architecture describes the system; deployment makes it real. This module covers containerisation with Docker, orchestration with Kubernetes, CI/CD pipelines, and the monitoring infrastructure that keeps production agents observable and controllable.
The module begins by examining why production readiness means designing for failure.
A well-designed agent that works reliably in development can fail in production for reasons that never surface before deployment: edge cases in real user inputs, traffic spikes at unexpected times, LLM API rate limits under sustained load, slow network conditions affecting tool calls, and gradual resource exhaustion of the kind described above.
Production readiness is not a feature added at the end. It is an approach built into every deployment decision: packaging the agent in a reproducible container, defining what "healthy" means before deployment, instrumenting every metric that can degrade, testing changes on a small fraction of traffic before committing, and having an automatic rollback path that does not require human intervention at 3 AM.
The question is not whether your agent will fail in production. It will. The question is whether you will know within minutes, and whether you can restore service in seconds without requiring a human to be awake.
With that mindset established, the discussion turns to containerisation with Docker.
Containerisation packages an application with all its dependencies into a single image that runs identically across development, staging, and production environments. Docker is the dominant container runtime. Containers eliminate the "works on my machine" problem that plagues manual deployments and enable CI/CD (Continuous Integration and Continuous Deployment) pipelines that build, test, and deploy automatically.
A production Dockerfile for an agent service follows four best practices. First, use a multi-stage build: a builder stage installs dependencies, a runtime stage copies only what is needed to run. This reduces image size and attack surface. Second, create a non-root user and run the process as that user. Many container escape vulnerabilities require root; a non-root process limits their impact. Third, copy dependency files before application code. Docker caches each layer; if dependencies have not changed, the layer is reused and subsequent builds take seconds instead of minutes. Fourth, include a health check endpoint at /health so Kubernetes and load balancers can detect unhealthy instances and stop routing traffic to them.
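A minimal sketch of these four practices is shown below. It assumes a FastAPI application in app/main.py served by uvicorn on port 8000; the file layout, user name, and port are illustrative, not prescriptive.

```dockerfile
# --- Builder stage: install dependencies into an isolated prefix ---
FROM python:3.12-slim AS builder
WORKDIR /app
# Copy dependency files first so this layer is cached when only code changes
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# --- Runtime stage: copy only what is needed to run ---
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY app/ ./app/

# Run as a non-root user to limit the impact of container escapes
RUN useradd --create-home appuser
USER appuser

# Health check hits the /health endpoint served by the application
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```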
Expose your agent as an HTTP service using an async Python web framework such as FastAPI. A minimal agent service has two endpoints: GET /health returns 200 OK when the service is ready to accept requests, and POST /agent/run accepts a task payload and returns the agent response. Wrap all agent execution in exception handling; never let an unhandled exception crash the worker process.
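A minimal sketch of such a service with FastAPI; the run_agent function is a placeholder for your agent's real entry point.

```python
# Minimal agent HTTP service; run_agent is a hypothetical stand-in for your agent loop.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class AgentRequest(BaseModel):
    task: str


async def run_agent(task: str) -> str:
    # Placeholder: replace with the real agent execution logic
    return f"completed: {task}"


@app.get("/health")
async def health() -> dict:
    # Returns 200 OK while the process is able to accept requests
    return {"status": "ok"}


@app.post("/agent/run")
async def agent_run(request: AgentRequest) -> dict:
    try:
        result = await run_agent(request.task)
        return {"result": result}
    except Exception as exc:
        # Never let an unhandled exception crash the worker process
        raise HTTPException(status_code=500, detail=str(exc))
```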
With the agent containerised and exposed as a service, the discussion turns to monitoring and alerting through the four golden signals.
Google's SRE (Site Reliability Engineering) book defines four golden signals for monitoring any service: latency, traffic, errors, and saturation. Adapted for agent systems, these map as follows.
Latency: measure response time as a histogram to compute percentiles. p50 is typical performance. p95 is what most users experience. p99 is your tail latency, where the worst 1% of requests land. Alert when p99 exceeds your SLA threshold, commonly 30 seconds for agent tasks. Traffic: measure requests per second with a baseline from the previous week. Alert on sudden 3x spikes, which often indicate a runaway loop or external traffic anomaly.
Errors: measure the fraction of requests returning errors. Alert when error rate exceeds 1% over a five-minute window. Saturation: measure queue depth and token budget consumption. Alert when queue depth exceeds 500 tasks or when daily token budget consumption passes 80%. Add context window utilisation as a fifth signal specific to agents: alert when any agent run exceeds 80% of the model's context limit, which indicates a risk of silent quality degradation before an explicit error.
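One possible way to instrument these signals in Python is with the prometheus_client library; the metric names, histogram buckets, and 200,000-token context limit below are illustrative assumptions rather than a standard.

```python
# Golden-signal instrumentation sketch using prometheus_client;
# metric names, buckets, and thresholds are assumptions, not a standard.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "agent_request_seconds", "End-to-end agent latency",
    buckets=(1, 2, 5, 10, 20, 30, 60, 120),
)
REQUESTS_TOTAL = Counter("agent_requests_total", "Requests received")   # traffic
ERRORS_TOTAL = Counter("agent_errors_total", "Requests that failed")    # errors
QUEUE_DEPTH = Gauge("agent_queue_depth", "Tasks waiting in the queue")  # saturation
CONTEXT_UTILISATION = Gauge(
    "agent_context_utilisation_ratio", "Fraction of the context window used"
)

CONTEXT_LIMIT_TOKENS = 200_000  # assumed model context limit; adjust for your model


def observe_run(duration_s: float, prompt_tokens: int, failed: bool) -> None:
    """Record one agent run against the signals above."""
    REQUESTS_TOTAL.inc()
    REQUEST_LATENCY.observe(duration_s)
    if failed:
        ERRORS_TOTAL.inc()
    CONTEXT_UTILISATION.set(prompt_tokens / CONTEXT_LIMIT_TOKENS)


if __name__ == "__main__":
    start_http_server(9000)  # exposes /metrics for Prometheus to scrape
    observe_run(duration_s=12.4, prompt_tokens=160_000, failed=False)
    time.sleep(60)
```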
“The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.”
Beyer, B. et al. (2016). Site Reliability Engineering, Chapter 6: Monitoring Distributed Systems. Google LLC.
The four golden signals provide a minimal but complete monitoring framework. They capture the user experience (latency, errors), the load on the system (traffic), and whether it is approaching its limits (saturation). For agent systems, context window utilisation is a fifth saturation metric unique to LLM-based services.
With monitoring and alerting in place, the discussion turns to A/B testing agent configurations.
A/B testing (also called split testing) runs two agent configurations simultaneously: A is the control (current behaviour) and B is the treatment (proposed change). Traffic is split between them, and you measure which performs better on defined quality metrics before committing the change to 100% of traffic.
User assignment must be consistent: the same user should always see the same variant for a given experiment. Inconsistent assignment creates confusing experiences and contaminates the measurement. Hash the user ID together with the experiment name, normalise to a 0-1 range, and assign variant B if the value is below the traffic split percentage. This is deterministic: no state is required.
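A minimal sketch of this deterministic assignment; the function name, experiment name, and 5% split are illustrative.

```python
# Deterministic variant assignment: the same user and experiment always map to the same variant.
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_fraction: float = 0.05) -> str:
    """Return 'B' (treatment) for roughly `treatment_fraction` of users, else 'A' (control)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Normalise the first 8 hex characters to a value in [0, 1]
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "B" if bucket < treatment_fraction else "A"


# Example: a hypothetical experiment routing 5% of users to the treatment model
print(assign_variant("user-1234", "new-model-rollout", treatment_fraction=0.05))
```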
Define the success metrics before starting the experiment, not after. The metrics that matter for agents are: task completion rate (did the agent complete the intended task?), token efficiency (tokens used per successfully completed task), error rate, and where applicable, user satisfaction ratings. Run the experiment until you have sufficient statistical power to distinguish a real effect from noise. For most agent changes, 500 requests per variant is a minimum; 2,000 per variant provides more confidence.
Common misconception
“You can test a new agent configuration by deploying it and watching the dashboard for a few hours.”
A few hours of monitoring rarely provides sufficient statistical power to distinguish a real performance change from random variation. Define success metrics before deploying. Calculate the minimum sample size needed to detect the effect size you care about. Run the experiment until you reach that sample size. Stopping early based on early trends leads to false conclusions. Use consistent variant assignment (same user always gets same variant) to avoid contamination.
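As a rough illustration of that sample-size calculation, the sketch below applies the standard two-proportion formula at 95% confidence and 80% power; the baseline completion rate and minimum detectable effect are assumed values.

```python
# Rough per-variant sample size for comparing two proportions (e.g. task completion rate).
# Uses z values for 95% confidence (1.96) and 80% power (0.84); inputs are illustrative.
import math


def sample_size_per_variant(baseline: float, minimum_detectable_effect: float) -> int:
    p1 = baseline
    p2 = baseline + minimum_detectable_effect
    z_alpha, z_beta = 1.96, 0.84
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(numerator / (p1 - p2) ** 2)


# Detecting a 5-point improvement on an 85% completion rate needs roughly 680 requests
# per variant, in line with the "500 minimum, 2,000 for confidence" guidance above.
print(sample_size_per_variant(baseline=0.85, minimum_detectable_effect=0.05))
```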
With a way to test configurations safely, the discussion turns to blue-green deployment and rollback.
Blue-green deployment maintains two production environments simultaneously. Blue is the current live version receiving 100% of traffic. Green is the new version deployed in parallel but initially receiving no traffic. You shift traffic gradually: 5% to green, monitor for ten minutes, then 25%, then 100%. If any alert fires during the gradual shift, you route all traffic back to blue instantly. The rollback takes seconds because blue is still running.
Define rollback trigger conditions before the deployment begins, not during an incident. Standard conditions: error rate exceeds 5% over a two-minute window; p99 latency exceeds 60 seconds; cost per request increases by more than 50%; any new exception type appears at a rate above 0.1%. Automate the rollback: a monitoring script watches these conditions and switches traffic back to blue without waiting for a human to be paged, assess the situation, and act.
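A sketch of such an automated watcher is shown below; the metric-reading and traffic-switching hooks are placeholders for your monitoring stack (for example, a Prometheus query) and your load balancer API, not a specific product interface.

```python
# Automated rollback watcher sketch for a blue-green traffic shift.
# read_metric and switch_traffic_to are placeholder callables supplied by the caller.
import time

ROLLBACK_CONDITIONS = [
    ("error_rate_2m", lambda v: v > 0.05),           # >5% errors over two minutes
    ("p99_latency_s", lambda v: v > 60),              # p99 above 60 seconds
    ("cost_per_request_change", lambda v: v > 0.5),   # cost up more than 50%
    ("new_exception_rate", lambda v: v > 0.001),      # new exception type above 0.1%
]


def watch_and_rollback(read_metric, switch_traffic_to, check_interval_s: int = 30) -> None:
    """Poll metrics during the gradual shift; route everything back to blue on any breach."""
    while True:
        for name, breached in ROLLBACK_CONDITIONS:
            value = read_metric(name)
            if breached(value):
                switch_traffic_to("blue")  # instant rollback: blue is still running
                print(f"rollback triggered: {name}={value}")
                return
        time.sleep(check_interval_s)
```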
NIST AI Risk Management Framework (AI RMF) Manage 4.1 requires post-deployment monitoring and the ability to respond when AI systems perform outside expected parameters. Automated rollback satisfies this requirement for the category of failures detectable through metrics. Human review procedures satisfy it for subtler quality degradation that metrics do not capture.
With safe deployment and rollback covered, the discussion turns to latency optimisation.
Profile before optimising. Run a latency profiling session to identify where time is actually spent before making any changes. The most common surprise is that application overhead (JSON serialisation, database queries, network round trips to tool APIs) dominates over LLM inference for simple tasks. Optimising the LLM call when the bottleneck is a slow database query wastes time.
For LLM-specific latency, the most impactful technique is streaming: instead of waiting for the complete response, stream tokens as they are generated and display partial results to the user. This eliminates perceived latency without changing actual generation speed. For agents with long, static system prompts, use prompt caching: Anthropic's API supports a cache_control parameter that caches the system prompt across calls for up to five minutes, reducing time-to-first-token (TTFT) significantly on cached calls.
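A sketch combining both techniques with the Anthropic Python SDK is shown below; the model identifier and prompt text are placeholders.

```python
# Streaming plus prompt caching with the Anthropic Python SDK (pip install anthropic).
# The model id and system prompt here are placeholders, not recommendations.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "You are a customer-support agent. ... (long, static instructions)"

with client.messages.stream(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Cache the static system prompt so repeated calls skip re-processing it
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is my order #1234?"}],
) as stream:
    for text in stream.text_stream:
        # Display partial output immediately to cut perceived latency
        print(text, end="", flush=True)
```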
For agents that make multiple independent tool calls, implement parallel execution. If the agent needs to call a weather API and a database query simultaneously, executing them concurrently rather than sequentially halves that step's latency. For agents with growing conversation history, implement a sliding context window or summarisation step to prevent context window exhaustion as described in the opening case study.
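A minimal sketch of parallel tool execution with asyncio; the two tool functions are simulated stand-ins for real API and database calls.

```python
# Running independent tool calls concurrently with asyncio.gather.
# get_weather and query_database are hypothetical stand-ins for real tools.
import asyncio


async def get_weather(city: str) -> str:
    await asyncio.sleep(1.0)  # simulated network latency of a weather API call
    return f"weather for {city}"


async def query_database(user_id: str) -> str:
    await asyncio.sleep(1.0)  # simulated database round trip
    return f"orders for {user_id}"


async def gather_tool_results() -> list[str]:
    # Both calls run concurrently, so this step takes ~1s instead of ~2s
    return await asyncio.gather(get_weather("Berlin"), query_database("user-1234"))


print(asyncio.run(gather_tool_results()))
```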
You deployed a new agent version (v1.1) three hours ago with a revised system prompt. Monitoring shows error rate has risen from 0.3% to 2.8% and p99 latency from 8s to 42s. Based on the rollback trigger conditions in this module, should an automatic rollback have been triggered?
After rolling back v1.1, you discover the new system prompt caused the agent to make three extra tool calls per request. Which optimisation technique would most reduce the per-request latency impact of these extra calls?
You want to test a new model (claude-sonnet-4-6 replacing claude-opus-4-6) on 5% of your traffic. Which approach should you use?
Why is it important to profile latency before making any optimisation changes?
Site Reliability Engineering, Chapter 6: Monitoring Distributed Systems
The source of the four golden signals framework adapted for agent monitoring in Section 21.3. Published by Google LLC, available at sre.google/sre-book.
NIST AI Risk Management Framework (AI RMF 1.0), January 2023
Manage 4.1: Post-deployment monitoring
Establishes the requirement for ongoing monitoring and response capability after AI deployment. Referenced in Section 21.5 as the regulatory basis for automated rollback procedures.
Docker development best practices
docs.docker.com/develop/dev-best-practices
Official Docker guidance for multi-stage builds, layer caching, and non-root user configuration referenced in Section 21.2.
Anthropic Prompt Caching documentation
docs.anthropic.com/en/docs/build-with-claude/prompt-caching
Official documentation for the cache_control parameter. Referenced in Section 21.6 as the technique for reducing time-to-first-token on repeated system prompt content.
FastAPI documentation
fastapi.tiangolo.com
The async Python web framework recommended in Section 21.2 for exposing agents as HTTP services with automatic OpenAPI documentation.
Module 21 of 25 · Advanced Mastery