CPD timing for this level
Advanced time breakdown
This is the first pass of a defensible timing model for this level, based on what is actually on the page: reading, labs, checkpoints, and reflection.
What changes at this level
Level expectations
I want each level to feel independent, but also clearly deeper than the last. This panel makes the jump explicit so the value is obvious.
Agentic systems, safety, monitoring, and governance aligned to NIST AI RMF and ISO/IEC 23894.
Not endorsed by a certification body. This is my marking standard for consistency and CPD evidence.
- A production system sketch: boundaries, permissions, fallbacks, and what is logged for audit and debugging.
- A governance and risk register entry mapped to NIST AI RMF categories, with owners and evidence.
- A one page incident runbook for AI failure: triage, containment, comms, and post-incident fixes.
AI Advanced
CPD tracking
Fixed hours for this level: 12. Timed assessment time is included once on pass.
This level focuses on operating AI like a real system: boundaries, failure handling, and governance you can defend. It aligns well with the kind of thinking expected in:
- NIST AI RMF 1.0: lifecycle risk management and organisational controls.
- ISO/IEC 23894: risk management for AI, including monitoring and change control.
- ISO/IEC 27001 oriented governance: evidence, ownership, and audit-ready change.
- Cloud architecture certifications (AWS, Azure, Google Cloud): production constraints, reliability, and cost trade-offs.
AI systems and model architectures
A model is a component that maps inputs to outputs. An AI system is the full product around it: interfaces, data flow, guardrails, monitoring, and the operational process that keeps outputs useful. At scale, system design usually dominates the outcome. The same model can look brilliant or useless depending on how it is integrated.
Models rarely operate alone because real inputs are messy and real decisions have constraints. You need routing, caching, authentication, permissions, and careful handling of failures. You also need data sources the system can trust. Without that, the model becomes a confident narrator of whatever it last saw.
The moment you put a model behind an API, you are doing inference.
In production, inference has strict budgets. You have cost budgets, latency budgets, and reliability budgets. Those budgets shape architecture more than a training run does.
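To make that concrete, here is a minimal sketch of a per-request budget, with made-up numbers and a hypothetical pipeline passed in by the caller. The point is that the budget is explicit and enforced, not implied.

```python
import time

# Illustrative budgets. Real numbers come from product and cost constraints,
# and they shape the architecture: what you cache, what you skip, what you cut.
LATENCY_BUDGET_S = 1.5   # ceiling for the whole request, not just the model call
COST_BUDGET_USD = 0.01   # maximum spend per request across every step

def run_within_budget(request, steps):
    """Run a list of (name, cost_usd, fn) steps, stopping when a budget is hit.

    `steps` is a hypothetical pipeline description supplied by the caller;
    each `fn` takes and returns the evolving request state.
    """
    started = time.monotonic()
    spent = 0.0
    state = request
    for name, cost_usd, fn in steps:
        if spent + cost_usd > COST_BUDGET_USD:
            return {"state": state, "stopped_at": name, "reason": "cost budget"}
        if time.monotonic() - started > LATENCY_BUDGET_S:
            return {"state": state, "stopped_at": name, "reason": "latency budget"}
        state = fn(state)
        spent += cost_usd
    return {"state": state, "stopped_at": None, "reason": None}
```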
One pattern is batch inference. You run predictions on a schedule, store results, and serve them fast later. This works well for things like nightly fraud scoring, content tagging, or pricing suggestions. The trade-off is freshness. If the world changes at noon, your results might not catch up until tomorrow.
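A minimal sketch of the batch pattern, assuming a hypothetical `score` function and a plain dict as the result store. A real system would use a scheduler and a proper database, but the shape is the same: score on a schedule, serve by lookup.

```python
import datetime

def nightly_batch_scoring(records, score, result_store: dict) -> None:
    """Score everything on a schedule and store results for fast serving later.

    `records` is an iterable of (record_id, features); `score` is a hypothetical
    model call. Freshness is bounded by the schedule: anything that changes
    after this run is not reflected until the next run.
    """
    run_at = datetime.datetime.now(datetime.timezone.utc).isoformat()
    for record_id, features in records:
        result_store[record_id] = {"score": score(features), "scored_at": run_at}

def serve_score(record_id, result_store: dict):
    """Serving is a cheap lookup; the model is not called on the request path."""
    return result_store.get(record_id)  # may be stale or missing
```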
Another pattern is real time inference. Requests hit an API, the system calls the model, and the result returns immediately. This is common in ranking, moderation, and interactive assistants. Here latency matters.
Latency is not just performance vanity. It changes user behaviour and it changes system load. A slow model can create backlogs, timeouts, and cascading failures.
A third pattern is retrieval augmented systems. You keep a data store of documents, records, or snippets, retrieve relevant pieces at request time, then feed them into the model. This is often called retrieval augmented generation.
The architecture shifts the problem from "make the model smarter" to "make the data pipeline reliable". Retrieval quality, permissions, and content freshness become the main levers.
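A minimal sketch of that flow, with hypothetical `retrieve` and `generate` stand-ins for the index and the model call. Notice that permission filtering happens in retrieval, before anything reaches the model.

```python
def answer_with_retrieval(question: str, user_id: str, retrieve, generate) -> str:
    """Retrieve relevant, permitted context at request time, then generate.

    `retrieve(question, user_id=..., k=...)` and `generate(prompt)` are
    hypothetical stand-ins for a search index and a model call.
    """
    passages = retrieve(question, user_id=user_id, k=5)
    if not passages:
        return "I could not find anything relevant in the sources you can access."
    context = "\n\n".join(p["text"] for p in passages)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```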
At scale, orchestration and data flow matter more than raw accuracy.
If you cannot trace what data was used, what model version ran, and why a decision happened, you cannot operate the system safely. Good architecture makes failures visible, limits blast radius, and makes improvements repeatable.
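One way to make that traceability real is a structured decision record written on every request. This is a sketch with illustrative field names, not a standard schema.

```python
import datetime
import json
import uuid

def decision_record(model_version: str, input_ref: str, retrieved_ids: list[str],
                    output_summary: str, latency_ms: float) -> str:
    """Build one audit-friendly log line per decision.

    The goal is to answer later: which model version ran, what data it saw,
    and what it produced. Field names here are illustrative.
    """
    return json.dumps({
        "decision_id": str(uuid.uuid4()),
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_version": model_version,
        "input_ref": input_ref,          # a reference, not raw content, to limit data exposure
        "retrieved_ids": retrieved_ids,  # which documents fed the answer
        "output_summary": output_summary,
        "latency_ms": latency_ms,
    })
```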
Typical production AI system
Separation of concerns keeps systems operable.
System boundaries (the part people skip, then regret)
A boundary is where you decide what the system is allowed to do. Boundaries are not only technical. They are behavioural. For example: “This assistant can explain policy, but it cannot approve refunds.” That is a boundary. It protects the business and it protects users.
My opinion: if you cannot state the boundary in plain English, you probably have not built it. You have hoped for it. Hope is not a control strategy, even when it is written in a product requirements document.
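For what it is worth, the plain-English boundary above translates almost directly into a control. A sketch, with a hypothetical action allow-list and handler map:

```python
# The boundary in plain English: the assistant can explain policy,
# but it cannot approve refunds. The allow-list is the control.
ALLOWED_ACTIONS = {"explain_policy", "look_up_order", "draft_reply"}

def execute_action(action: str, payload: dict, handlers: dict):
    """Refuse anything outside the stated boundary and record the refusal.

    `handlers` maps allowed action names to hypothetical implementation functions.
    """
    if action not in ALLOWED_ACTIONS:
        # Refusals are part of the audit trail, not silent drops.
        return {"status": "refused", "reason": f"'{action}' is outside the system boundary"}
    return {"status": "ok", "result": handlers[action](payload)}
```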
Check your understanding of AI system architecture
What is the difference between a model and an AI system?
Why do models rarely operate alone in production?
What does inference mean?
Why is latency a design constraint?
When is batch inference a good fit?
What is a key trade-off of batch inference?
What does retrieval augmented generation add to a system?
Why does orchestration exist?
Scenario: Your model is accurate, but you cannot explain one harmful output to a regulator. What part of the system failed?
Scaling, cost and reliability in AI systems
Scaling is not a single knob. It is a set of constraints you discover when traffic hits. The first surprise is that models are only one part of the system. Data fetches, feature generation, retrieval calls, and post processing often become the bottleneck before the model does.
Stateless inference services are easier to deploy and easier to recover. You can restart a bad instance without losing user state. You can spread load across instances without sticky sessions. If you need state, push it to a data store with clear ownership and clear timeouts.
Two levers help before you add more compute: caching and batching. Caching is serving repeated requests from memory or a fast store. Batching is grouping multiple requests into one model call so you use hardware more efficiently. Both reduce cost, but both can change behaviour. Caching can serve stale answers. Batching can increase latency for the first request in the batch. These are product decisions, not just infrastructure choices.
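A minimal sketch of both levers, assuming nothing beyond the standard library. The TTL is where the product decision about staleness lives, and the batch size is where the latency trade-off lives.

```python
import time

class TTLCache:
    """A tiny cache with a freshness window. Staleness is a product decision:
    the TTL is the longest you are willing to serve an old answer."""

    def __init__(self, ttl_s: float = 300.0):
        self.ttl_s = ttl_s
        self._store: dict = {}

    def get(self, key):
        hit = self._store.get(key)
        if hit is None:
            return None
        value, stored_at = hit
        if time.monotonic() - stored_at > self.ttl_s:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())


def batched(requests: list, batch_size: int):
    """Group requests into fixed-size batches for one model call each.
    Larger batches use hardware better but add wait time for the first request."""
    for i in range(0, len(requests), batch_size):
        yield requests[i:i + batch_size]
```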
Scaling data can hurt more than scaling models. Retrieval adds network calls, index lookups, and permission checks. Feature pipelines add joins, transformations, and schema drift. If your data path is slow, scaling the model will not fix it. You are just making the wrong part faster.
Cost shows up as cost per request, and the hidden multiplier is how many steps your system takes per request. One model call is expensive. A model call plus retrieval, plus reranking, plus embeddings, plus retries can turn a single user action into a small workflow. The user sees one answer. You pay for the whole graph.
Embeddings and retrieval have their own cost curve. Embeddings can be expensive to compute and expensive to store at scale. Retrieval can be cheap per query and still expensive overall because it runs for every request. The easiest way to lose money is to add a pipeline step that runs always, even when it only helps sometimes.
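A toy example of counting the whole graph, with made-up unit costs. The habit matters more than the numbers: one visible answer, five billed steps.

```python
# Illustrative unit costs per step, in USD. Real numbers come from your bills.
STEP_COST_USD = {
    "embed_query": 0.0001,
    "retrieve": 0.0002,
    "rerank": 0.0005,
    "model_call": 0.0040,
    "retry_model_call": 0.0040,
}

def cost_per_request(steps_taken: list[str]) -> float:
    """One user action pays for every step it triggered."""
    return sum(STEP_COST_USD[s] for s in steps_taken)

# One visible answer, five billed steps.
print(cost_per_request(["embed_query", "retrieve", "rerank", "model_call", "retry_model_call"]))
```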
Overprovisioning is the second way to lose money. Keeping GPUs warm for peak traffic feels safe, but it can make costs explode if the peak is rare. Autoscaling helps, but cold starts can hurt latency. This is why you trade off accuracy, latency, and cost as a trio. A slower model might be more accurate. A faster model might be cheaper. A cached answer might be good enough most of the time.
Reliability is where the real lessons are. Models fail in boring ways. They time out. They return empty output. They return something plausible and wrong. They get slower under load. They hit rate limits from an upstream provider. Partial failures are normal in distributed systems, so assume they will happen.
Retries are useful, but they are dangerous under load. If every request times out and you retry immediately, you double traffic into a system that is already struggling. This is why you need a retry policy.
A good retry policy uses backoff and caps. It retries only when it makes sense, like transient network issues, not when a model is overloaded. It also needs per request budgets, so retries do not turn a slow failure into an outage.
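A sketch of such a policy, assuming a hypothetical `call` wrapper and a custom `TransientError` to mark the failures worth retrying. Backoff, jitter, an attempt cap, and a per-request deadline are the moving parts.

```python
import random
import time

class TransientError(Exception):
    """Failures worth retrying, such as a dropped connection. Overload is not one of them."""

def call_with_retries(call, max_attempts: int = 3, base_delay_s: float = 0.2,
                      deadline_s: float = 2.0):
    """Retry with exponential backoff and jitter, capped by attempts and a
    per-request deadline so retries cannot turn a slow failure into an outage.

    `call` is a hypothetical zero-argument function wrapping the real request.
    """
    started = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = base_delay_s * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            if time.monotonic() - started + delay > deadline_s:
                raise  # out of budget: fail now rather than pile more load on
            time.sleep(delay)
```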
The practical goal is graceful degradation.
That can mean switching to a smaller model, skipping retrieval, returning a cached answer, or showing a clear fallback message. The key is that failure is expected, not exceptional. Your architecture should make the fallback path explicit and testable.
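A minimal sketch of an explicit fallback chain, with hypothetical `primary` and `smaller` model calls and any cache object with a `get` method. Each rung is a separate, testable path.

```python
def answer_with_fallbacks(question: str, primary, smaller, cache):
    """Try the preferred path, then cheaper paths, then an honest message.

    `primary` and `smaller` are hypothetical model calls; `cache` is any
    object with a `.get(key)` method. Failure is expected, so every rung
    is explicit rather than improvised at 2am.
    """
    for attempt in (
        lambda: primary(question),
        lambda: smaller(question),    # smaller model: lower quality, still useful
        lambda: cache.get(question),  # possibly stale, often good enough
    ):
        try:
            result = attempt()
            if result:
                return result
        except Exception:
            continue  # expected failure: move to the next rung
    return "Sorry, we cannot answer right now. Your request has been logged."
```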
What I audit first when someone says “we are production ready”
I do not start with the model card. I start with the boring bits: rate limiting, authentication, logging, and rollbacks. If those are weak, the rest is theatre.
Here is the quick list I use:
- Can we turn it off?
- Can we fall back?
- Can we explain what happened?
- Can we prove what data it used?
- Can we prove who accessed what?
If the answer to any of these is no, it is not production ready. It might be impressive, but it is not safe.
Scaling and failure in AI systems
Assume load and failure. Design the cache and fallback paths early.
Check your understanding of scaling and reliability
What is horizontal scaling?
Why do stateless services scale more easily?
Name two levers that can reduce cost before adding more compute.
Why can scaling data hurt more than scaling models?
What is a common hidden cost driver in AI systems?
Scenario: An upstream provider rate limits you and latency spikes. What is a sensible reliability response?
Why can retries be dangerous under load?
What should a retry policy define?
What is graceful degradation?
Why should failure be treated as expected in architecture?
Evaluation, monitoring and governance in production AI
Evaluation in production is not a single score. It is a series of checks that answer one question: does the system still help the business without creating unacceptable harm? That requires measurement before launch, after launch, and while the world changes.
Offline evaluation is what you do on datasets you control. It is useful for comparing versions and catching obvious regressions. It also has blind spots. Offline data is usually cleaner than reality, and it rarely contains the full cost of mistakes. Online evaluation is what you do in the live system, where users, latency, and edge cases are real. A model that looks strong offline can still fail online if it changes user behaviour or breaks workflows.
Accuracy alone is not enough because the cost of mistakes is not symmetric. A fraud system that misses fraud can be catastrophic. A moderation system that over blocks can silence legitimate users. In these cases you reach for precision and recall, then connect them to real outcomes like chargebacks, review workload, or user churn.
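In plain terms, both come straight from counts. A small sketch with made-up numbers for the fraud example above:

```python
def precision_recall(true_positive: int, false_positive: int, false_negative: int):
    """Precision: of everything we flagged, how much was right.
    Recall: of everything we should have flagged, how much we caught."""
    flagged = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / flagged if flagged else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall

# A fraud model that flags 80 real cases, raises 20 false alarms, and misses 40:
print(precision_recall(80, 20, 40))  # (0.8, 0.666...) -> review workload vs missed chargebacks
```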
Even strong metrics are not stable. You need to monitor performance over time because production data is a moving target. The metric you track depends on the system, but the habit is the same: measure, investigate, and learn. If you cannot explain why a metric moved, you cannot fix it.
Monitoring is the boring work that saves you. Start with inputs. Are you seeing new categories, missing fields, unusual ranges, or sudden format changes? Then monitor outputs. Are scores shifting, are confidence values drifting upward, are certain groups being flagged more often? Finally, monitor system health: latency, error rate, and rate limiting. If the system is slow, it will change user behaviour and it will change your data.
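A sketch of the cheapest useful input checks, with illustrative field names and thresholds. A real pipeline tracks these rates over time rather than looking at one batch.

```python
def input_health(rows: list[dict], field: str, expected_min: float, expected_max: float) -> dict:
    """Cheap input checks for one numeric field: missingness and out-of-range rates."""
    total = len(rows) or 1
    values = [r.get(field) for r in rows]
    missing = sum(v is None for v in values)
    out_of_range = sum(v is not None and not (expected_min <= v <= expected_max) for v in values)
    return {
        "field": field,
        "missing_rate": missing / total,
        "out_of_range_rate": out_of_range / total,
    }

# Example: a transaction amount that should sit between 0 and 10,000.
print(input_health([{"amount": 12.5}, {"amount": None}, {"amount": 250000}], "amount", 0, 10000))
```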
Alerting is where good teams become noisy teams. Too many alerts create alert fatigue, and then real problems are ignored. You want alerts that are actionable, tied to clear owners, and paired with a playbook. False positives in monitoring are not harmless. They burn trust and time.
Drift is the quiet killer. Data drift is when inputs change. Concept drift is when the meaning of the target changes, even if inputs look similar. A credit scoring model can see stable features while repayment behaviour changes during a downturn. A moderation model can see similar text while norms and tactics change.
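One common way to quantify data drift on an input is the Population Stability Index over binned proportions. The thresholds in the comment are conventions, not guarantees, and this only covers data drift; concept drift needs labels or outcomes to detect.

```python
import math

def population_stability_index(baseline: list[float], current: list[float]) -> float:
    """Population Stability Index between two sets of bin proportions.

    `baseline` and `current` are proportions over the same bins (each sums to 1).
    A common rule of thumb: under 0.1 is stable, 0.1-0.25 is worth a look,
    above 0.25 suggests meaningful drift.
    """
    eps = 1e-6  # avoid log(0) for empty bins
    return sum(
        (c - b) * math.log((c + eps) / (b + eps))
        for b, c in zip(baseline, current)
    )

# Example: the share of traffic in the top bin doubles between periods.
print(population_stability_index([0.2, 0.3, 0.5], [0.4, 0.3, 0.3]))
```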
Retraining schedules matter because drift does not wait for your roadmap. Some systems need periodic retraining. Others need trigger based retraining when drift crosses a threshold. Either way, you should treat retraining like a release, with the same discipline: evaluation, rollout, rollback, and audit trails.
This is where governance stops being paperwork and becomes operations. Someone must own the model and the system around it. Decisions about thresholds, fallbacks, and acceptable harm are product and risk decisions, not just ML decisions.
Good governance includes documentation and traceability. You want to know which data, which features, which model version, and which thresholds produced an outcome. Human oversight is part of the design. It needs authority, not just review. And you need a shutdown plan. If the system is causing harm or you cannot understand its behaviour, you stop it, switch to a safe fallback, and investigate.
Governance in plain English: good, bad, best practice
CPD evidence prompt (copy friendly)
Use this as a clean CPD entry. Keep it short and specific. If you can attach an artefact, do it.
AI system lifecycle in production
Governance surrounds the lifecycle, from data to retirement.
Check your understanding of monitoring and governance
What is the difference between offline and online evaluation?
Why is accuracy alone not enough in many systems?
In plain terms, what does precision measure?
In plain terms, what does recall measure?
Name three monitoring areas in production AI.
What is alert fatigue and why is it dangerous?
Scenario: A feature suddenly becomes mostly null after a backend change. What should you do first?
What is data drift?
What is concept drift?
Why do retraining schedules matter?
When should a system be shut down?
