CPD timing for this level

Advanced time breakdown

This is the first pass of a defensible timing model for this level, based on what is actually on the page: reading, labs, checkpoints, and reflection.

  • Reading: 26m (3,845 words · base 20m × 1.3)
  • Labs: 45m (3 activities × 15m)
  • Checkpoints: 15m (3 blocks × 5m)
  • Reflection: 32m (4 modules × 8m)

Estimated guided time: 1h 58m, based on page content and disclosed assumptions.

Claimed level hours: 12h. The claim includes reattempts, deeper practice, and capstone work.
The claimed hours are higher than the current on-page estimate by about 10h. That gap is where I will add more guided practice and assessment-grade work so the hours are earned, not declared.
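
For transparency, here is a minimal sketch of the timing arithmetic above. The word count, per-activity minutes, and the 1.3 reading multiplier come from the disclosed assumptions; the variable names are mine.

    # Guided-time estimate from the disclosed assumptions (illustrative names).
    WORDS = 3845
    BASE_READING_MIN = 20        # base reading time for this word count
    READING_FACTOR = 1.3         # slower, note-taking pace

    reading = round(BASE_READING_MIN * READING_FACTOR)  # 26 minutes
    labs = 3 * 15                                       # 45 minutes
    checkpoints = 3 * 5                                 # 15 minutes
    reflection = 4 * 8                                  # 32 minutes

    total = reading + labs + checkpoints + reflection   # 118 minutes
    print(f"Estimated guided time: {total // 60}h {total % 60}m")  # 1h 58m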

What changes at this level

Level expectations

I want each level to feel independent, but also clearly deeper than the last. This panel makes the jump explicit so the value is obvious.

Anchor standards (course wide)
NIST AI Risk Management Framework (AI RMF 1.0) · ISO/IEC 23894 (AI risk management)
Assessment intent
Modern systems

Agentic systems, safety, monitoring, and governance aligned to NIST AI RMF and ISO 23894.

Assessment style
Format: mixed
Pass standard: coming next

Not endorsed by a certification body. This is my marking standard for consistency and CPD evidence.

Evidence you can save (CPD friendly)
  • A production system sketch: boundaries, permissions, fallbacks, and what is logged for audit and debugging.
  • A governance and risk register entry mapped to NIST AI RMF categories, with owners and evidence.
  • A one page incident runbook for AI failure: triage, containment, comms, and post-incident fixes.

AI Advanced


CPD tracking

Fixed hours for this level: 12. Timed assessment time is included once on pass.

CPD and certification alignment (guidance, not endorsed):

This level focuses on operating AI like a real system: boundaries, failure handling, and governance you can defend. It aligns well with the kind of thinking expected in:

  • NIST AI RMF 1.0: lifecycle risk management and organisational controls.
  • ISO/IEC 23894: risk management for AI, including monitoring and change control.
  • ISO/IEC 27001 oriented governance: evidence, ownership, and audit-ready change.
  • Cloud architecture certifications (AWS, Azure, Google Cloud): production constraints, reliability, and cost trade-offs.
How I want you to think at Advanced level
If you can explain the failure path, the fallback path, and the evidence trail, you are in the right place.
Good practice
Design with budgets: cost, latency, and reliability. Then decide which knob you can turn without harming users.
Bad practice
Best practice

AI systems and model architectures

Concept block
System shapes
Architecture is choosing boundaries, data paths, and safe defaults, not only choosing a model.
Assumptions
Boundaries are explicit
Safety is designed in
Failure modes
Boundary confusion
No operational plan

A model is a component that maps inputs to outputs. An AI system is the full product around it: interfaces, data flow, guardrails, monitoring, and the operational process that keeps outputs useful. At scale, system design usually dominates the outcome. The same model can look brilliant or useless depending on how it is integrated.

Models rarely operate alone because real inputs are messy and real decisions have constraints. You need routing, caching, authentication, permissions, and careful handling of failures. You also need data sources the system can trust. Without that, the model becomes a confident narrator of whatever it last saw.

The most common Advanced mistake
This is the one that causes expensive incidents with very confident postmortems.
Common misconception
“If we pick a better model, the product becomes reliable.” Reliability is a system property. You earn it through boundaries, rate limits, fallbacks, and observability.
Good practice
Best practice

The moment you put a model behind an API, you are doing inference.

In production, inference has strict budgets. You have cost budgets, latency budgets, and reliability budgets. Those budgets shape architecture more than a training run does.

One pattern is batch inference. You run predictions on a schedule, store results, and serve them fast later. This works well for things like nightly fraud scoring, content tagging, or pricing suggestions. The trade off is freshness. If the world changes at noon, your results might not catch up until tomorrow.

Another pattern is real time inference. Requests hit an API, the system calls the model, and the result returns immediately. This is common in ranking, moderation, and interactive assistants. Here latency matters.

Latency is not just performance vanity. It changes user behaviour and it changes system load. A slow model can create backlogs, timeouts, and cascading failures.
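
To make the budget idea concrete, here is a minimal sketch of a real-time inference call with an explicit latency budget. The `call_model` function, the pool size, and the two-second budget are illustrative assumptions, not a specific provider's API.

    import concurrent.futures

    LATENCY_BUDGET_S = 2.0  # illustrative per-request latency budget
    _pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def call_model(prompt: str) -> str:
        # Placeholder for the real model call (HTTP request, SDK call, etc.).
        return "model output for: " + prompt

    def infer_with_budget(prompt: str) -> str:
        # Enforce a hard deadline so one slow call cannot stall the request
        # path and create backlogs, timeouts, and cascading failures.
        future = _pool.submit(call_model, prompt)
        try:
            return future.result(timeout=LATENCY_BUDGET_S)
        except concurrent.futures.TimeoutError:
            # Budget exceeded: answer with a safe fallback instead of waiting.
            return "Sorry, that took too long. Please try again."

    print(infer_with_budget("Summarise this ticket"))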

A third pattern is retrieval augmented systems. You keep a data store of documents, records, or snippets, retrieve relevant pieces at request time, then feed them into the model. This is often called retrieval augmented generation.

The architecture shifts the problem from "make the model smarter" to "make the data pipeline reliable". Retrieval quality, permissions, and content freshness become the main levers.
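
A minimal retrieval augmented sketch, assuming a `search` function over your own document store and a generic `generate` call. Both are placeholders rather than a specific library; the point is where permissions and grounding live.

    def search(query: str, user_id: str, k: int = 3) -> list[str]:
        # Placeholder: query an index the user is permitted to see.
        # Permission filtering belongs here, not after generation.
        return ["doc snippet 1", "doc snippet 2", "doc snippet 3"][:k]

    def generate(prompt: str) -> str:
        # Placeholder for the model call.
        return "answer grounded in the provided context"

    def answer(question: str, user_id: str) -> str:
        context = search(question, user_id)
        prompt = (
            "Answer using only the context below. "
            "If the context does not cover it, say so.\n\n"
            "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
        )
        return generate(prompt)

    print(answer("What is the refund window?", user_id="u-123"))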

At scale, orchestration and data flow matter more than raw accuracy.

If you cannot trace what data was used, what model version ran, and why a decision happened, you cannot operate the system safely. Good architecture makes failures visible, limits blast radius, and makes improvements repeatable.

Typical production AI system

Separation of concerns keeps systems operable.

  • User input -> API layer (auth, routing, validation)
  • API layer -> Model inference (bounded compute)
  • API layer <-> Data store (retrieval, features, permissions)
  • Monitoring and logging (latency, errors, quality signals)

System boundaries (the part people skip, then regret)

A boundary is where you decide what the system is allowed to do. Boundaries are not only technical. They are behavioural. For example: “This assistant can explain policy, but it cannot approve refunds.” That is a boundary. It protects the business and it protects users.

My opinion: if you cannot state the boundary in plain English, you probably have not built it. You have hoped for it. Hope is not a control strategy, even when it is written in a product requirements document.
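
One way to make a boundary explicit rather than hoped for is to state it as an allowlist of actions, with writes treated separately. This is a sketch only; the action names are invented for illustration.

    # Illustrative boundary: what the assistant may do, stated in code.
    READ_ACTIONS = {"explain_policy", "look_up_order_status"}
    WRITE_ACTIONS: set[str] = set()  # e.g. "approve_refund" is deliberately absent

    def is_allowed(action: str) -> bool:
        # Any write action would need its own authentication, logging,
        # and review path before it ever appears in WRITE_ACTIONS.
        return action in READ_ACTIONS or action in WRITE_ACTIONS

    assert is_allowed("explain_policy")
    assert not is_allowed("approve_refund")  # the boundary from the example above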

Boundaries: good, bad, best practice
Good practice
Separate read actions from write actions. If the system can change state, treat it like an admin path. Log it and protect it.
Bad practice
Best practice

Check your understanding of AI system architecture

  • What is the difference between a model and an AI system?
  • Why do models rarely operate alone in production?
  • What does inference mean?
  • Why is latency a design constraint?
  • When is batch inference a good fit?
  • What is a key trade off of batch inference?
  • What does retrieval augmented generation add to a system?
  • Why does orchestration exist?
  • Scenario: Your model is accurate, but you cannot explain one harmful output to a regulator. What part of the system failed?

Scaling, cost and reliability in AI systems

Concept block
Scale is a trade-off
Scaling increases capability and cost. Reliability comes from limits and budgets.
Assumptions
Budgets are set
Failure is expected
Failure modes
Unbounded usage
Capacity surprises

Scaling is not a single knob. It is a set of constraints you discover when traffic hits. The first surprise is that models are only one part of the system. Data fetches, feature generation, retrieval calls, and post processing often become the bottleneck before the model does.

When people say scale, they usually mean capacity. Vertical scaling is making a single instance bigger, like more CPU, more GPU, or more memory. In AI systems, vertical scaling buys you headroom per request. Horizontal scaling is adding more instances behind a load balancer; it buys you throughput and fault tolerance. You usually need both, but you should start by making the service stateless so horizontal scaling is possible.

Stateless inference services are easier to deploy and easier to recover. You can restart a bad instance without losing user state. You can spread load across instances without sticky sessions. If you need state, push it to a data store with clear ownership and clear timeouts.

Two levers help before you add more compute: caching and batching. Caching is serving repeated requests from memory or a fast store. Batching is grouping multiple requests into one model call so you use hardware more efficiently. Both reduce cost, but both can change behaviour. Caching can serve stale answers. Batching can increase latency for the first request in the batch. These are product decisions, not just infrastructure choices.
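
Here is a minimal caching sketch with a freshness limit, to show why caching is a product decision: the TTL decides how stale an answer you are willing to serve. The names and the five-minute value are illustrative.

    import time

    CACHE_TTL_S = 300  # product decision: answers may be up to 5 minutes stale
    _cache: dict[str, tuple[float, str]] = {}

    def call_model(prompt: str) -> str:
        return "fresh model output"  # placeholder for the real call

    def cached_infer(prompt: str) -> str:
        now = time.time()
        hit = _cache.get(prompt)
        if hit and now - hit[0] < CACHE_TTL_S:
            return hit[1]            # repeated request served from memory
        result = call_model(prompt)
        _cache[prompt] = (now, result)
        return result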

Scaling data can hurt more than scaling models. Retrieval adds network calls, index lookups, and permission checks. Feature pipelines add joins, transformations, and schema drift. If your data path is slow, scaling the model will not fix it. You are just making the wrong part faster.

Cost shows up as cost per request, and the hidden multiplier is how many steps your system takes per request. One model call is expensive. A model call plus retrieval, plus reranking, plus embeddings, plus retries can turn a single user action into a small workflow. The user sees one answer. You pay for the whole graph.

Embeddings and retrieval have their own cost curve. Embeddings can be expensive to compute and expensive to store at scale. Retrieval can be cheap per query and still expensive overall because it runs for every request. The easiest way to lose money is to add a pipeline step that runs always, even when it only helps sometimes.
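
A quick back-of-the-envelope sketch of paying for the whole graph. Every number here is made up purely to show the shape of the calculation; only the structure matters.

    # Hypothetical per-step costs for one user action (illustrative only).
    steps = {
        "embedding": 0.0001,
        "retrieval": 0.0002,
        "rerank": 0.0005,
        "model_call": 0.0040,
        "retry_overhead": 0.0008,  # average retry cost per request
    }

    cost_per_request = sum(steps.values())
    requests_per_day = 200_000     # also hypothetical

    print(f"Cost per request: ${cost_per_request:.4f}")
    print(f"Cost per day:     ${cost_per_request * requests_per_day:,.2f}")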

Overprovisioning is the second way to lose money. Keeping GPUs warm for peak traffic feels safe, but it can make costs explode if the peak is rare. Autoscaling helps, but cold starts can hurt latency. This is why you trade off accuracy, latency, and cost as a trio. A slower model might be more accurate. A faster model might be cheaper. A cached answer might be good enough most of the time.

Reliability is where the real lessons are. Models fail in boring ways. They time out. They return empty output. They return something plausible and wrong. They get slower under load. They hit rate limits from an upstream provider. Partial failures are normal in distributed systems, so assume they will happen.

Retries are useful, but they are dangerous under load. If every request times out and you retry immediately, you double traffic into a system that is already struggling. This is why you need a retry policy.

A good retry policy uses backoff and caps. It retries only when it makes sense, like transient network issues, not when a model is overloaded. It also needs per request budgets, so retries do not turn a slow failure into an outage.
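
A minimal retry sketch with exponential backoff, jitter, a retry cap, and a per-request time budget. The `TransientError` class and `call_model` are placeholders you would replace with your client's real exception types and call.

    import random
    import time

    MAX_RETRIES = 3
    REQUEST_BUDGET_S = 5.0  # total time allowed, including retries

    class TransientError(Exception):
        """Placeholder for errors worth retrying, e.g. a dropped connection."""

    def call_model(prompt: str) -> str:
        return "output"  # placeholder for the real call

    def infer_with_retries(prompt: str) -> str:
        deadline = time.monotonic() + REQUEST_BUDGET_S
        for attempt in range(MAX_RETRIES + 1):
            try:
                return call_model(prompt)
            except TransientError:
                # Exponential backoff with jitter, capped by the request budget,
                # so retries do not double traffic into a struggling system.
                wait = min(2 ** attempt * 0.2 + random.uniform(0, 0.1),
                           deadline - time.monotonic())
                if attempt == MAX_RETRIES or wait <= 0:
                    raise
                time.sleep(wait)
        raise TransientError("unreachable")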

The practical goal is graceful degradation.

That can mean switching to a smaller model, skipping retrieval, returning a cached answer, or showing a clear fallback message. The key is that failure is expected, not exceptional. Your architecture should make the fallback path explicit and testable.
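
A sketch of an explicit fallback chain. The three functions are placeholders for whatever your primary model, smaller model, and cache actually are; the ordering is the design decision you test.

    def call_primary_model(prompt: str) -> str:
        raise TimeoutError("placeholder: pretend the big model timed out")

    def call_small_model(prompt: str) -> str:
        return "smaller model answer"  # placeholder

    def get_cached_answer(prompt: str) -> str | None:
        return None                    # placeholder: no cached answer today

    SAFE_MESSAGE = "We cannot answer that right now. A human will follow up."

    def answer_with_fallbacks(prompt: str) -> str:
        # Failure is expected, not exceptional: the fallback order is explicit
        # and each step can be tested on its own.
        for step in (call_primary_model, call_small_model):
            try:
                return step(prompt)
            except (TimeoutError, ConnectionError):
                continue
        cached = get_cached_answer(prompt)
        return cached if cached is not None else SAFE_MESSAGE

    print(answer_with_fallbacks("What is our refund policy?"))  # smaller model answer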

What I audit first when someone says “we are production ready”

I do not start with the model card. I start with the boring bits: rate limiting, authentication, logging, and rollbacks. If those are weak, the rest is theatre.

Here is the quick list I use: Can we turn it off? Can we fall back? Can we explain what happened? Can we prove what data it used? Can we prove who accessed what? If the answer is no, it is not production ready. It might be impressive, but it is not safe.

Scaling and failure in AI systems

Assume load and failure. Design the cache and fallback paths early.

Load increases: more requests and longer queues
Cache layer absorbs traffic: repeats served fast
Model service scales out: more instances handle throughput
Failure happens: timeouts, rate limits, partial dependency loss
Fallback path activates: smaller model, cached answer, or safe message

Check your understanding of scaling and reliability

  • What is horizontal scaling?
  • Why do stateless services scale more easily?
  • Name two levers that can reduce cost before adding more compute.
  • Why can scaling data hurt more than scaling models?
  • What is a common hidden cost driver in AI systems?
  • Scenario: An upstream provider rate limits you and latency spikes. What is a sensible reliability response?
  • Why can retries be dangerous under load?
  • What should a retry policy define?
  • What is graceful degradation?
  • Why should failure be treated as expected in architecture?

Evaluation, monitoring and governance in production AI

Concept block
Production evidence loop
In production you need evidence: monitor, respond, and improve with clear ownership.
Assumptions
Monitoring is meaningful
Governance is enforced
Failure modes
Metric gaming
No incident learning

Evaluation in production is not a single score. It is a series of checks that answer one question: does the system still help the business without creating unacceptable harm. That requires measurement before launch, after launch, and while the world changes.

Offline evaluation is what you do on datasets you control. It is useful for comparing versions and catching obvious regressions. It also has blind spots. Offline data is usually cleaner than reality, and it rarely contains the full cost of mistakes. Online evaluation is what you do in the live system, where users, latency, and edge cases are real. A model that looks strong offline can still fail online if it changes user behaviour or breaks workflows.

Accuracy alone is not enough because the cost of mistakes is not symmetric. A fraud system that misses fraud can be catastrophic. A moderation system that over blocks can silence legitimate users. In these cases you reach for precision and recall, then connect them to real outcomes like chargebacks, review workload, or user churn.
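
For reference, a minimal sketch of precision and recall from raw counts, which is how you connect them to real outcomes like review workload and chargebacks. The numbers are hypothetical.

    # Counts from a hypothetical fraud model over one week (illustrative only).
    true_positives = 80    # fraud correctly flagged
    false_positives = 40   # legitimate transactions flagged (review workload)
    false_negatives = 20   # fraud missed (chargebacks)

    precision = true_positives / (true_positives + false_positives)  # 0.67
    recall = true_positives / (true_positives + false_negatives)     # 0.80

    print(f"Precision: {precision:.2f}  (how much flagged traffic is really fraud)")
    print(f"Recall:    {recall:.2f}  (how much fraud we actually catch)")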

Even strong metrics are not stable. You need to monitor performance over time because production data is a moving target. The metric you track depends on the system, but the habit is the same: measure, investigate, and learn. If you cannot explain why a metric moved, you cannot fix it.

Monitoring is the boring work that saves you. Start with inputs. Are you seeing new categories, missing fields, unusual ranges, or sudden format changes? Then monitor outputs. Are scores shifting, are confidence values drifting upward, are certain groups being flagged more often? Finally monitor system health: latency, error rate, and rate limiting. If the system is slow, it will change user behaviour and it will change your data.

Alerting is where good teams become noisy teams. Too many alerts create alert fatigue, and then real problems are ignored. You want alerts that are actionable, tied to clear owners, and paired with a playbook. False positives in monitoring are not harmless. They burn trust and time.

Drift is the quiet killer. Data drift is when inputs change. Concept drift is when the meaning of the target changes, even if inputs look similar. A credit scoring model can see stable features while repayment behaviour changes during a downturn. A moderation model can see similar text while norms and tactics change.

Retraining schedules matter because drift does not wait for your roadmap. Some systems need periodic retraining. Others need trigger based retraining when drift crosses a threshold. Either way, you should treat retraining like a release, with the same discipline: evaluation, rollout, rollback, and audit trails.
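
A sketch of a trigger based retraining check, assuming you already compute a drift score per feature. The threshold value and the `open_retraining_ticket` helper are illustrative assumptions.

    DRIFT_THRESHOLD = 0.2  # illustrative: tune per feature and per system

    def open_retraining_ticket(feature: str, score: float) -> None:
        # Placeholder: in practice this creates a tracked, owned task, because
        # retraining should go through the same release discipline as a deploy.
        print(f"Retraining trigger: {feature} drift={score:.2f}")

    def check_drift(drift_scores: dict[str, float]) -> None:
        for feature, score in drift_scores.items():
            if score > DRIFT_THRESHOLD:
                open_retraining_ticket(feature, score)

    check_drift({"transaction_amount": 0.05, "merchant_category": 0.31})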

This is where governance stops being paperwork and becomes operations. Someone must own the model and the system around it. Decisions about thresholds, fallbacks, and acceptable harm are product and risk decisions, not just ML decisions.

Good governance includes documentation and traceability. You want to know which data, which features, which model version, and which thresholds produced an outcome. Human oversight is part of the design. It needs authority, not just review. And you need a shutdown plan. If the system is causing harm or you cannot understand its behaviour, you stop it, switch to a safe fallback, and investigate.
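
Traceability can be as simple as one structured record per decision. A minimal sketch, where every field name is an assumption about what your system can capture; the point is that the record exists and is queryable later.

    import json
    import time
    import uuid

    def audit_record(user_id: str, model_version: str, data_sources: list[str],
                     threshold: float, decision: str) -> str:
        # One line of structured JSON per decision: which data, which model
        # version, which threshold, and what came out. Ship it to your log store.
        record = {
            "id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "user_id": user_id,
            "model_version": model_version,
            "data_sources": data_sources,
            "threshold": threshold,
            "decision": decision,
        }
        return json.dumps(record)

    print(audit_record("u-123", "fraud-model-2024-06-01",
                       ["transactions_v3"], 0.85, "flag_for_review"))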

Governance in plain English: good, bad, best practice

Governance that survives scrutiny
This is written for the real world where someone will ask you to justify a decision, not just build a demo.
Good practice
Assign ownership. One person or team is responsible for model changes, monitoring, and incident handling. Shared responsibility is often another name for nobody being responsible.
Bad practice
Best practice

CPD evidence prompt (copy friendly)

Use this as a clean CPD entry. Keep it short and specific. If you can attach an artefact, do it.

CPD note template
What I studied
Production AI architectures, scaling and reliability constraints, and governance practices that make systems auditable and safe.
What I practised
What changed in my practice
Evidence artefact

AI system lifecycle in production

Governance surrounds the lifecycle, from data to retirement.

Data collection: quality checks, access rules, consent
Training: versioning, documentation, repeatability
Deployment: rollout, fallbacks, human oversight
Monitoring: performance, drift, latency, error rates
Retraining or retirement: update safely or shut down
Governance: ownership, traceability, review, incident response

Check your understanding of monitoring and governance

  • What is the difference between offline and online evaluation?
  • Why is accuracy alone not enough in many systems?
  • In plain terms, what does precision measure?
  • In plain terms, what does recall measure?
  • Name three monitoring areas in production AI.
  • What is alert fatigue and why is it dangerous?
  • Scenario: A feature suddenly becomes mostly null after a backend change. What should you do first?
  • What is data drift?
  • What is concept drift?
  • Why do retraining schedules matter?
  • When should a system be shut down?
