CPD timing for this level
Intermediate time breakdown
This is the first pass of a defensible timing model for this level, based on what is actually on the page: reading, labs, checkpoints, and reflection.
What changes at this level
Level expectations
I want each level to feel independent, but also clearly deeper than the last. This panel makes the jump explicit so the value is obvious.
Scenario based evaluation and pipeline decisions, including drift and governance basics.
Not endorsed by a certification body. This is my marking standard for consistency and CPD evidence.
- An evaluation plan for one real decision: metrics, thresholds, and what errors cost you.
- A small prompt or RAG workflow note: inputs, guardrails, tests, and a red-team example.
- A monitoring checklist: drift signals, quality sampling, and a clear rollback or disable plan.
AI Intermediate
CPD tracking
Fixed hours for this level: 10. Time for the timed assessment is included once, on a pass.
I write this level to be usable as CPD evidence. It also deliberately covers skills that map well to respected programmes, without claiming endorsement:
- BCS Foundation Certificate in Artificial Intelligence: concepts, lifecycle thinking, and responsible practice.
- NIST AI Risk Management Framework (AI RMF 1.0): risk framing, measurement, and governance habits.
- ISO/IEC 23894: AI risk management vocabulary and operational controls.
- Microsoft Azure AI Engineer (AI-102): practical evaluation, deployment patterns, and monitoring instincts.
Models, parameters and training dynamics
A model is still a function that turns input into output. At Intermediate level, the useful question is what kind of function it is and what it can store. Modern models are flexible pattern machines. They are not a library of facts. They are a set of learned behaviours shaped by data and training.
The behaviours live in parameters. You can think of them as tiny knobs inside the model. Training turns those knobs until the model produces outputs that match the examples often enough.
More parameters give the model more capacity. It can represent more subtle patterns. It can also memorise noise if you let it.
Scale matters because the world is messy. If you want a model to handle more languages, more topics, or more edge cases, it usually needs more capacity and more data. Scale is not a free win. It increases cost, increases training time, and increases the number of ways training can go wrong.
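To make capacity concrete, here is a small sketch in plain Python (the layer sizes are invented) that counts the knobs in a fully connected network. Widening or deepening the network grows the count quickly, which is exactly where the extra cost and the extra ways to go wrong come from.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes: e.g. [20, 64, 64, 1] for 20 inputs, two hidden
    layers of 64 units, and a single output. Sizes are illustrative.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-output pair
        total += n_out         # one bias per output unit
    return total

# Doubling the hidden width roughly quadruples the hidden-to-hidden weights.
print(dense_parameter_count([20, 64, 64, 1]))    # 5569
print(dense_parameter_count([20, 128, 128, 1]))  # 19329
```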
Training is a loop. You show examples, the model predicts, you measure how wrong it was, and you update the parameters to be a bit less wrong next time. The measure of wrongness is the loss.
Loss is not a moral judgement. It is a learning signal.
Under the hood, the model uses a gradient to decide how to adjust parameters.
You do not need the equations yet. The intuition is enough. If the loss is high, the gradient points toward changes that should lower it. If the training setup is stable, repeating this process steadily improves performance.
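If you prefer to see the loop rather than read about it, here is a minimal sketch in plain Python: one parameter, a squared-error loss, and a hand-written gradient. The data, learning rate, and step count are illustrative choices, not a recipe.

```python
# Tiny training loop: predict, measure error, update, repeat.
# Model: y = w * x, trying to recover w = 3 from noise-free examples.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0              # the single "knob" we are turning
learning_rate = 0.05

for step in range(40):
    loss = 0.0
    grad = 0.0
    for x, y_true in data:
        y_pred = w * x                     # predict
        error = y_pred - y_true
        loss += error ** 2                 # measure how wrong we were
        grad += 2 * error * x              # gradient of squared error w.r.t. w
    w -= learning_rate * grad / len(data)  # nudge the knob downhill
    if step % 10 == 0:
        print(f"step {step:2d}  loss {loss / len(data):.4f}  w {w:.3f}")
```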
This is where overfitting shows up. The model memorises quirks of the training examples, so training performance keeps looking better while performance on new data gets worse.
Underfitting is the opposite. The model is too simple or not trained enough, so it cannot learn the pattern even on the training data.
The goal is generalisation. The model should perform well on data it has never seen, not just on the examples it trained on.
Training is expensive and brittle because small changes can have large effects. Data quality issues, hidden leakage, unstable learning rates, poor shuffling, or a mismatch between training and real inputs can collapse learning. Even when training works, the model can learn shortcuts. It might latch onto the background of an image instead of the object. It might learn the formatting of an email instead of the message. These failures are not rare. They are the default unless you design against them.
What I look for when training “works” but the system still fails
When someone tells me “the model is accurate”, I do not argue. I ask questions. Accurate on what, measured how, and compared to which baseline. Then I ask the question people avoid: what is the model using to be accurate. Shortcut learning is the usual culprit. The model found a pattern that is true in the training data and fragile in reality.
A practical example. Imagine you are classifying “urgent” support tickets. If most urgent tickets were written by the night shift for three months, a model may quietly learn that “night shift writing style” means urgent. It looks brilliant in testing and then falls apart when staffing changes. This is why I care about datasets like I care about code. They contain assumptions, and assumptions are where failures breed.
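One habit that catches this early: before training anything, check how the label co-varies with fields that should not matter. A rough sketch of that check, with invented ticket records and field names:

```python
from collections import defaultdict

# Invented ticket records: 'shift' should not predict urgency, but here it does.
tickets = [
    {"shift": "night", "urgent": True},
    {"shift": "night", "urgent": True},
    {"shift": "night", "urgent": False},
    {"shift": "day",   "urgent": False},
    {"shift": "day",   "urgent": False},
    {"shift": "day",   "urgent": True},
    {"shift": "day",   "urgent": False},
]

counts = defaultdict(lambda: [0, 0])   # shift -> [urgent, total]
for t in tickets:
    counts[t["shift"]][1] += 1
    if t["urgent"]:
        counts[t["shift"]][0] += 1

for shift, (urgent, total) in counts.items():
    print(f"{shift}: {urgent / total:.0%} urgent ({urgent}/{total})")
# If one shift dominates the urgent rate, a model can learn the shift
# (or its writing style) as a shortcut instead of genuine urgency.
```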
Training loop intuition
Predict, measure error, update, repeat.
Quick check: models and training dynamics
What is a parameter
Why does scale matter
What happens in a training loop
What is a loss function used for
What is a gradient in plain terms
What is overfitting
What is generalisation
Why is training expensive
Scenario: Your validation score improves but production quality drops. Name one realistic cause
Data, features and representation
Raw data is rarely ready for a model. Even when it looks clean to a human, it usually contains missing values, inconsistent formats, and little traps like duplicated records. A model does not understand intent. It only sees the numbers you give it, so messy inputs quietly become messy behaviour.
People call this feature engineering. In practice it is careful translation. You are turning a real world situation into signals a model can learn from. If you pick the wrong signals, the model can look accurate in testing and still fail in production because it learned the wrong shortcut.
For images, raw pixels are numbers already, but not good ones by themselves. Lighting, cropping, and camera differences can dominate the signal. For time series, the shape over time matters. Averages can erase patterns, and misaligned timestamps can create fake trends that a model happily learns.
When a model behaves strangely, look at representation before you blame the algorithm. Small encoding choices can flip what the model can and cannot learn. This is why data work is not "preprocessing". It is the main engineering work.
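As a sketch of what that translation looks like, here is a raw record becoming a fixed-length feature vector in plain Python. The fields, scaling choices, and defaults are invented. The point is that every line is a decision about what the model is allowed to see.

```python
# Turn a raw support ticket into a fixed-length feature vector.
CHANNELS = ["email", "chat", "phone"]   # known categories, one-hot encoded

def ticket_to_features(ticket):
    # Missing values need an explicit policy, not an accident.
    word_count = len(ticket.get("body", "").split())
    hours_open = ticket.get("hours_open", 0.0)

    # One-hot encode the channel; unknown channels become all zeros.
    channel = [1.0 if ticket.get("channel") == c else 0.0 for c in CHANNELS]

    # Scale rough magnitudes so one field does not dominate.
    return [word_count / 100.0, hours_open / 24.0] + channel

print(ticket_to_features({"body": "printer is on fire", "channel": "chat", "hours_open": 3}))
# [0.04, 0.125, 0.0, 1.0, 0.0]
```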
From raw data to features
How inputs become a feature vector the model can learn from.
Quick check: features and representation
Why is raw data rarely usable directly
What is a feature
What is representation in this context
Give an example of a feature that could leak the future
What is an embedding used for
Scenario: A model performs well offline, but fails on new products the business launches. What representation risk might explain it
Why can embeddings hide problems
What does dimensionality refer to
Why can high dimensional features lead to brittle models
What is noise and why does it matter
Evaluation, metrics and failure analysis
Accuracy is an easy number to like because it feels clean. The problem is that it hides what you actually care about. In a spam filter, you can get high accuracy by declaring "not spam" for almost everything, because most email is not spam. The model looks great on paper and useless in practice.
What you want depends on the job. If you are blocking fraud, a false negative means you miss a bad transaction. If you are flagging innocent customers, a false positive means you cause real harm. Evaluation is choosing what kind of mistake is acceptable and proving the system is making the right trade.
Two scores matter: how the model behaves on the data it learned from, and how it behaves on data it has never seen. Training performance is often optimistic because the model can memorise. Real world performance is harder because the inputs change and the environment changes. Good evaluation separates these on purpose.
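A toy way to see that optimism: a "model" that simply memorises its training examples scores perfectly on them and has to guess on anything new. The examples and labels below are invented.

```python
# A "model" that memorises training examples exactly.
train = {"refund please": "billing", "card declined": "billing", "app crashes": "bug"}
unseen = {"crash on login": "bug", "charged twice": "billing"}

def memoriser(text):
    # Perfect recall of training data, majority-class guess otherwise.
    return train.get(text, "billing")

def accuracy(examples):
    correct = sum(memoriser(text) == label for text, label in examples.items())
    return correct / len(examples)

print(f"training accuracy: {accuracy(train):.0%}")   # 100%
print(f"unseen accuracy:   {accuracy(unseen):.0%}")  # 50%, it never saw these
```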
For classification problems, two practical metrics are precision and recall. Precision answers "when the model says positive, how often is it right". Recall answers "of the real positives, how many did it catch". A spam filter with high recall catches most spam, but it might also block legitimate email. A fraud detector with high precision avoids annoying customers, but it might miss attacks.
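Here is a short sketch of those numbers on invented spam-filter predictions. It also shows why the "call everything not spam" trick from earlier looks accurate while catching nothing.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive="spam"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented labels: 2 spam emails out of 10.
y_true = ["spam", "ham", "ham", "ham", "spam", "ham", "ham", "ham", "ham", "ham"]
lazy   = ["ham"] * 10                      # baseline: call everything "ham"
model  = ["spam", "ham", "ham", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]

print("lazy baseline:", accuracy(y_true, lazy), precision_recall(y_true, lazy))
# 0.8 accuracy, 0.0 precision, 0.0 recall: accurate on paper, useless in practice
print("real model:   ", accuracy(y_true, model), precision_recall(y_true, model))
# 0.9 accuracy, ~0.67 precision, 1.0 recall
```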
For regression problems, you are predicting a number, like delivery time or house price. Here the question becomes "how far off are we". Metrics like mean absolute error are popular because they map to a simple story: average miss distance. Even without the formula, the idea is to measure error in the same units your users experience.
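The same idea for a regression job, with invented delivery times in minutes:

```python
def mean_absolute_error(y_true, y_pred):
    # Average miss distance, in the same units the user experiences.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual_minutes    = [30, 45, 25, 60]
predicted_minutes = [35, 40, 30, 50]
print(mean_absolute_error(actual_minutes, predicted_minutes))  # 6.25 minutes off on average
```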
Evaluation also needs a sanity check for overfitting and underfitting. Overfitting is when training looks great and real performance drops. Underfitting is when both are poor because the model cannot learn the pattern. The fix is rarely "more metrics". It is usually better data, better representation, or a simpler model that is easier to trust.
Finally, production systems fail quietly. You can ship a model that passes every offline test and still break in the real world because the input distribution shifts. A spam campaign changes writing style. A fraud ring adapts. A new product changes customer behaviour. The model is not wrong in a dramatic way. It is just slowly less useful.
This is why evaluation is not a one time exam. It is a lifecycle. You validate before release, you test on untouched data, and you keep watching after deployment. If the system starts drifting, you want to notice early and know what to do next.
The human cost of metrics (yes, it matters)
Technical people sometimes treat metrics like they are morally neutral. They are not. A false positive can be an annoyed customer, a delayed service, or a person wrongly flagged. A false negative can be money lost, harm missed, or abuse allowed through. When I say “match the metric to the decision”, I mean match it to the harm profile and the workload you are creating for humans.
I also want you to respect baselines. If a human process already catches 95 percent of bad cases, your model must beat that in a way that reduces harm, not just moves it around. If the model makes the review queue unbearable, it will be disabled. I have seen this happen more times than I would like.
CPD evidence prompt (copy friendly)
Write this as a short note for your CPD log. Keep it honest and specific.
Model evaluation in practice
Separate data, test honestly, then monitor in the real world.
Quick check: evaluation and metrics
Why can accuracy be misleading in a spam filter
What does accuracy measure
What is precision in plain terms
What is recall in plain terms
Scenario: False positives are expensive and embarrassing. Which direction do you usually tune, precision or recall
Why can training performance look better than real world performance
What is a common sign of overfitting
What is distribution shift
Why must evaluation match the real world use
Name one silent production failure mode
Deployment, monitoring and drift
Deployment is where good models go to die. The same model can behave very differently depending on latency, scaling, input validation, and how the product uses the output. A clean offline score does not protect you from a broken data pipeline, missing logging, or a workflow that encourages people to over-trust the system.
Monitoring is your early warning system. You watch three things.
- Inputs. Are users or upstream systems sending different data than before.
- Outputs. Are prediction rates, errors, and edge cases changing.
- System health. Are latency and failures rising, causing the model to be skipped or requests to time out.
Drift is often a slow change, so the first sign is a small shift in metrics, not an outage. You should design for action. Who investigates. Who can pause the feature. What is the safe fallback.
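As a sketch of what a small shift looks like in code, here is a crude input-drift check comparing a category's share between a reference window and the latest week. The threshold, field names, and what happens on an alert are all choices you would make for your own system.

```python
def category_rate(records, field, value):
    return sum(r.get(field) == value for r in records) / max(len(records), 1)

def check_input_drift(reference, recent, field, value, max_shift=0.10):
    """Flag when a category's share moves more than max_shift from the reference window."""
    ref_rate = category_rate(reference, field, value)
    new_rate = category_rate(recent, field, value)
    drifted = abs(new_rate - ref_rate) > max_shift
    return drifted, ref_rate, new_rate

# Invented windows: the share of 'mobile' traffic jumps after a product change.
reference_week = [{"channel": "mobile"}] * 30 + [{"channel": "web"}] * 70
latest_week    = [{"channel": "mobile"}] * 55 + [{"channel": "web"}] * 45

drifted, before, after = check_input_drift(reference_week, latest_week, "channel", "mobile")
if drifted:
    print(f"input drift: mobile share moved from {before:.0%} to {after:.0%}, investigate before trusting outputs")
```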
Deployment and monitoring loop
Treat model changes like a release, then watch reality.
Quick check: deployment, monitoring and drift
Why can a model fail after deployment even if offline tests look good
Name three monitoring areas for production AI
What is drift in plain terms
Why is logging important in a model service
What is a safe fallback
Scenario: Monitoring shows a sudden jump in “high risk” predictions after a product change. What is a sensible first step
What should happen when monitoring flags a serious risk
Why do timeouts matter
What is a practical sign of input drift
What is a practical sign of output drift
Responsible AI, limits and deployment risks
AI systems do not understand intent or truth. They learn patterns that were useful in the data they saw. That can look like understanding because the outputs are fluent or confident. Underneath, the model is still guessing based on correlations. If the context changes, the guess changes.
This creates two different kinds of failure. Capability limits are what the model cannot reliably do, even with good governance. A content moderation model might struggle with sarcasm or coded language. A hiring model might not detect that a job description itself is biased. A credit scoring model might be accurate on last year’s economy and wrong in a downturn.
Governance failures are when the organisation deploys a system without clear goals, boundaries, or accountability. That includes using a model outside the environment it was tested for, copying a score into decisions without challenge, or treating automation as a way to avoid responsibility. These failures are common because they feel efficient right up until they become a public incident.
Another harm is automation overreach. If a tool is good at ranking candidates, it is tempting to let it decide who gets screened out. If a score is produced, someone will use it as if it is precise. This is how misplaced trust appears. A model is not accountable. People are.
Responsible AI is an engineering discipline. It is data work, evaluation work, monitoring work, and incident response work. Ethics matters, but the day to day work is building systems that fail safely, surface uncertainty, and keep humans responsible for decisions.
AI system risk lifecycle
Where risks appear and where human review and governance must apply.
Quick check: responsible AI and deployment risks
Why do AI systems not understand intent or truth
What is the difference between a capability limit and a governance failure
Give one example of bias in a real system
Why is automation overreach risky
Scenario: A team copies a model score into a decision and says 'the model decided'. What governance failure is this
What is drift and why is it dangerous
What does human in the loop mean in practice
Why does explainability matter
What does accountability mean for an AI system
Why is responsible AI an engineering discipline
