CPD timing for this level

Intermediate time breakdown

This is the first pass of a defensible timing model for this level, based on what is actually on the page: reading, labs, checkpoints, and reflection.

Reading: 36m (5,297 words · base 27m × 1.3)
Labs: 75m (5 activities × 15m)
Checkpoints: 25m (5 blocks × 5m)
Reflection: 32m (4 modules × 8m)
Estimated guided time: 2h 48m (based on page content and disclosed assumptions)
Claimed level hours: 10h (the claim includes reattempts, deeper practice, and capstone work)

The claimed hours are higher than the current on-page estimate by about 7h. That gap is where I will add more guided practice and assessment-grade work so the hours are earned, not declared.

What changes at this level

Level expectations

I want each level to feel independent, but also clearly deeper than the last. This panel makes the jump explicit so the value is obvious.

Anchor standards (course wide)
NIST AI Risk Management Framework (AI RMF 1.0) · ISO/IEC 23894 (AI risk management)
Assessment intent
Applied

Scenario based evaluation and pipeline decisions, including drift and governance basics.

Assessment style
Format: scenario
Pass standard
Coming next

Not endorsed by a certification body. This is my marking standard for consistency and CPD evidence.

Evidence you can save (CPD friendly)
  • An evaluation plan for one real decision: metrics, thresholds, and what errors cost you.
  • A small prompt or RAG workflow note: inputs, guardrails, tests, and a red-team example.
  • A monitoring checklist: drift signals, quality sampling, and a clear rollback or disable plan.

AI Intermediate


CPD tracking

Fixed hours for this level: 10. Timed assessment time is included once on pass.

CPD and certification alignment (guidance, not endorsed):

I write this level to be usable as CPD evidence. It also deliberately covers skills that map well to respected programmes, without claiming endorsement:

  • BCS Foundation Certificate in Artificial Intelligence: concepts, lifecycle thinking, and responsible practice.
  • NIST AI Risk Management Framework (AI RMF 1.0): risk framing, measurement, and governance habits.
  • ISO/IEC 23894: AI risk management vocabulary and operational controls.
  • Microsoft Azure AI Engineer (AI-102): practical evaluation, deployment patterns, and monitoring instincts.
How to use this level (so it actually changes how you work)
I’m aiming for two audiences at the same time. If you are new, I keep the language calm and concrete. If you are technical, I keep the claims honest and the trade-offs explicit.
Good practice
Read one section, do one tool, then write one paragraph of what you observed. If you cannot explain what changed and why, go back. The goal is judgement, not vocabulary collection.
Recommended reading: Before diving into intermediate concepts, consider reading AI Fundamentals Explained for a clear overview of data, models, evaluation, and deployment from first principles.

Models, parameters and training dynamics

Concept block
Training loop shape
Training is a loop: predict, measure error, update parameters, repeat with discipline.
Assumptions
Loss matches the decision
Validation is separate
Failure modes
Overfitting
Leakage

A model is still a function that turns input into output. At Intermediate level, the useful question is what kind of function it is and what it can store. Modern models are flexible pattern machines. They are not a library of facts. They are a set of learned behaviours shaped by data and training.

The behaviours live in parameters. You can think of them as tiny knobs inside the model. Training turns knobs until the model produces outputs that match the examples often enough.
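
To make "knobs" concrete, here is a minimal sketch in Python of a model as a function with a few parameters. The weights, bias, and input values are invented for illustration, not learned from data.

# A model is a function with adjustable parameters ("knobs").
# These values are illustrative, not learned.
weights = [0.8, -1.5]   # one knob per input feature
bias = 0.2              # one extra knob

def predict(features):
    # Weighted sum of the inputs plus the bias: the simplest parametric model.
    return sum(w * x for w, x in zip(weights, features)) + bias

print(predict([2.0, 0.5]))   # 1.05 with these knob settings

Training is the process that moves those values. Nothing else about the function changes.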

A misconception I correct early
If you remember one sentence from this section, make it this: a model is not a database.
Common misconception
“The model knows facts.” What it actually learns is a set of tendencies: patterns in how inputs map to outputs. It can sound like knowledge, but it does not come with a source, a timestamp, or a guarantee.

More parameters give the model more capacity. It can represent more subtle patterns. It can also memorise noise if you let it.

Scale matters because the world is messy. If you want a model to handle more languages, more topics, or more edge cases, it usually needs more capacity and more data. Scale is not a free win. It increases cost, increases training time, and increases the number of ways training can go wrong.

Training is a loop. You show examples, the model predicts, you measure how wrong it was, and you update the parameters to be a bit less wrong next time. The measure of wrongness is the loss.

Loss is not a moral judgement. It is a learning signal.

Under the hood, the model uses a gradient to decide how to adjust parameters.

You do not need the equations yet. The intuition is enough. If the loss is high, the gradient points toward changes that should lower it. If the training setup is stable, repeating this process steadily improves performance.

This is where overfitting shows up. The model keeps improving on the training examples by memorising them, so training performance climbs while performance on new data gets worse.

Underfitting is the opposite. The model is too simple or not trained enough, so it cannot learn the pattern even on the training data.

The goal is generalisation: performance that holds up on data the model has never seen, not just on the examples it was trained on.

Training is expensive and brittle because small changes can have large effects. Data quality issues, hidden leakage, unstable learning rates, poor shuffling, or a mismatch between training and real inputs can collapse learning. Even when training works, the model can learn shortcuts. It might latch onto the background of an image instead of the object. It might learn the formatting of an email instead of the message. These failures are not rare. They are the default unless you design against them.

What I look for when training “works” but the system still fails

When someone tells me “the model is accurate”, I do not argue. I ask questions. Accurate on what, measured how, and compared to which baseline. Then I ask the question people avoid: what is the model using to be accurate. Shortcut learning is the usual culprit. The model found a pattern that is true in the training data and fragile in reality.

A practical example. Imagine you are classifying “urgent” support tickets. If most urgent tickets were written by the night shift for three months, a model may quietly learn that “night shift writing style” means urgent. It looks brilliant in testing and then falls apart when staffing changes. This is why I care about datasets like I care about code. They contain assumptions, and assumptions are where failures breed.
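
One way to surface that kind of shortcut is to slice the evaluation by a field the model should not need. A minimal sketch, with invented records and field names (shift, label, prediction):

from collections import defaultdict

# Invented evaluation records: real code would read these from your test set.
records = [
    {"shift": "night", "label": "urgent", "prediction": "urgent"},
    {"shift": "night", "label": "urgent", "prediction": "urgent"},
    {"shift": "day", "label": "urgent", "prediction": "not_urgent"},
    {"shift": "day", "label": "not_urgent", "prediction": "not_urgent"},
]

# Accuracy per slice: if "night" looks much better than "day",
# the model may have learned the shift, not the urgency.
totals, correct = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["shift"]] += 1
    correct[r["shift"]] += r["label"] == r["prediction"]

for shift, total in totals.items():
    print(shift, round(correct[shift] / total, 2))

If one slice is dramatically better than another, ask what the model is really using before celebrating the headline number.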

Training dynamics: good, bad, best practice
Good practice
Keep a baseline model and compare against it. If you cannot beat a simple baseline honestly, you are not ready for complexity.

Training loop intuition

Predict, measure error, update, repeat.

Input data -> Model -> Output
Compare output to expected result -> Loss calculation
Use gradient signal -> Parameter update
Repeat over many examples and many passes
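
Here is a minimal, runnable sketch of that loop for a one-parameter model. The examples, learning rate, and number of passes are invented; the point is the shape of the loop, not the numbers.

# One-knob model: predict y as w * x. Data and learning rate are made up.
examples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # (input, expected output)
w = 0.0                                            # parameter, deliberately wrong at the start
learning_rate = 0.05

for step in range(200):                  # many passes over the examples
    for x, target in examples:
        prediction = w * x               # predict
        error = prediction - target      # measure how wrong
        gradient = 2 * error * x         # direction that lowers the squared loss
        w -= learning_rate * gradient    # nudge the knob

print(round(w, 2))   # settles near 2.0, the pattern hidden in the examples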

Quick check: models and training dynamics

What is a parameter

Why does scale matter

What happens in a training loop

What is a loss function used for

What is a gradient in plain terms

What is overfitting

What is generalisation

Why is training expensive

Scenario: Your validation score improves but production quality drops. Name one realistic cause

Data, features and representation

Concept block
Features control behaviour
Representation choices decide what the model can notice and what it will ignore.
Assumptions
Meaning is preserved
Bias is examined
Failure modes
Proxy features
Schema drift

Raw data is rarely ready for a model. Even when it looks clean to a human, it usually contains missing values, inconsistent formats, and little traps like duplicated records. A model does not understand intent. It only sees the numbers you give it, so messy inputs quietly become messy behaviour.

The first job is to decide what the model should pay attention to. A feature can be obvious, like the total price of a basket, or subtle, like the time since last login. Good features are stable, meaningful, and available at prediction time. Bad features leak information from the future or smuggle in an identifier that lets the model memorise.
Feature work is where most AI projects live or die
My opinion: if the feature definition is vague, the model will punish you later with confidence and nonsense.
Good practice
For each feature, write what it represents, how it is calculated, and when it is available. If the answer is fuzzy, the feature is risky.

People call this feature engineering. In practice it is careful translation. You are turning a real world situation into signals a model can learn from. If you pick the wrong signals, the model can look accurate in testing and still fail in production because it learned the wrong shortcut.

Representation is the bridge between raw input and features. Sometimes the simplest representation is the best one. A single number for "days since password reset" can beat a complicated text field that mostly contains noise.

Text, images, and time series all need different treatments. For text, you might start with simple counts or categories, then move to an embedding, a learned numeric representation of meaning. Embeddings are powerful because they compress meaning into numbers, but they also hide failure modes. If your embedding model was trained on a different language or a different context, it can flatten important distinctions.

For images, raw pixels are numbers already, but not good ones by themselves. Lighting, cropping, and camera differences can dominate the signal. For time series, the shape over time matters. Averages can erase patterns, and misaligned timestamps can create fake trends that a model happily learns.

All of this affects dimensionality, the number of values in the feature vector the model sees. More dimensions can capture richer detail, but they also increase the chance of learning coincidences, the cost of training, and the risk that your model learns a brittle rule that only holds in the training set.

The hardest failures are silent. If your pipeline adds noise, random variation that carries no real signal, a model can still reduce loss by fitting patterns that do not generalise. You see an improvement on a familiar dataset and assume the model is smarter. In reality, you changed the data in a way that made the benchmark easier or leaked a hint.

When a model behaves strangely, look at representation before you blame the algorithm. Small encoding choices can flip what the model can and cannot learn. This is why data work is not "preprocessing". It is the main engineering work.

From raw data to features

How inputs become a feature vector the model can learn from.

Raw inputs: text, images, events, timestamps
Cleaning and transformation: normalize, dedupe, handle missing, align time
Features: counts, rates, categories, embeddings
Feature vector -> Model input
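
A minimal sketch of that path for a single record, with invented field names. The point is that each feature has a clear definition, consistent units, and is available at prediction time.

from datetime import datetime, timezone

# Invented raw event: real code would pull this from your source systems.
raw = {
    "signup_ts": "2024-01-10T09:30:00+00:00",
    "country": "GB",
    "basket_total_pence": 12999,
    "items": ["kettle", "toaster"],
}

NOW = datetime(2024, 3, 1, tzinfo=timezone.utc)   # fixed so the example is reproducible

def to_features(record):
    # Cleaning and transformation: parse, normalise units, derive simple signals.
    signup = datetime.fromisoformat(record["signup_ts"])
    return {
        "days_since_signup": (NOW - signup).days,        # available at prediction time
        "basket_total_gbp": record["basket_total_pence"] / 100,
        "item_count": len(record["items"]),
        "is_gb": 1 if record["country"] == "GB" else 0,  # crude category encoding
    }

print(to_features(raw))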

Quick check: features and representation

Why is raw data rarely usable directly

What is a feature

What is representation in this context

Give an example of a feature that could leak the future

What is an embedding used for

Scenario: A model performs well offline, but fails on new products the business launches. What representation risk might explain it

Why can embeddings hide problems

What does dimensionality refer to

Why can high dimensional features lead to brittle models

What is noise and why does it matter

Evaluation, metrics and failure analysis

Concept block
Evaluation is evidence
Evaluation is not one metric. It is evidence that the system is safe enough for the decision.
Assumptions
Metrics match harms
You inspect failures
Failure modes
Good average, bad edge cases
Threshold optimism

Accuracy is an easy number to like because it feels clean. The problem is that it hides what you actually care about. In a spam filter, you can get high accuracy by declaring "not spam" for almost everything, because most email is not spam. The model looks great on paper and useless in practice.

What you want depends on the job. If you are blocking fraud, a false negative means you miss a bad transaction. If you are flagging innocent customers, a false positive means you cause real harm. Evaluation is choosing what kind of mistake is acceptable and proving the system is making the right trade.

Two scores matter: how the model behaves on the data it learned from, and how it behaves on data it has never seen. Training performance is often optimistic because the model can memorise. Real world performance is harder because the inputs change and the environment changes. Good evaluation separates these on purpose.

For classification problems, two practical metrics are precision and recall. Precision answers "when the model says positive, how often is it right". Recall answers "of the real positives, how many did it catch". A spam filter with high recall catches most spam, but it might also block legitimate email. A fraud detector with high precision avoids annoying customers, but it might miss attacks.
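
A minimal sketch of why the distinction matters, on an invented, imbalanced spam sample. Accuracy looks respectable while recall shows most spam getting through.

# Toy labels: 1 = spam, 0 = not spam. Both lists are invented.
actual    = [1, 1, 1, 1] + [0] * 16
predicted = [1, 0, 0, 0] + [0] * 16

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
precision = tp / (tp + fp) if tp + fp else 0.0   # when it says spam, is it right?
recall = tp / (tp + fn) if tp + fn else 0.0      # of the real spam, how much is caught?

print(accuracy, precision, recall)   # 0.85 1.0 0.25: accurate on paper, missing most spam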

For regression problems, you are predicting a number, like delivery time or house price. Here the question becomes "how far off are we". Metrics like mean absolute error are popular because they map to a simple story: average miss distance. Even without the formula, the idea is to measure error in the same units your users experience.
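
The same "measure error in the units people feel" idea for a regression case, with invented delivery times in minutes:

# Mean absolute error: average miss distance. The times are invented.
actual_minutes    = [30, 45, 60, 25]
predicted_minutes = [35, 40, 75, 25]

mae = sum(abs(a - p) for a, p in zip(actual_minutes, predicted_minutes)) / len(actual_minutes)
print(mae)   # 6.25: on average the prediction is about six minutes off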

Evaluation also needs a sanity check for overfitting and underfitting. Overfitting is when training looks great and real performance drops. Underfitting is when both are poor because the model cannot learn the pattern. The fix is rarely "more metrics". It is usually better data, better representation, or a simpler model that is easier to trust.

Finally, production systems fail quietly. You can ship a model that passes every offline test and still break in the real world because the input distribution shifts. A spam campaign changes writing style. A fraud ring adapts. A new product changes customer behaviour. The model is not wrong in a dramatic way. It is just slowly less useful.

This is why evaluation is not a one time exam. It is a lifecycle. You validate before release, you test on untouched data, and you keep watching after deployment. If the system starts drifting, you want to notice early and know what to do next.

The human cost of metrics (yes, it matters)

Technical people sometimes treat metrics like they are morally neutral. They are not. A false positive can be an annoyed customer, a delayed service, or a person wrongly flagged. A false negative can be money lost, harm missed, or abuse allowed through. When I say “match the metric to the decision”, I mean match it to the harm profile and the workload you are creating for humans.

I also want you to respect baselines. If a human process already catches 95 percent of bad cases, your model must beat that in a way that reduces harm, not just moves it around. If the model makes the review queue unbearable, it will be disabled. I have seen this happen more times than I would like.

Evaluation: good, bad, best practice
Good practice
Pick a baseline and measure against it. Then write what you will do when the model is wrong. That sentence forces clarity.

CPD evidence prompt (copy friendly)

Write this as a short note for your CPD log. Keep it honest and specific.

CPD note template
What I studied
Training dynamics, feature representation, evaluation metrics, and why offline accuracy is not enough for production decisions.
What I practised
What changed in my practice
Evidence artefact

Model evaluation in practice

Separate data, test honestly, then monitor in the real world.

Training data: learn patterns and fit parameters
Validation data: tune choices and catch overfitting early
Test data: final check on untouched examples
Deployment: real users and real consequences
Monitoring loop: watch metrics, drift, and failure cases
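
A minimal sketch of keeping those roles separate with a plain random split. The ratios and seed are arbitrary, and for many real systems a time-based or grouped split is more honest than a random one.

import random

# Stand-in dataset of 1,000 labelled examples.
examples = list(range(1000))

rng = random.Random(42)   # fixed seed so the split is reproducible
rng.shuffle(examples)

n = len(examples)
train = examples[: int(0.7 * n)]                     # fit parameters here
validation = examples[int(0.7 * n): int(0.85 * n)]   # tune choices, catch overfitting early
test = examples[int(0.85 * n):]                      # touch once, at the end

print(len(train), len(validation), len(test))   # 700 150 150

If you tune against the test slice, it stops being a test. Keep it untouched until the final check.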

Quick check: evaluation and metrics

Why can accuracy be misleading in a spam filter

What does accuracy measure

What is precision in plain terms

What is recall in plain terms

Scenario: False positives are expensive and embarrassing. Which direction do you usually tune, precision or recall

Why can training performance look better than real world performance

What is a common sign of overfitting

What is distribution shift

Why must evaluation match the real world use

Name one silent production failure mode

Deployment, monitoring and drift

Concept block
Deployment pattern
Deployment is choosing a pattern that matches latency, cost, and safety.
Assumptions
Latency budget is known
Costs are measured
Failure modes
Hidden compute cost
Missing fallbacks

Deployment is where good models go to die. The same model can behave very differently depending on latency, scaling, input validation, and how the product uses the output. A clean offline score does not protect you from a broken data pipeline, missing logging, or a workflow that encourages people to over trust the system.

Monitoring is your early warning system. You watch three things.

  1. Inputs. Are users or upstream systems sending different data than before.
  2. Outputs. Are prediction rates, errors, and edge cases changing.
  3. System health. Are latency and error rates rising to the point where the model gets skipped or requests time out.

Drift is often a slow change, so the first sign is a small shift in metrics, not an outage. You should design for action. Who investigates. Who can pause the feature. What is the safe fallback.
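
A minimal sketch of an input-drift check: compare a recent window of one feature against a reference window and flag a shift past a threshold. The values and the 10% threshold are invented; production monitoring usually uses sturdier tests (population stability index, KS tests) across many features.

# Compare the recent mean of a feature to a reference window. Values are invented.
reference = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]   # e.g. last month's basket totals
recent = [15.3, 14.9, 15.6, 15.1, 15.4, 15.0]      # this week's

def mean(values):
    return sum(values) / len(values)

shift = abs(mean(recent) - mean(reference)) / mean(reference)

THRESHOLD = 0.10   # arbitrary: flag a relative shift above 10%
if shift > THRESHOLD:
    print(f"Input drift suspected: mean moved {shift:.0%}. Investigate before trusting scores.")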

Deployment and monitoring loop

Treat model changes like a release, then watch reality.

Validate inputs and log key fields before scoring
Run the model with a safe timeout and fallback
Track outputs, rates, and edge case failures
Watch drift and data quality over time
Decide: investigate, roll back, retrain, or update policy
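
A minimal sketch of that loop around a single prediction call: validate, log, score with a timeout, and fall back safely. model.predict, the timeout, and the fallback decision are placeholders for whatever your service actually uses.

import logging
from concurrent.futures import ThreadPoolExecutor, TimeoutError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scoring")

POOL = ThreadPoolExecutor(max_workers=4)   # module-level pool: a timed-out call keeps running in the background without blocking the return
FALLBACK = {"decision": "manual_review", "source": "fallback"}   # invented safe default

def score(request, model, timeout_s=0.5):
    # 1. Validate inputs and log key fields before scoring.
    if "amount" not in request or request["amount"] < 0:
        log.warning("invalid request, using fallback: %s", request)
        return FALLBACK
    log.info("scoring id=%s amount=%s", request.get("id"), request["amount"])

    # 2. Run the model with a safe timeout and fallback.
    future = POOL.submit(model.predict, request)
    try:
        decision = future.result(timeout=timeout_s)
    except TimeoutError:
        log.warning("model timed out after %ss, using fallback", timeout_s)
        return FALLBACK

    # 3. Track outputs so drift and edge cases stay visible.
    log.info("decision=%s source=model", decision)
    return {"decision": decision, "source": "model"}

class StubModel:   # stand-in model so the sketch runs end to end
    def predict(self, request):
        return "approve"

print(score({"id": "r1", "amount": 42}, StubModel()))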

Quick check: deployment, monitoring and drift

Why can a model fail after deployment even if offline tests look good

Name three monitoring areas for production AI

What is drift in plain terms

Why is logging important in a model service

What is a safe fallback

Scenario: Monitoring shows a sudden jump in “high risk” predictions after a product change. What is a sensible first step

What should happen when monitoring flags a serious risk

Why do timeouts matter

What is a practical sign of input drift

What is a practical sign of output drift

Responsible AI, limits and deployment risks

Concept block
Governance is enforcement
Governance works when it is enforced by the system and measured by evidence.
Assumptions
Owners exist
Evidence is reviewable
Failure modes
Policy only in text
No change control

AI systems do not understand intent or truth. They learn patterns that were useful in the data they saw. That can look like understanding because the outputs are fluent or confident. Underneath, the model is still guessing based on correlations. If the context changes, the guess changes.

This creates two different kinds of failure. Capability limits are what the model cannot reliably do, even with good governance. A content moderation model might struggle with sarcasm or coded language. A hiring model might not detect that a job description itself is biased. A credit scoring model might be accurate on last year’s economy and wrong in a downturn.

Governance failures are when the organisation deploys a system without clear goals, boundaries, or accountability. That includes using a model outside the environment it was tested for, copying a score into decisions without challenge, or treating automation as a way to avoid responsibility. These failures are common because they feel efficient right up until they become a public incident.

One practical harm is bias. Bias can come from the data, from historical decisions you trained on, or from how you define success. A hiring tool can learn to prefer proxies for past hiring patterns. A moderation system can over flag certain dialects. A credit model can punish people who have less recorded history, even if they are good payers.
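
A minimal sketch of one check that surfaces this kind of harm: compare false positive rates across groups. The group labels and outcomes are invented, and a real review would cover more metrics and more slices than this.

from collections import defaultdict

# Invented moderation outcomes. "harmful" is ground truth, "flagged" is the model's call.
rows = [
    {"dialect": "A", "harmful": 0, "flagged": 0},
    {"dialect": "A", "harmful": 0, "flagged": 0},
    {"dialect": "A", "harmful": 0, "flagged": 1},
    {"dialect": "B", "harmful": 0, "flagged": 1},
    {"dialect": "B", "harmful": 0, "flagged": 1},
    {"dialect": "B", "harmful": 0, "flagged": 0},
]

# False positive rate per group: harmless content that still got flagged.
false_positives, negatives = defaultdict(int), defaultdict(int)
for row in rows:
    if row["harmful"] == 0:
        negatives[row["dialect"]] += 1
        false_positives[row["dialect"]] += row["flagged"]

for group, total in negatives.items():
    print(group, round(false_positives[group] / total, 2))   # a large gap is a signal to investigate

A gap between groups is not proof of unfairness on its own, but it is exactly the kind of evidence that should trigger review before the system keeps making decisions.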

Another harm is automation overreach. If a tool is good at ranking candidates, it is tempting to let it decide who gets screened out. If a score is produced, someone will use it as if it is precise. This is how misplaced trust appears. A model is not accountable. People are.

Drift makes this worse because it is quiet. A hiring pipeline changes the applicant pool. A new fraud tactic changes patterns. A policy change changes what "normal" looks like. Without monitoring, you keep shipping decisions based on yesterday’s reality.

This is why human in the loop review matters. It is not a checkbox. It has to be designed. The reviewer needs context, time, and authority. If humans are only asked to rubber stamp, you have automation with a delay, not oversight.

Responsible deployment also needs explainability and accountability. Explainability can be simple, like showing which signals mattered most, or which policy rule was triggered. Accountability means a named person owns the outcome. If nobody owns the harm, harm continues.

Responsible AI is an engineering discipline. It is data work, evaluation work, monitoring work, and incident response work. Ethics matters, but the day to day work is building systems that fail safely, surface uncertainty, and keep humans responsible for decisions.

AI system risk lifecycle

Where risks appear and where human review and governance must apply.

Data collection: consent, quality checks, bias review
Training: document assumptions, track versions, limit leakage
Evaluation: test for harms, stress cases, threshold choices
Deployment: workflow design, human review, safe fallbacks
Monitoring: drift checks, complaints, incident signals
Intervention points: pause, rollback, retrain, policy change

Quick check: responsible AI and deployment risks

Why do AI systems not understand intent or truth

What is the difference between a capability limit and a governance failure

Give one example of bias in a real system

Why is automation overreach risky

Scenario: A team copies a model score into a decision and says 'the model decided'. What governance failure is this

What is drift and why is it dangerous

What does human in the loop mean in practice

Why does explainability matter

What does accountability mean for an AI system

Why is responsible AI an engineering discipline
