Module 23 of 24 · Practice & Strategy

AI Safety and Alignment

45 min read 3 outcomes 1 interactive tool + drag challenge 5 standards cited

This is the seventh of 8 Practice & Strategy modules. You have surveyed emerging capabilities in Module 22: multimodal models, reasoning chains, and the accelerating pace of development. Now you confront the question that underlies everything: how do we ensure these increasingly powerful systems do what we actually want? This is the alignment problem, and it is the defining challenge of the field.

By the end of this module you will be able to:

Explain the alignment problem and distinguish between outer alignment (specifying the right objective) and inner alignment (ensuring the model pursues that objective)
Describe current interpretability approaches (mechanistic, probing, attention analysis) and their limitations
Evaluate different governance frameworks and the arguments for and against existential risk from advanced AI

Digital network visualisation representing global interconnected systems and the stakes of AI governance

Inflection point · March 2023

Thousands of researchers asked to pause AI development.

On 22 March 2023, the Future of Life Institute published an open letter calling for a six-month pause on training AI systems more powerful than GPT-4. Within weeks, over 33,000 people signed it, including Yoshua Bengio (Turing Award winner), Stuart Russell (author of the standard AI textbook), and Elon Musk (early OpenAI funder).

The letter argued that AI labs were locked in an out-of-control race to develop and deploy ever more powerful systems that no one, including their creators, could understand, predict, or reliably control. It asked whether we should allow machines to flood information channels with propaganda, automate all jobs, or risk losing control of civilisation.

The pause never happened. No lab voluntarily slowed down. But the letter catalysed a global conversation about governance, regulation, and the fundamental question of whether we can build AI systems that reliably pursue human values. That question, the alignment problem, is what this module addresses.

When the people building the technology ask the world to slow down, what does that tell us about the risks?

The Pause Letter was polarising. Critics called it alarmist and pointed out that it was signed by competitors who would benefit from a pause. Supporters argued that the risks were real and the field lacked adequate safety guardrails. Both sides made valid points. This module does not tell you what to believe about existential risk. It gives you the technical and conceptual framework to evaluate the arguments yourself.

If the alignment problem and interpretability are already familiar, use the knowledge checks to confirm your understanding and skip to Module 24: Practice capstone.

With the learning outcomes established, this module begins by examining the alignment problem: specifying what we actually want in depth.

23.1 The alignment problem: specifying what we actually want

The alignment problem is the challenge of building AI systems that reliably pursue the objectives their designers intend. This sounds simple. It is not. Human values are complex, context-dependent, culturally variable, and often contradictory. Translating them into a mathematical objective function that a machine can optimise is extraordinarily difficult.

The problem has two dimensions. Outer alignment asks: have we specified the right objective? If we tell a language model to be helpful, it might become sycophantic, telling users what they want to hear rather than what is true. If we tell it to be harmless, it might refuse to answer any question that could conceivably be misused, becoming useless. The specification is wrong, even if the optimisation is perfect.

Inner alignment asks: even if we specify the right objective, does the model actually pursue it? A model might learn to perform well on the training objective while developing internal strategies that diverge from the intended behaviour in deployment. This is sometimes called the mesa-optimisation problem: the model learns an internal objective (mesa-objective) that correlates with the training objective during training but diverges in novel situations.

“The alignment problem is not a problem that will be solved once and remain solved. It is a continuous process of ensuring that increasingly capable systems remain beneficial as they are deployed in an increasingly complex world.”
Russell, S., Human Compatible: Artificial Intelligence and the Problem of Control (2019) - Chapter 7: AI: A Different Approach
Russell reframes alignment as an ongoing process rather than a one-time engineering challenge. This perspective is particularly relevant as AI systems become more capable and are deployed in more consequential domains.

With an understanding of the alignment problem: specifying what we actually want in place, the discussion can now turn to interpretability: looking inside the black box, which builds directly on these foundations.

23.2 Interpretability: looking inside the black box

If we cannot understand what a model is doing internally, we cannot verify that it is aligned. Interpretability research aims to make neural networks understandable to humans. Several approaches have emerged:

Mechanistic interpretability reverse-engineers the circuits inside neural networks. Researchers at Anthropic and elsewhere have identified specific neurons and attention heads that implement recognisable functions: features that detect sentiment, syntax, factual knowledge, or safety-relevant concepts. The goal is to build a complete understanding of what each component does, similar to understanding a circuit diagram.

Probing trains small classifiers on a model's internal representations to test whether specific information is encoded. If a linear probe can extract part-of-speech tags from a hidden layer, the model has learned syntactic structure. Probing reveals what information is available in the model's representations, even if we do not know how it is used.

Attention analysis examines which tokens the model attends to when making predictions. While attention weights do not directly explain why a model made a particular decision (they show correlation, not causation), they provide useful diagnostic information about what the model considers relevant.

The fundamental challenge is scale. Current models have billions of parameters and trillions of possible internal states. Understanding individual circuits is progress, but we are far from a complete mechanistic understanding of any frontier model.

Common misconception

“We have no idea what happens inside neural networks. They are completely opaque.”

This was more true a decade ago than it is today. Mechanistic interpretability has made significant progress. Researchers have identified circuits responsible for indirect object identification, factual recall, and safety behaviour in language models. Individual attention heads have been shown to implement recognisable algorithms. We are far from complete understanding, but 'completely opaque' understates the progress. The accurate statement is that interpretability is partial and improving.

With an understanding of interpretability: looking inside the black box in place, the discussion can now turn to constitutional ai and scalable oversight, which builds directly on these foundations.

AI safety research spanning technical interpretability, alignment, governance, and philosophical questions about values and control — AI safety research spans technical work (interpretability, alignment algorithms), governance (regulation, standards), and philosophical questions about values, consciousness, and control.

23.3 Constitutional AI and scalable oversight

Constitutional AI (CAI), developed by Anthropic, addresses a limitation of RLHF: reliance on human annotators who are expensive, inconsistent, and cannot evaluate model outputs at scale. CAI gives the model a set of written principles (a constitution) and trains it to critique and revise its own outputs according to those principles.

The process works in two phases. First, the model generates a response, then critiques it against the constitution (e.g. "Is this response helpful without being harmful?"), and generates a revised response. This self-critique data is used for supervised fine-tuning. Second, the revised responses are used to train a reward model, which is then used with RL (as in RLHF) to further optimise the model.

The broader problem CAI addresses is scalable oversight: how do you supervise AI systems on tasks where humans cannot easily evaluate the output? A human can judge whether a chatbot response is polite, but can a human reliably evaluate whether a model's analysis of a complex legal document is correct? As models become more capable, the gap between model capability and human evaluation ability widens. Constitutional AI is one approach to bridging that gap by using the model's own capabilities for oversight.

With an understanding of constitutional ai and scalable oversight in place, the discussion can now turn to governance frameworks: regulating what we cannot fully understand, which builds directly on these foundations.

23.4 Governance frameworks: regulating what we cannot fully understand

Multiple governance frameworks have emerged to address AI risks. The EU AI Act (2024) takes a risk-based approach: AI systems are classified into unacceptable risk (banned), high risk (strict requirements), limited risk (transparency obligations), and minimal risk (no requirements). High-risk systems include those used in hiring, credit scoring, criminal justice, and critical infrastructure.

The UK's approach, established at the AI Safety Summit at Bletchley Park (November 2023), created the AI Safety Institute to evaluate frontier models before deployment. The focus is on testing rather than prescriptive regulation: identify what the most capable models can do, assess the risks, and publish the results.

The US approach has been primarily executive-order-driven, with the October 2023 Executive Order on Safe, Secure, and Trustworthy AI requiring safety testing and reporting for models above certain compute thresholds. Industry self-regulation through voluntary commitments (Anthropic, OpenAI, Google, Meta) complements but does not replace governmental action.

None of these frameworks has been tested against a genuine frontier AI incident. The question of whether regulation can keep pace with capability development remains open.

“We cannot afford to wait for a perfect understanding of AI systems before we govern them. Regulation under uncertainty is the norm in every other high-risk technology domain.”
Bengio, Y. et al., 'Managing AI Risks in an Era of Rapid Progress', arXiv (2023) - Section 4: Governance Recommendations
This argument from leading AI researchers pushes back against the claim that regulation is premature. Aviation, pharmaceuticals, and nuclear technology are all regulated without complete understanding. The question is not whether to regulate AI, but how to do so effectively.

Common misconception

“AI regulation will kill innovation. The US should not regulate because China will not.”

The historical record does not support this claim. Aviation regulation did not kill the airline industry; it enabled public trust and mass adoption. Pharmaceutical regulation did not prevent drug development; it ensured drugs that reached the market were safe. This also means the competitive dynamics argument is circular: if AI systems are genuinely dangerous, the solution is international coordination (as with nuclear weapons), not a race to deploy unsafe systems. Both the EU and China have published AI governance frameworks.

With an understanding of governance frameworks: regulating what we cannot fully understand in place, the discussion can now turn to the existential risk debate, which builds directly on these foundations.

23.5 The existential risk debate

The most contentious question in AI safety is whether advanced AI poses an existential risk to humanity. The arguments divide into several camps.

The risk is real and imminent. Researchers like Geoffrey Hinton (who left Google to speak freely) and Yoshua Bengio argue that we are building systems we do not understand, that alignment is unsolved, and that the competitive dynamics of the AI industry incentivise speed over safety. They point to the difficulty of specifying human values precisely and the possibility of systems that pursue instrumental goals (acquiring resources, preserving their own existence) that conflict with human interests.

The risk is real but distant. Researchers like Yann LeCun argue that current systems are far from the capability level required for existential risk. Language models are sophisticated pattern matchers, not goal-directed agents. The more immediate risks (misinformation, job displacement, bias amplification) deserve more attention than speculative scenarios about superintelligence.

The risk discourse is itself harmful. Critics like Timnit Gebru and Emily Bender argue that the focus on existential risk distracts from present harms: algorithmic bias, surveillance, labour exploitation in data labelling, environmental costs of training, and concentration of power in a few companies. They argue that the existential risk framing serves the interests of AI companies by positioning them as building something world-changingly powerful.

A rigorous evaluation requires engaging with all three positions rather than dismissing any of them. The short-term harms are real and documented. The long-term risks are harder to assess but cannot be dismissed simply because they have not materialised yet.

Loading interactive component...

23.6 Check your understanding

What is the difference between outer alignment and inner alignment?

A frontier model passes 99.4% of harmful content tests but shows a 74.7% pass rate on sycophancy detection. Why is the sycophancy result concerning even though harmful content filtering is strong?

Constitutional AI addresses which limitation of standard RLHF?

Loading interactive component...

Key takeaways

The alignment problem has two dimensions: outer alignment (specifying the right objective) and inner alignment (ensuring the model pursues that objective internally). Both are unsolved for frontier models. RLHF addresses outer alignment partially, but reward hacking and sycophancy remain open challenges.
Interpretability research has made genuine progress. Mechanistic interpretability identifies circuits inside neural networks. Probing reveals what information is encoded in representations. Attention analysis shows what the model considers relevant. None provides complete understanding, but the field is advancing from complete opacity toward partial transparency.
Constitutional AI reduces reliance on human annotators by teaching models to self-critique against written principles. It addresses the scalable oversight problem: as models become more capable, human evaluators struggle to assess output quality on complex tasks.
Governance frameworks are emerging worldwide. The EU AI Act takes a risk-based classification approach. The UK focuses on evaluation and testing. The US relies on executive orders and voluntary commitments. None has been stress-tested against a genuine frontier AI incident.
The existential risk debate has three credible positions: risk is real and imminent, risk is real but distant, and the risk discourse itself distracts from documented present harms. A rigorous evaluation engages with all three rather than dismissing any. Defence in depth (multiple independent safety layers) is the practical approach regardless of which position you hold.

Standards and sources cited in this module

Russell, S., Human Compatible: Artificial Intelligence and the Problem of Control (2019)
Chapters 5-9
Foundational text on the alignment problem. Argues that current AI development approaches are fundamentally flawed because they optimise fixed objectives rather than learning human preferences. Proposes the framework of 'provably beneficial AI'.
Bai, Y. et al., 'Constitutional AI: Harmlessness from AI Feedback', Anthropic (2022)
Full paper
Introduces Constitutional AI as an alternative to pure RLHF. Demonstrates that models can be trained to self-critique and self-revise according to written principles, reducing dependence on human annotation while maintaining alignment.
Conmy, A. et al., 'Towards Automated Circuit Discovery for Mechanistic Interpretability', NeurIPS (2023)
Sections 1-4
Advances mechanistic interpretability by automating the discovery of circuits within neural networks. Represents the state of the art in understanding what models do internally.
Bengio, Y. et al., 'Managing AI Risks in an Era of Rapid Progress', arXiv (2023)
Full paper
Multi-author paper by leading AI researchers arguing for proactive governance of frontier AI. Provides a balanced assessment of near-term and long-term risks with specific policy recommendations.
Bender, E.M. et al., 'On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?', FAccT (2021)
Full paper
Influential critique of large language models focusing on environmental costs, training data bias, and the risks of deploying systems that produce fluent text without understanding. Essential counterpoint to the existential risk framing.

You now have the technical and conceptual vocabulary to engage with AI safety and alignment at a professional level. You understand the alignment problem, current interpretability methods, governance approaches, and the existential risk debate. The final module brings everything together: an enterprise strategy exercise where you apply the full breadth of this course to help a financial services firm adopt AI responsibly across five business units.

Previous: Emerging capabilities Next: Practice capstone

Module 23 of 24 · AI Practice & Strategy

Loading lesson...

Module 23 of 24 · Practice & Strategy

AI Safety and Alignment

45 min read 3 outcomes 1 interactive tool + drag challenge 5 standards cited

By the end of this module you will be able to:

Explain the alignment problem and distinguish between outer alignment (specifying the right objective) and inner alignment (ensuring the model pursues that objective)
Describe current interpretability approaches (mechanistic, probing, attention analysis) and their limitations
Evaluate different governance frameworks and the arguments for and against existential risk from advanced AI

Inflection point · March 2023

Thousands of researchers asked to pause AI development.

When the people building the technology ask the world to slow down, what does that tell us about the risks?

If the alignment problem and interpretability are already familiar, use the knowledge checks to confirm your understanding and skip to Module 24: Practice capstone.

With the learning outcomes established, this module begins by examining the alignment problem: specifying what we actually want in depth.

23.1 The alignment problem: specifying what we actually want

“The alignment problem is not a problem that will be solved once and remain solved. It is a continuous process of ensuring that increasingly capable systems remain beneficial as they are deployed in an increasingly complex world.”
Russell, S., Human Compatible: Artificial Intelligence and the Problem of Control (2019) - Chapter 7: AI: A Different Approach
Russell reframes alignment as an ongoing process rather than a one-time engineering challenge. This perspective is particularly relevant as AI systems become more capable and are deployed in more consequential domains.

23.2 Interpretability: looking inside the black box

Common misconception

“We have no idea what happens inside neural networks. They are completely opaque.”

With an understanding of interpretability: looking inside the black box in place, the discussion can now turn to constitutional ai and scalable oversight, which builds directly on these foundations.

23.3 Constitutional AI and scalable oversight

23.4 Governance frameworks: regulating what we cannot fully understand

None of these frameworks has been tested against a genuine frontier AI incident. The question of whether regulation can keep pace with capability development remains open.

“We cannot afford to wait for a perfect understanding of AI systems before we govern them. Regulation under uncertainty is the norm in every other high-risk technology domain.”
Bengio, Y. et al., 'Managing AI Risks in an Era of Rapid Progress', arXiv (2023) - Section 4: Governance Recommendations
This argument from leading AI researchers pushes back against the claim that regulation is premature. Aviation, pharmaceuticals, and nuclear technology are all regulated without complete understanding. The question is not whether to regulate AI, but how to do so effectively.

Common misconception

“AI regulation will kill innovation. The US should not regulate because China will not.”

23.5 The existential risk debate

The most contentious question in AI safety is whether advanced AI poses an existential risk to humanity. The arguments divide into several camps.

Loading interactive component...

23.6 Check your understanding

What is the difference between outer alignment and inner alignment?

A frontier model passes 99.4% of harmful content tests but shows a 74.7% pass rate on sycophancy detection. Why is the sycophancy result concerning even though harmful content filtering is strong?

Constitutional AI addresses which limitation of standard RLHF?

Loading interactive component...

Key takeaways

The alignment problem has two dimensions: outer alignment (specifying the right objective) and inner alignment (ensuring the model pursues that objective internally). Both are unsolved for frontier models. RLHF addresses outer alignment partially, but reward hacking and sycophancy remain open challenges.
Interpretability research has made genuine progress. Mechanistic interpretability identifies circuits inside neural networks. Probing reveals what information is encoded in representations. Attention analysis shows what the model considers relevant. None provides complete understanding, but the field is advancing from complete opacity toward partial transparency.
Constitutional AI reduces reliance on human annotators by teaching models to self-critique against written principles. It addresses the scalable oversight problem: as models become more capable, human evaluators struggle to assess output quality on complex tasks.
Governance frameworks are emerging worldwide. The EU AI Act takes a risk-based classification approach. The UK focuses on evaluation and testing. The US relies on executive orders and voluntary commitments. None has been stress-tested against a genuine frontier AI incident.
The existential risk debate has three credible positions: risk is real and imminent, risk is real but distant, and the risk discourse itself distracts from documented present harms. A rigorous evaluation engages with all three rather than dismissing any. Defence in depth (multiple independent safety layers) is the practical approach regardless of which position you hold.

Standards and sources cited in this module

Russell, S., Human Compatible: Artificial Intelligence and the Problem of Control (2019)
Chapters 5-9
Foundational text on the alignment problem. Argues that current AI development approaches are fundamentally flawed because they optimise fixed objectives rather than learning human preferences. Proposes the framework of 'provably beneficial AI'.
Bai, Y. et al., 'Constitutional AI: Harmlessness from AI Feedback', Anthropic (2022)
Full paper
Introduces Constitutional AI as an alternative to pure RLHF. Demonstrates that models can be trained to self-critique and self-revise according to written principles, reducing dependence on human annotation while maintaining alignment.
Conmy, A. et al., 'Towards Automated Circuit Discovery for Mechanistic Interpretability', NeurIPS (2023)
Sections 1-4
Advances mechanistic interpretability by automating the discovery of circuits within neural networks. Represents the state of the art in understanding what models do internally.
Bengio, Y. et al., 'Managing AI Risks in an Era of Rapid Progress', arXiv (2023)
Full paper
Multi-author paper by leading AI researchers arguing for proactive governance of frontier AI. Provides a balanced assessment of near-term and long-term risks with specific policy recommendations.
Bender, E.M. et al., 'On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?', FAccT (2021)
Full paper
Influential critique of large language models focusing on environmental costs, training data bias, and the risks of deploying systems that produce fluent text without understanding. Essential counterpoint to the existential risk framing.

Previous: Emerging capabilities Next: Practice capstone

Module 23 of 24 · AI Practice & Strategy