Security and ethics ยท Module 1

The threat landscape

AI agents face unique security challenges that traditional software does not.

40 min 3 outcomes Security and ethics

Previously

Start with Security and ethics

Critical understanding of AI security threats and responsible deployment.

This module

The threat landscape

AI agents face unique security challenges that traditional software does not.

Next

Secure implementation

Every piece of data that enters your agent system is a potential attack vector.

Progress

Mark this module complete when you can explain it without rereading every paragraph.

Why this matters

OWASP maintains a widely used list of risks and mitigations for LLM and generative AI applications.

What you will be able to do

  • 1 Identify major threats that are specific to agent systems.
  • 2 Explain prompt injection and why it is hard to eliminate completely.
  • 3 Assess risk based on what tools an agent can access.

Before you begin

  • Core concepts and practical building context
  • Awareness of misuse patterns and safety boundaries

Common ways people get this wrong

  • Prompt injection. Hidden instructions change behaviour, often by asking the agent to ignore its rules.
  • Data exfiltration. The agent leaks secrets through output, logs, or tool parameters.

Main idea at a glance

Diagram

Stage 1

User Input

Any instruction from a user or attacker-controlled source

This is the entry point where prompt injection begins

4.1.1 Understanding AI Agent Threats

AI agents face unique security challenges that traditional software does not. When you give an AI the ability to act in the world (send emails, write files, execute code, browse the web), you create attack surfaces that did not exist before.

I think of it this way. A chatbot that can only respond with text has limited attack potential. An agent that can access your email, calendar, and file system is a completely different risk profile.

OWASP Top 10 for LLM and generative AI applications

OWASP maintains a widely used list of risks and mitigations for LLM and generative AI applications. The 2025 list is a sensible starting point for agent style systems [Source].

Diagram

Stage 1

LLM01. Prompt Injection

Malicious instructions inserted into prompts, causing the model to ignore original instructions and follow attacker commands instead

I think this is the single highest-risk vulnerability in agent systems today

Let me walk you through the most critical threats.

OWASP Top 10 for Agentic Applications

OWASP published a separate Agentic Applications Top 10 in December 2025 [Source]. This distinction matters. The LLM Top 10 focuses on risks at the model and application layer, such as prompt injection, leakage, and insecure output handling. The Agentic Top 10 focuses on what changes when you give a system tools, memory, delegated actions, and the ability to persist across steps.

I think of it this way. The LLM Top 10 asks "what can go wrong with the brain?" The Agentic Top 10 asks "what can go wrong when you give that brain arms and legs?"

Diagram

Stage 1

ASI01. Agent Goal Hijack

Malicious instructions embedded in content redirect the agent from its intended objective to the attacker's goal

I think of this as the agent equivalent of a takeover. It is the most serious agentic risk

Let me walk through the five most critical risks from this list.

ASI01. Agent Goal Hijack

Agent Goal Hijack (ASI01)

An attack that redirects an agent away from its intended objective by embedding malicious instructions in content the agent processes. The agent's goal is "hijacked" so it works towards the attacker's objective instead of the user's.

You might be thinking "this sounds like prompt injection" and you are partly right. Agent goal hijack builds on prompt injection but goes further. A simple chatbot that gets prompt-injected might produce a rude response. An agent that gets its goal hijacked might spend hours systematically exfiltrating data, purchasing products, or modifying files because it has the autonomy to carry out multi-step plans.

The defence here is to constrain the agent's action space. If the CV review agent cannot send emails, the second instruction fails even if the goal hijack succeeds. Limit what agents can do, not just what they are told.

ASI02. Tool Misuse and Exploitation

Tool Misuse (ASI02)

When an agent uses a legitimate, authorised tool in an unintended or harmful way because its reasoning has been manipulated through crafted inputs or corrupted context.

This is one of the risks that keeps me up at night. The tools themselves are fine. The permissions are correct. The API keys are valid. But the agent's decision about how to use them has been influenced by an attacker.

ASI06. Memory and Context Poisoning

Memory and Context Poisoning (ASI06)

Attacks that corrupt an agent's persistent memory, RAG knowledge base, or conversation context so that the agent operates on false information in future interactions.

This is far worse for agents than for simple chatbots. A chatbot with a poisoned RAG store might give wrong answers to questions. An agent with poisoned memory might take wrong actions repeatedly over days or weeks, because it "remembers" false facts and uses them to make decisions.

The defence is to treat memory and RAG stores as attack surfaces. Validate what goes in. Version control your knowledge base. Monitor for unexpected changes. And never let agents write to their own long-term memory without oversight.

ASI08. Cascading Failures

Cascading Failures (ASI08)

When a small error, misinterpretation, or security breach in one agent propagates through connected agent systems, amplifying the damage at each step.

Multi-agent systems are becoming more common. You might have an agent that gathers data, another that analyses it, and a third that takes action based on the analysis. If the first agent makes an error, that error flows downstream. Each agent treats the previous agent's output as trusted input. The error grows.

The principle here is that agents in a chain should not blindly trust each other. Build validation checks between agents. Set thresholds for when a human must review. Log every inter-agent handoff so you can trace the failure path.

ASI10. Rogue Agents

Rogue Agents (ASI10)

Agents that have been compromised, misconfigured, or that exhibit emergent misaligned behaviour while continuing to appear legitimate and trustworthy to users and other systems.

This is the scenario that gets the most attention in the press, but it is real. A rogue agent does not announce itself. It continues to produce mostly correct outputs while subtly working against the user's interests. It might be the result of a supply chain attack, a fine-tuning poisoning attack, or emergent behaviour from poorly specified objectives.

4.1.2 LLM01. Prompt injection

Prompt injection is the single biggest security risk facing AI agents today. It is also, unfortunately, one that cannot be completely solved. Let me explain why.

Prompt Injection

An attack where malicious instructions are inserted into an AI system's input, causing it to ignore its original instructions and follow the attacker's commands instead.

How it works

When you interact with an AI agent, your message gets combined with the system's instructions into a single prompt. The AI has no way to distinguish between "official" instructions from the developer and "unofficial" instructions from you, the user, or from content the agent processes.

Diagram

Stage 1

System Instructions

The base instructions that define the agent's role, behaviour, and constraints

I think of system instructions as the intended rules, but they are not actually special to the model

Types of prompt injection

1. Direct Prompt Injection

The user directly inputs malicious instructions.

Modern LLMs have some resistance to obvious direct injections, but creative attackers find ways around these defences. The cat-and-mouse game continues.

2. Indirect Prompt Injection

This is more dangerous. Malicious instructions are hidden in content the AI processes, not in the user's direct input.

Diagram

Stage 1

Attacker Plants Instructions

An attacker embeds malicious instructions as white text, comments, or other hidden content in a document or website

I think this is particularly insidious because the user cannot see the attack

Why prompt injection is hard to eliminate

In practice, prompt injection is a risk you reduce, not a problem you solve once and forget. The most reliable approach is layered controls and safe failure, rather than a single "magic" detector [Source].

  1. No clean boundary. LLMs struggle to distinguish between instructions and data. Everything is combined into one prompt. There is no direct equivalent of prepared statements.

  2. Probabilistic behaviour. AI behaviour is not deterministic. A defence that works most of the time still fails sometimes. At scale, small failure rates become frequent incidents.

  3. Adversarial creativity. Attackers actively adapt. They test prompts until they find a bypass. Jailbreaks often evolve faster than defensive patterns.

What this means for your agents

  • Never give AI agents access to truly sensitive operations without human approval

  • Assume any AI system can be manipulated given sufficient attacker motivation

  • Design systems to fail safely when manipulation occurs

A useful scoping heuristic

Be especially careful when one agent has all three of these properties at once: access to untrusted data, the ability to change state, and unrestricted tool use. That combination sharply increases the chance of harmful prompt injection or tool misuse. Treat it as a signal to reduce permissions or add human approval.

My opinion is that Meta's Rule of Two is one of the clearest mental models we have for agent security. Before you build, ask yourself: does this agent read untrusted content, change things in the real world, and use tools? If the answer is yes to all three, you need to rethink the design or add very strong guardrails.

๐ŸŽฏ Interactive. Prompt injection defence practice

This hands-on practice tool helps you understand prompt injection attack patterns and how to defend against them. Study attack examples, test your own inputs for suspicious patterns, and learn about defence in depth strategies.

Interactive lab

Prompt Injection Defense

This module includes an interactive practice component. Open the deeper tool or workspace step when you want to test the idea rather than only read it.

4.1.3 Supply Chain Vulnerabilities

Your AI agent does not exist in isolation. It depends on dozens, sometimes hundreds, of external components.

Diagram

Stage 1

Your Agent Code

The code you write that orchestrates tools and interacts with models

I think your code is only as safe as its dependencies

Supply chain problems are common in package ecosystems. Typosquatting, compromised maintainers, and malicious updates happen in every language. The safest default is to treat anything you install as potentially hostile, then design controls that limit the blast radius.

Protection measures

  1. Pin dependency versions to specific releases you have audited

  2. Use vulnerability scanning tools like npm audit or pip-audit

  3. Verify package authenticity through checksums and signatures

  4. Maintain Software Bills of Materials (SBOM) for all deployments

# Good: Pinned versions in requirements.txt
langchain==0.1.5
ollama==0.1.7
requests==2.31.0

# Bad: Unpinned versions (dangerous!)
langchain
ollama
requests

4.1.4 Risk Assessment by Deployment Scenario

4.1.4 Risk Assessment by Deployment Scenario

Not all AI deployments carry the same risk. A personal assistant running on your laptop has fundamentally different risks than a customer-facing chatbot handling payment information.

Proportionate Security Approach

For Personal/Local Use:

  • โœ… Use local models (Ollama)

  • โœ… Keep software updated

  • โœ… Basic input validation

  • โš ๏ธ Do not connect to sensitive accounts

For Team/Business Use:

  • โœ… All of the above

  • โœ… Role-based access control

  • โœ… Audit logging

  • โœ… Regular security reviews

  • โš ๏ธ Limit external data access

For Public Deployment:

  • โœ… All of the above

  • โœ… Professional security audit

  • โœ… Continuous monitoring

  • โœ… Incident response plan

  • โœ… Insurance/liability coverage

  • โœ… Human-in-the-loop for critical actions

Mental model

Threats follow the tool path

Most agent threats are about untrusted input reaching powerful tools or sensitive data.

  1. 1

    Untrusted input

  2. 2

    Agent

  3. 3

    Tool call

  4. 4

    Sensitive data

  5. 5

    Harm

Assumptions to keep in mind

  • All input is untrusted. Treat user text, tool output, and retrieved documents as attacker controlled until proven otherwise.
  • Tools are restricted. An agent with unrestricted tools is a privileged account with no training.

Failure modes to notice

  • Prompt injection. Hidden instructions change behaviour, often by asking the agent to ignore its rules.
  • Data exfiltration. The agent leaks secrets through output, logs, or tool parameters.

Key terms

Agent Goal Hijack (ASI01)
An attack that redirects an agent away from its intended objective by embedding malicious instructions in content the agent processes. The agent's goal is "hijacked" so it works towards the attacker's objective instead of the user's.
Tool Misuse (ASI02)
When an agent uses a legitimate, authorised tool in an unintended or harmful way because its reasoning has been manipulated through crafted inputs or corrupted context.
Memory and Context Poisoning (ASI06)
Attacks that corrupt an agent's persistent memory, RAG knowledge base, or conversation context so that the agent operates on false information in future interactions.
Cascading Failures (ASI08)
When a small error, misinterpretation, or security breach in one agent propagates through connected agent systems, amplifying the damage at each step.
Rogue Agents (ASI10)
Agents that have been compromised, misconfigured, or that exhibit emergent misaligned behaviour while continuing to appear legitimate and trustworthy to users and other systems.
Prompt Injection
An attack where malicious instructions are inserted into an AI system's input, causing it to ignore its original instructions and follow the attacker's commands instead.
A useful scoping heuristic
Be especially careful when one agent has all three of these properties at once: access to untrusted data, the ability to change state, and unrestricted tool use. That combination sharply increases the chance of harmful prompt injection or tool misuse. Treat it as a signal to reduce permissions or add human approval.

Check yourself

Quick check. Threat landscape

0 of 4 opened

Why are agents riskier than chatbots

Because they can take actions using tools, which expands the attack surface and potential impact.

What is prompt injection in one sentence

An attempt to smuggle malicious instructions into what the model treats as its prompt.

Scenario. Your agent reads web pages and can send emails. What is a realistic indirect prompt injection risk

A web page contains hidden instructions that trick the agent into emailing sensitive data to an attacker.

What is the most reliable way to reduce risk from prompt injection

Defence in depth. Limit tool permissions, require approval for high impact actions, validate inputs and outputs, and log everything.

Artefact and reflection

Artefact

A short threat list for an agent you want to build.

Reflection

Where in your work would identify major threats that are specific to agent systems. change a decision, and what evidence would make you trust that change?

Optional practice

Write down one realistic prompt injection attempt for your own workflow.