Security and ethics ยท Module 1
The threat landscape
AI agents face unique security challenges that traditional software does not.
Previously
Start with Security and ethics
Critical understanding of AI security threats and responsible deployment.
This module
The threat landscape
AI agents face unique security challenges that traditional software does not.
Next
Secure implementation
Every piece of data that enters your agent system is a potential attack vector.
Progress
Mark this module complete when you can explain it without rereading every paragraph.
Why this matters
OWASP maintains a widely used list of risks and mitigations for LLM and generative AI applications.
What you will be able to do
- 1 Identify major threats that are specific to agent systems.
- 2 Explain prompt injection and why it is hard to eliminate completely.
- 3 Assess risk based on what tools an agent can access.
Before you begin
- Core concepts and practical building context
- Awareness of misuse patterns and safety boundaries
Common ways people get this wrong
- Prompt injection. Hidden instructions change behaviour, often by asking the agent to ignore its rules.
- Data exfiltration. The agent leaks secrets through output, logs, or tool parameters.
Main idea at a glance
Diagram
Stage 1
User Input
Any instruction from a user or attacker-controlled source
This is the entry point where prompt injection begins
4.1.1 Understanding AI Agent Threats
AI agents face unique security challenges that traditional software does not. When you give an AI the ability to act in the world (send emails, write files, execute code, browse the web), you create attack surfaces that did not exist before.
I think of it this way. A chatbot that can only respond with text has limited attack potential. An agent that can access your email, calendar, and file system is a completely different risk profile.
OWASP Top 10 for LLM and generative AI applications
OWASP maintains a widely used list of risks and mitigations for LLM and generative AI applications. The 2025 list is a sensible starting point for agent style systems [Source].
Diagram
Stage 1
LLM01. Prompt Injection
Malicious instructions inserted into prompts, causing the model to ignore original instructions and follow attacker commands instead
I think this is the single highest-risk vulnerability in agent systems today
Let me walk you through the most critical threats.
OWASP Top 10 for Agentic Applications
OWASP published a separate Agentic Applications Top 10 in December 2025 [Source]. This distinction matters. The LLM Top 10 focuses on risks at the model and application layer, such as prompt injection, leakage, and insecure output handling. The Agentic Top 10 focuses on what changes when you give a system tools, memory, delegated actions, and the ability to persist across steps.
I think of it this way. The LLM Top 10 asks "what can go wrong with the brain?" The Agentic Top 10 asks "what can go wrong when you give that brain arms and legs?"
Diagram
Stage 1
ASI01. Agent Goal Hijack
Malicious instructions embedded in content redirect the agent from its intended objective to the attacker's goal
I think of this as the agent equivalent of a takeover. It is the most serious agentic risk
Let me walk through the five most critical risks from this list.
ASI01. Agent Goal Hijack
Agent Goal Hijack (ASI01)
An attack that redirects an agent away from its intended objective by embedding malicious instructions in content the agent processes. The agent's goal is "hijacked" so it works towards the attacker's objective instead of the user's.
You might be thinking "this sounds like prompt injection" and you are partly right. Agent goal hijack builds on prompt injection but goes further. A simple chatbot that gets prompt-injected might produce a rude response. An agent that gets its goal hijacked might spend hours systematically exfiltrating data, purchasing products, or modifying files because it has the autonomy to carry out multi-step plans.
The defence here is to constrain the agent's action space. If the CV review agent cannot send emails, the second instruction fails even if the goal hijack succeeds. Limit what agents can do, not just what they are told.
ASI02. Tool Misuse and Exploitation
Tool Misuse (ASI02)
When an agent uses a legitimate, authorised tool in an unintended or harmful way because its reasoning has been manipulated through crafted inputs or corrupted context.
This is one of the risks that keeps me up at night. The tools themselves are fine. The permissions are correct. The API keys are valid. But the agent's decision about how to use them has been influenced by an attacker.
ASI06. Memory and Context Poisoning
Memory and Context Poisoning (ASI06)
Attacks that corrupt an agent's persistent memory, RAG knowledge base, or conversation context so that the agent operates on false information in future interactions.
This is far worse for agents than for simple chatbots. A chatbot with a poisoned RAG store might give wrong answers to questions. An agent with poisoned memory might take wrong actions repeatedly over days or weeks, because it "remembers" false facts and uses them to make decisions.
The defence is to treat memory and RAG stores as attack surfaces. Validate what goes in. Version control your knowledge base. Monitor for unexpected changes. And never let agents write to their own long-term memory without oversight.
ASI08. Cascading Failures
Cascading Failures (ASI08)
When a small error, misinterpretation, or security breach in one agent propagates through connected agent systems, amplifying the damage at each step.
Multi-agent systems are becoming more common. You might have an agent that gathers data, another that analyses it, and a third that takes action based on the analysis. If the first agent makes an error, that error flows downstream. Each agent treats the previous agent's output as trusted input. The error grows.
The principle here is that agents in a chain should not blindly trust each other. Build validation checks between agents. Set thresholds for when a human must review. Log every inter-agent handoff so you can trace the failure path.
ASI10. Rogue Agents
Rogue Agents (ASI10)
Agents that have been compromised, misconfigured, or that exhibit emergent misaligned behaviour while continuing to appear legitimate and trustworthy to users and other systems.
This is the scenario that gets the most attention in the press, but it is real. A rogue agent does not announce itself. It continues to produce mostly correct outputs while subtly working against the user's interests. It might be the result of a supply chain attack, a fine-tuning poisoning attack, or emergent behaviour from poorly specified objectives.
4.1.2 LLM01. Prompt injection
Prompt injection is the single biggest security risk facing AI agents today. It is also, unfortunately, one that cannot be completely solved. Let me explain why.
Prompt Injection
An attack where malicious instructions are inserted into an AI system's input, causing it to ignore its original instructions and follow the attacker's commands instead.
How it works
When you interact with an AI agent, your message gets combined with the system's instructions into a single prompt. The AI has no way to distinguish between "official" instructions from the developer and "unofficial" instructions from you, the user, or from content the agent processes.
Diagram
Stage 1
System Instructions
The base instructions that define the agent's role, behaviour, and constraints
I think of system instructions as the intended rules, but they are not actually special to the model
Types of prompt injection
1. Direct Prompt Injection
The user directly inputs malicious instructions.
Modern LLMs have some resistance to obvious direct injections, but creative attackers find ways around these defences. The cat-and-mouse game continues.
2. Indirect Prompt Injection
This is more dangerous. Malicious instructions are hidden in content the AI processes, not in the user's direct input.
Diagram
Stage 1
Attacker Plants Instructions
An attacker embeds malicious instructions as white text, comments, or other hidden content in a document or website
I think this is particularly insidious because the user cannot see the attack
Why prompt injection is hard to eliminate
In practice, prompt injection is a risk you reduce, not a problem you solve once and forget. The most reliable approach is layered controls and safe failure, rather than a single "magic" detector [Source].
No clean boundary. LLMs struggle to distinguish between instructions and data. Everything is combined into one prompt. There is no direct equivalent of prepared statements.
Probabilistic behaviour. AI behaviour is not deterministic. A defence that works most of the time still fails sometimes. At scale, small failure rates become frequent incidents.
Adversarial creativity. Attackers actively adapt. They test prompts until they find a bypass. Jailbreaks often evolve faster than defensive patterns.
What this means for your agents
Never give AI agents access to truly sensitive operations without human approval
Assume any AI system can be manipulated given sufficient attacker motivation
Design systems to fail safely when manipulation occurs
A useful scoping heuristic
Be especially careful when one agent has all three of these properties at once: access to untrusted data, the ability to change state, and unrestricted tool use. That combination sharply increases the chance of harmful prompt injection or tool misuse. Treat it as a signal to reduce permissions or add human approval.
My opinion is that Meta's Rule of Two is one of the clearest mental models we have for agent security. Before you build, ask yourself: does this agent read untrusted content, change things in the real world, and use tools? If the answer is yes to all three, you need to rethink the design or add very strong guardrails.
๐ฏ Interactive. Prompt injection defence practice
This hands-on practice tool helps you understand prompt injection attack patterns and how to defend against them. Study attack examples, test your own inputs for suspicious patterns, and learn about defence in depth strategies.
Interactive lab
Prompt Injection Defense
This module includes an interactive practice component. Open the deeper tool or workspace step when you want to test the idea rather than only read it.
4.1.3 Supply Chain Vulnerabilities
Your AI agent does not exist in isolation. It depends on dozens, sometimes hundreds, of external components.
Diagram
Stage 1
Your Agent Code
The code you write that orchestrates tools and interacts with models
I think your code is only as safe as its dependencies
Supply chain problems are common in package ecosystems. Typosquatting, compromised maintainers, and malicious updates happen in every language. The safest default is to treat anything you install as potentially hostile, then design controls that limit the blast radius.
Protection measures
Pin dependency versions to specific releases you have audited
Use vulnerability scanning tools like
npm auditorpip-auditVerify package authenticity through checksums and signatures
Maintain Software Bills of Materials (SBOM) for all deployments
# Good: Pinned versions in requirements.txt
langchain==0.1.5
ollama==0.1.7
requests==2.31.0
# Bad: Unpinned versions (dangerous!)
langchain
ollama
requests4.1.4 Risk Assessment by Deployment Scenario
4.1.4 Risk Assessment by Deployment Scenario
Not all AI deployments carry the same risk. A personal assistant running on your laptop has fundamentally different risks than a customer-facing chatbot handling payment information.
Proportionate Security Approach
For Personal/Local Use:
โ Use local models (Ollama)
โ Keep software updated
โ Basic input validation
โ ๏ธ Do not connect to sensitive accounts
For Team/Business Use:
โ All of the above
โ Role-based access control
โ Audit logging
โ Regular security reviews
โ ๏ธ Limit external data access
For Public Deployment:
โ All of the above
โ Professional security audit
โ Continuous monitoring
โ Incident response plan
โ Insurance/liability coverage
โ Human-in-the-loop for critical actions
Mental model
Threats follow the tool path
Most agent threats are about untrusted input reaching powerful tools or sensitive data.
-
1
Untrusted input
-
2
Agent
-
3
Tool call
-
4
Sensitive data
-
5
Harm
Assumptions to keep in mind
- All input is untrusted. Treat user text, tool output, and retrieved documents as attacker controlled until proven otherwise.
- Tools are restricted. An agent with unrestricted tools is a privileged account with no training.
Failure modes to notice
- Prompt injection. Hidden instructions change behaviour, often by asking the agent to ignore its rules.
- Data exfiltration. The agent leaks secrets through output, logs, or tool parameters.
Key terms
- Agent Goal Hijack (ASI01)
- An attack that redirects an agent away from its intended objective by embedding malicious instructions in content the agent processes. The agent's goal is "hijacked" so it works towards the attacker's objective instead of the user's.
- Tool Misuse (ASI02)
- When an agent uses a legitimate, authorised tool in an unintended or harmful way because its reasoning has been manipulated through crafted inputs or corrupted context.
- Memory and Context Poisoning (ASI06)
- Attacks that corrupt an agent's persistent memory, RAG knowledge base, or conversation context so that the agent operates on false information in future interactions.
- Cascading Failures (ASI08)
- When a small error, misinterpretation, or security breach in one agent propagates through connected agent systems, amplifying the damage at each step.
- Rogue Agents (ASI10)
- Agents that have been compromised, misconfigured, or that exhibit emergent misaligned behaviour while continuing to appear legitimate and trustworthy to users and other systems.
- Prompt Injection
- An attack where malicious instructions are inserted into an AI system's input, causing it to ignore its original instructions and follow the attacker's commands instead.
- A useful scoping heuristic
- Be especially careful when one agent has all three of these properties at once: access to untrusted data, the ability to change state, and unrestricted tool use. That combination sharply increases the chance of harmful prompt injection or tool misuse. Treat it as a signal to reduce permissions or add human approval.
Check yourself
Quick check. Threat landscape
0 of 4 opened
Why are agents riskier than chatbots
Because they can take actions using tools, which expands the attack surface and potential impact.
What is prompt injection in one sentence
An attempt to smuggle malicious instructions into what the model treats as its prompt.
Scenario. Your agent reads web pages and can send emails. What is a realistic indirect prompt injection risk
A web page contains hidden instructions that trick the agent into emailing sensitive data to an attacker.
What is the most reliable way to reduce risk from prompt injection
Defence in depth. Limit tool permissions, require approval for high impact actions, validate inputs and outputs, and log everything.
Artefact and reflection
Artefact
A short threat list for an agent you want to build.
Reflection
Where in your work would identify major threats that are specific to agent systems. change a decision, and what evidence would make you trust that change?
Optional practice
Write down one realistic prompt injection attempt for your own workflow.