Loading lesson...
Loading lesson...

Real-world research · 2023
In 2023, researchers Kai Greshake and colleagues at the CISPA Helmholtz Centre for Information Security demonstrated an attack against a LangChain-based personal assistant agent with access to the user's email inbox. LangChain is an open-source framework for building LLM applications; at the time of the research it was among the most widely adopted frameworks for agentic systems.
The attack used indirect prompt injection. The researchers sent the target account an email containing hidden instructions: text formatted to be invisible in a typical email client but readable by the agent when it processed the message. The instructions directed the agent to forward the contents of all other emails in the inbox to an external address controlled by the researchers.
The agent complied. It called the send_email tool, which was in its permitted tool set, with the contents of the user's other emails as the payload. From an infrastructure and logging perspective, everything looked normal. No credentials were stolen. No intrusion occurred. The agent simply used the tools it was given, following instructions it received through a channel it was designed to trust.
The root vulnerability was not in the LangChain framework and not in the LLM. It was in the design: no sanitisation of retrieved email content, no restriction on which addresses the agent could forward to, and no human approval gate before irreversible actions like sending email. All three gaps are addressable at implementation time.
The agent was behaving exactly as designed. It read the email, processed the content, and called the tools it was given. What part of the design was the actual vulnerability?
Module 16 identified the threats; this module implements the defences. Input validation, sandboxing, audit logging, least-privilege tool access, and defence in depth - each control is mapped to the specific threat it mitigates.
With the learning outcomes established, this module begins by examining security is a design constraint, not a feature in depth.
Retrofitting security onto an agent after it is built is expensive, incomplete, and often architecturally impossible without rebuilding significant parts of the system. The patterns in this module are most effective when applied from the initial design. This is not theoretical: the incidents that have attracted public attention in 2024-2025 almost exclusively involved agents with insufficient input validation, overly broad permissions, or no human approval gates for consequential actions.
The UK Government's AI Cyber Security Code of Practice (2024) states this principle directly: security requirements should be addressed during the design phase of an AI system, not added retrospectively. The Code of Practice is the UK's voluntary framework for organisations developing or deploying AI in production, and it aligns closely with the NIST AI Risk Management Framework (AI RMF) 1.0 Manage function (specifically Manage 2.4: Risk treatments are applied, monitored, and adjusted).
This module is organised around the five controls that address the highest-priority risks from Module 16: input validation, output filtering, least privilege, human approval gates, sandboxing for code execution, and audit logging. A defence-in-depth (layered defence) approach requires all five to work together. No single control is sufficient on its own.
Defence-in-depth means that no single control is your entire defence. Input validation catches known injection patterns. System prompt hardening raises the bar for direct injection. Output validation catches schema violations. Approval gates stop the damage before it becomes irreversible. All layers are required.
With an understanding of security is a design constraint, not a feature in place, the discussion can now turn to input validation and sanitisation, which builds directly on these foundations.
Input validation is the process of checking all inputs to an agent system before they are processed, to ensure they conform to expected formats and do not contain malicious content. For agents, this means validating both direct user inputs and content retrieved from external sources: emails, web pages, documents, and tool results.
User input validation should check for common injection patterns using regular expressions, enforce maximum message lengths to prevent context-window flooding attacks, and log any rejected inputs for later review. Pattern matching alone is insufficient: attackers continuously develop new phrasings that bypass known patterns. However, it is a valuable first layer that catches automated and unsophisticated attacks without incurring inference costs.
Retrieved content sanitisation is more important and more commonly neglected. Before any content retrieved from an external source enters the agent's context window, it should have HTML and XML tags stripped (which may contain hidden instructions), zero-width Unicode characters removed (a common technique for hiding injected text from human readers while keeping it visible to the model), and length truncated to a reasonable limit. Content that exceeds the limit should be truncated with a clear marker so the agent knows the document was not fully processed.
“Developers of AI systems should implement measures to validate and sanitise user inputs, and should clearly separate instructions from data in the model context.”
UK AI Cyber Security Code of Practice - Principle 5: Secure the data and model supply chain, 2024
The UK Code of Practice explicitly names input validation and instruction-data separation as required security measures. Instruction-data separation means marking retrieved content as data (not instructions) in the model context, for instance by wrapping it in XML tags and instructing the model in the system prompt that content inside those tags should be treated as data to process, not instructions to follow.
Common misconception
“Prompt injection is a model problem. When the model is smart enough, it will know the difference between instructions and data.”
Current LLMs cannot reliably distinguish between instructions and data when both are presented in natural language within the same context window. This is not a matter of model intelligence: it is an architectural property of the attention mechanism. The correct mitigation is at the application layer: validate inputs, sanitise retrieved content, and use structured formats (XML tags, JSON schemas) to provide structural cues that separate data from instructions. Do not rely on the model to make this distinction unaided.
With an understanding of input validation and sanitisation in place, the discussion can now turn to output filtering and schema validation, which builds directly on these foundations.
Agent outputs should be validated before they affect downstream systems. This is particularly important for outputs that will be used as parameters in tool calls, fed into databases, or rendered in user interfaces. OWASP Agentic AI Top 10 AA03 (Unsafe Agent Output) specifically addresses the risk of unvalidated agent outputs propagating into downstream systems.
Schema validation is the most effective output filtering technique. Rather than trying to detect malicious content in free text, require the agent to produce structured output (JSON conforming to a defined schema) and validate it against that schema before use. If the agent's output does not conform, reject it and request a retry or escalate to a human reviewer.
Tools like Pydantic (for Python) make schema validation straightforward: define the expected output structure, instantiate the model with the agent's output, and catch validation errors. Pydantic validates field types, value ranges, and custom constraints (such as requiring an email address to contain an @ symbol) in a single step.
Tool call allow-lists are a complementary control. Rather than trying to detect which tool calls are malicious, define an explicit allow-list of which tools the agent is permitted to call, and block everything not on the list. This is the OWASP AA02 (Excessive Agency) mitigation applied at the output layer: the agent cannot call a tool that is not on its permitted list, even if it has been successfully prompted to try.
Validating agent output as a separate step before it reaches a tool or database means that even a successful prompt injection only gets as far as producing malformed output. It does not reach the file system, the email server, or the payment API.
With an understanding of output filtering and schema validation in place, the discussion can now turn to the principle of least privilege, which builds directly on these foundations.
The principle of least privilege states that every component should have the minimum permissions necessary to perform its function, and nothing more. For agents, this means giving each agent access only to the specific tools, data sources, and API credentials it needs for its defined task. An agent that summarises documents should not have access to an email sending tool. An agent that reads order status should not have payment processing credentials.
In practice, least privilege for agents covers three dimensions:
This approach directly implements OWASP Agentic AI AA02 (Excessive Agency) mitigation and aligns with NIST AI RMF Manage 2.4, which requires that AI risk treatments be applied and monitored, including limiting the capability of AI systems to the minimum required for their task.
“AI system components, including models and tools, should be assigned only the minimum level of privilege required to perform their intended function. The scope of agent actions should be limited by design.”
UK AI Cyber Security Code of Practice - Principle 3: Model the threats to your AI system, 2024
The UK Code of Practice connects threat modelling directly to privilege minimisation. The reasoning is straightforward: the threat model for an agent with fifteen tools includes all fifteen tools as potential misuse vectors. Reducing the tool set to three tools reduces the threat surface by eighty percent before any other control is applied.
Common misconception
“Least privilege reduces agent capability too much. A useful agent needs access to many tools.”
Least privilege does not mean a minimal agent. It means the right tools for the right task. A support agent, a research agent, and a finance agent can each have exactly the tools they need. Role-based tool sets, combined with human approval gates for high-risk actions, allow capable agents to operate safely. The alternative is a single over-privileged agent where any successful attack has maximum impact. The cost of least privilege is marginal. The cost of excessive agency, when exploited, is not.
With an understanding of the principle of least privilege in place, the discussion can now turn to human approval gates, which builds directly on these foundations.
For irreversible or high-impact actions, require explicit human confirmation before the agent executes the tool. This pattern is sometimes called human-in-the-loop (HITL) control, and it is the primary mitigation for OWASP AA04 (Overreliance on Agents).
The key design decision is which actions require approval. Classify every tool in the agent's tool set by risk level. Read-only operations on reversible data (searching a knowledge base, retrieving an order status) can typically run without approval. Write operations with reversible consequences (creating a support ticket, drafting an email for human review) may require a confirmation step. Irreversible or financially consequential operations (deleting records, sending email without review, processing payments, bulk communications) should always require explicit human approval before execution.
In production, the approval mechanism typically routes the pending action to a dashboard, a Slack channel, or an email, with a time-limited window for the approver to confirm or reject. If the window expires without a response, the action should be cancelled and the agent should notify the user that the action requires human review.
Human approval gates were absent in the 2023 LangChain email exfiltration scenario described at the start of this module. Had a human been required to approve thesend_email tool call before execution, the attack would have been stopped at that point, and the exfiltration would have been surfaced as an anomalous approval request.
With an understanding of human approval gates in place, the discussion can now turn to sandboxing code execution, which builds directly on these foundations.
If your agent can execute code (a common capability in data analysis and automation agents), that code must run in an isolated environment that cannot affect the host system or network. Sandboxing is the practice of running untrusted code inside a restricted execution environment that limits what the code can do.
The strongest approach is Docker containerisation. Each code execution request runs in a fresh container that has no network access, a read-only filesystem except for a temporary writable directory, and explicit CPU and memory limits. The container is destroyed after the code finishes executing. This means that even if an attacker successfully injects malicious code, the code cannot reach the network, cannot read files outside the temporary directory, and cannot persist anything to the host system.
Key parameters for a secure code execution container include: --network noneto prevent any network communication, --read-only to make the filesystem immutable, --memory 128m and --cpus 0.5 to prevent resource exhaustion attacks, and a subprocess timeout to kill executions that run longer than the permitted window. The container image should be a minimal base image with only the required runtime installed and no additional utilities that could be used to escalate privileges or probe the host environment.
With an understanding of sandboxing code execution in place, the discussion can now turn to audit logging, which builds directly on these foundations.
Every agent action that affects the real world should be logged with sufficient detail to reconstruct what happened and why. OWASP Agentic AI AA10 (Inadequate Audit Logging) ranks this in the top ten because without thorough logs, incident response and forensic analysis after a security event are nearly impossible.
A complete audit log entry for an agent tool call should include: a timestamp in UTC (Coordinated Universal Time), the session identifier, the agent name and version, the tool called, a summary of the tool inputs (but not secrets or personal data in plaintext), the result status (success or error), and the identity of any human approver. This provides the information needed to answer: what did the agent do, when, on whose behalf, using what inputs, and who authorised it?
Audit logs must be tamper-evident. If an attacker can delete or modify log entries, they can cover the evidence of what the agent was manipulated into doing. In practice, tamper-evidence is achieved by writing logs to append-only storage (such as an immutable log service or a write-once-read-many storage bucket) and using cryptographic chaining (each log entry includes a hash of the previous entry) so that any deletion or modification breaks the chain.
Audit logs should not contain secrets, full personal data, or complete tool result payloads. Logging these creates a high-value target: a single log file would contain all the sensitive data the agent processed. Instead, log summaries and truncated representations, and ensure the log retention and access policy complies with applicable data protection law (such as the UK GDPR or EU GDPR).
“Organisations should implement logging and monitoring of AI system activity to detect anomalous behaviour, support incident response, and enable post-incident forensic analysis.”
NIST AI Risk Management Framework 1.0 - Manage 2.4: Risk treatments are applied, monitored, and adjusted
The NIST AI RMF treats logging not as a compliance checkbox but as an operational necessity for the Manage function. Without logs, you cannot monitor whether risk treatments are working, you cannot detect when an agent is behaving anomalously, and you cannot conduct the post-incident analysis needed to improve your defences. The monitoring requirement also implies real-time alerting: a spike in tool call frequency or a pattern of approval rejections should trigger an automated alert.
Your customer support agent reads customer emails to understand their issues and can call these tools: search_knowledge_base, get_customer_account, send_email, process_refund, delete_account. A security review identifies issues. Which two tools should require human approval before execution, and why?
Your agent reads customer emails to understand their issues. What sanitisation step is most important to apply to the email content before it enters the agent's context window?
A security audit of your agent's audit logs reveals that successful tool calls are not logged, only errors. What is the most significant risk this creates?
You are designing an agent that analyses data uploaded by users and can execute Python code to perform calculations. What is the most important sandboxing requirement?
OWASP Top 10 for Large Language Model Applications 2025
LLM01: Prompt Injection; LLM06: Sensitive Information Disclosure
The foundational risk list for LLM-based systems. LLM01 and LLM06 underpin the input validation and output filtering controls in Sections 17.2 and 17.3.
OWASP Top 10 for Agentic AI Applications (2025)
AA02: Excessive Agency; AA03: Unsafe Agent Output; AA04: Overreliance on Agents; AA10: Inadequate Audit Logging
Agent-specific risks cited throughout Sections 17.3 to 17.7. AA02 motivates least privilege; AA04 motivates human approval gates; AA10 motivates thorough audit logging.
UK AI Cyber Security Code of Practice
Principle 3: Model the threats to your AI system; Principle 5: Secure the data and model supply chain
UK government voluntary framework for securing AI systems in production. Quoted in Sections 17.2 and 17.4 to show how input validation and least privilege are framed as regulatory expectations, not optional best practices.
NIST AI Risk Management Framework (AI RMF 1.0), January 2023
Manage 2.4: Risk treatments are applied, monitored, and adjusted
US government framework for AI risk governance. Manage 2.4 is cited in Sections 17.1 and 17.7 for the monitoring and logging requirements and the principle that security controls must be applied and monitored, not just documented.
arXiv:2302.12173
The research paper behind the opening case study. Demonstrates the LangChain email exfiltration attack and establishes that the vulnerability lies in design choices (no content sanitisation, no approval gates), not in any single component.
Docker Security Best Practices
Run containers with minimal privileges; use read-only filesystems; disable network access for untrusted code
Primary reference for the sandboxing pattern in Section 17.6. Docker's documentation specifies the flags and configuration options used to create isolated, resource-limited containers for untrusted code execution.
Module 17 of 25 in Security and Ethics