Module 17 of 25 · Security and Ethics

Secure Implementation

45 min read 4 outcomes Interactive quiz

By the end of this module you will be able to:

Apply input validation and sanitisation patterns to block prompt injection attempts before they reach the agent
Implement output filtering so that agent-generated results are validated before affecting downstream systems
Design a permission model that enforces least privilege across agent tools, credentials, and approval gates
Specify audit logging requirements for agent systems and explain why tamper-evidence matters

Close-up of a combination lock in front of digital code (photo on Unsplash)

Real-world research · 2023

Researchers weaponised a LangChain email agent via a single malicious email.

In 2023, researchers Kai Greshake and colleagues at the CISPA Helmholtz Centre for Information Security demonstrated an attack against a LangChain-based personal assistant agent with access to the user's email inbox. LangChain is an open-source framework for building LLM applications; at the time of the research it was among the most widely adopted frameworks for agentic systems.

The attack used indirect prompt injection. The researchers sent the target account an email containing hidden instructions: text formatted to be invisible in a typical email client but readable by the agent when it processed the message. The instructions directed the agent to forward the contents of all other emails in the inbox to an external address controlled by the researchers.

The agent complied. It called the send_email tool, which was in its permitted tool set, with the contents of the user's other emails as the payload. From an infrastructure and logging perspective, everything looked normal. No credentials were stolen. No intrusion occurred. The agent simply used the tools it was given, following instructions it received through a channel it was designed to trust.

The root vulnerability was not in the LangChain framework and not in the LLM. It was in the design: no sanitisation of retrieved email content, no restriction on which addresses the agent could forward to, and no human approval gate before irreversible actions like sending email. All three gaps are addressable at implementation time.

The agent was behaving exactly as designed. It read the email, processed the content, and called the tools it was given. What part of the design was the actual vulnerability?

Module 16 identified the threats; this module implements the defences. Input validation, sandboxing, audit logging, least-privilege tool access, and defence in depth - each control is mapped to the specific threat it mitigates.

With the learning outcomes established, this module begins by examining security is a design constraint, not a feature in depth.

17.1 Security is a design constraint, not a feature

Retrofitting security onto an agent after it is built is expensive, incomplete, and often architecturally impossible without rebuilding significant parts of the system. The patterns in this module are most effective when applied from the initial design. This is not theoretical: the incidents that have attracted public attention in 2024-2025 almost exclusively involved agents with insufficient input validation, overly broad permissions, or no human approval gates for consequential actions.

The UK Government's AI Cyber Security Code of Practice (2024) states this principle directly: security requirements should be addressed during the design phase of an AI system, not added retrospectively. The Code of Practice is the UK's voluntary framework for organisations developing or deploying AI in production, and it aligns closely with the NIST AI Risk Management Framework (AI RMF) 1.0 Manage function (specifically Manage 2.4: Risk treatments are applied, monitored, and adjusted).

This module is organised around the five controls that address the highest-priority risks from Module 16: input validation, output filtering, least privilege, human approval gates, sandboxing for code execution, and audit logging. A defence-in-depth (layered defence) approach requires all five to work together. No single control is sufficient on its own.

Defence-in-depth means that no single control is your entire defence. Input validation catches known injection patterns. System prompt hardening raises the bar for direct injection. Output validation catches schema violations. Approval gates stop the damage before it becomes irreversible. All layers are required.

With an understanding of security is a design constraint, not a feature in place, the discussion can now turn to input validation and sanitisation, which builds directly on these foundations.

17.2 Input validation and sanitisation

Input validation is the process of checking all inputs to an agent system before they are processed, to ensure they conform to expected formats and do not contain malicious content. For agents, this means validating both direct user inputs and content retrieved from external sources: emails, web pages, documents, and tool results.

User input validation should check for common injection patterns using regular expressions, enforce maximum message lengths to prevent context-window flooding attacks, and log any rejected inputs for later review. Pattern matching alone is insufficient: attackers continuously develop new phrasings that bypass known patterns. However, it is a valuable first layer that catches automated and unsophisticated attacks without incurring inference costs.

Retrieved content sanitisation is more important and more commonly neglected. Before any content retrieved from an external source enters the agent's context window, it should have HTML and XML tags stripped (which may contain hidden instructions), zero-width Unicode characters removed (a common technique for hiding injected text from human readers while keeping it visible to the model), and length truncated to a reasonable limit. Content that exceeds the limit should be truncated with a clear marker so the agent knows the document was not fully processed.

“Developers of AI systems should implement measures to validate and sanitise user inputs, and should clearly separate instructions from data in the model context.”
UK AI Cyber Security Code of Practice - Principle 5: Secure the data and model supply chain, 2024
The UK Code of Practice explicitly names input validation and instruction-data separation as required security measures. Instruction-data separation means marking retrieved content as data (not instructions) in the model context, for instance by wrapping it in XML tags and instructing the model in the system prompt that content inside those tags should be treated as data to process, not instructions to follow.

Common misconception

“Prompt injection is a model problem. When the model is smart enough, it will know the difference between instructions and data.”

Current LLMs cannot reliably distinguish between instructions and data when both are presented in natural language within the same context window. This is not a matter of model intelligence: it is an architectural property of the attention mechanism. The correct mitigation is at the application layer: validate inputs, sanitise retrieved content, and use structured formats (XML tags, JSON schemas) to provide structural cues that separate data from instructions. Do not rely on the model to make this distinction unaided.

With an understanding of input validation and sanitisation in place, the discussion can now turn to output filtering and schema validation, which builds directly on these foundations.

17.3 Output filtering and schema validation

Agent outputs should be validated before they affect downstream systems. This is particularly important for outputs that will be used as parameters in tool calls, fed into databases, or rendered in user interfaces. OWASP Agentic AI Top 10 AA03 (Unsafe Agent Output) specifically addresses the risk of unvalidated agent outputs propagating into downstream systems.

Schema validation is the most effective output filtering technique. Rather than trying to detect malicious content in free text, require the agent to produce structured output (JSON conforming to a defined schema) and validate it against that schema before use. If the agent's output does not conform, reject it and request a retry or escalate to a human reviewer.

Tools like Pydantic (for Python) make schema validation straightforward: define the expected output structure, instantiate the model with the agent's output, and catch validation errors. Pydantic validates field types, value ranges, and custom constraints (such as requiring an email address to contain an @ symbol) in a single step.

Tool call allow-lists are a complementary control. Rather than trying to detect which tool calls are malicious, define an explicit allow-list of which tools the agent is permitted to call, and block everything not on the list. This is the OWASP AA02 (Excessive Agency) mitigation applied at the output layer: the agent cannot call a tool that is not on its permitted list, even if it has been successfully prompted to try.

Validating agent output as a separate step before it reaches a tool or database means that even a successful prompt injection only gets as far as producing malformed output. It does not reach the file system, the email server, or the payment API.

With an understanding of output filtering and schema validation in place, the discussion can now turn to the principle of least privilege, which builds directly on these foundations.

17.4 The principle of least privilege

The principle of least privilege states that every component should have the minimum permissions necessary to perform its function, and nothing more. For agents, this means giving each agent access only to the specific tools, data sources, and API credentials it needs for its defined task. An agent that summarises documents should not have access to an email sending tool. An agent that reads order status should not have payment processing credentials.

In practice, least privilege for agents covers three dimensions:

Tool scope: each agent instance should be initialised with only the tools required for its task. Do not create a single general-purpose agent with all available tools and trust the model to use them appropriately.
Credential scope: create separate API credentials for each agent role, with only the permissions that role requires. A customer support agent's database credentials should connect to a read-only replica, not the primary read-write database.
Data scope: if the agent reads documents, limit it to documents within the scope of the task. Do not give a summarisation agent access to the entire document store.

This approach directly implements OWASP Agentic AI AA02 (Excessive Agency) mitigation and aligns with NIST AI RMF Manage 2.4, which requires that AI risk treatments be applied and monitored, including limiting the capability of AI systems to the minimum required for their task.

“AI system components, including models and tools, should be assigned only the minimum level of privilege required to perform their intended function. The scope of agent actions should be limited by design.”
UK AI Cyber Security Code of Practice - Principle 3: Model the threats to your AI system, 2024
The UK Code of Practice connects threat modelling directly to privilege minimisation. The reasoning is straightforward: the threat model for an agent with fifteen tools includes all fifteen tools as potential misuse vectors. Reducing the tool set to three tools reduces the threat surface by eighty percent before any other control is applied.

Common misconception

“Least privilege reduces agent capability too much. A useful agent needs access to many tools.”

Least privilege does not mean a minimal agent. It means the right tools for the right task. A support agent, a research agent, and a finance agent can each have exactly the tools they need. Role-based tool sets, combined with human approval gates for high-risk actions, allow capable agents to operate safely. The alternative is a single over-privileged agent where any successful attack has maximum impact. The cost of least privilege is marginal. The cost of excessive agency, when exploited, is not.

With an understanding of the principle of least privilege in place, the discussion can now turn to human approval gates, which builds directly on these foundations.

17.5 Human approval gates

For irreversible or high-impact actions, require explicit human confirmation before the agent executes the tool. This pattern is sometimes called human-in-the-loop (HITL) control, and it is the primary mitigation for OWASP AA04 (Overreliance on Agents).

The key design decision is which actions require approval. Classify every tool in the agent's tool set by risk level. Read-only operations on reversible data (searching a knowledge base, retrieving an order status) can typically run without approval. Write operations with reversible consequences (creating a support ticket, drafting an email for human review) may require a confirmation step. Irreversible or financially consequential operations (deleting records, sending email without review, processing payments, bulk communications) should always require explicit human approval before execution.

In production, the approval mechanism typically routes the pending action to a dashboard, a Slack channel, or an email, with a time-limited window for the approver to confirm or reject. If the window expires without a response, the action should be cancelled and the agent should notify the user that the action requires human review.

Human approval gates were absent in the 2023 LangChain email exfiltration scenario described at the start of this module. Had a human been required to approve thesend_email tool call before execution, the attack would have been stopped at that point, and the exfiltration would have been surfaced as an anomalous approval request.

With an understanding of human approval gates in place, the discussion can now turn to sandboxing code execution, which builds directly on these foundations.

17.6 Sandboxing code execution

If your agent can execute code (a common capability in data analysis and automation agents), that code must run in an isolated environment that cannot affect the host system or network. Sandboxing is the practice of running untrusted code inside a restricted execution environment that limits what the code can do.

The strongest approach is Docker containerisation. Each code execution request runs in a fresh container that has no network access, a read-only filesystem except for a temporary writable directory, and explicit CPU and memory limits. The container is destroyed after the code finishes executing. This means that even if an attacker successfully injects malicious code, the code cannot reach the network, cannot read files outside the temporary directory, and cannot persist anything to the host system.

Key parameters for a secure code execution container include: --network noneto prevent any network communication, --read-only to make the filesystem immutable, --memory 128m and --cpus 0.5 to prevent resource exhaustion attacks, and a subprocess timeout to kill executions that run longer than the permitted window. The container image should be a minimal base image with only the required runtime installed and no additional utilities that could be used to escalate privileges or probe the host environment.

With an understanding of sandboxing code execution in place, the discussion can now turn to audit logging, which builds directly on these foundations.

17.7 Audit logging

Every agent action that affects the real world should be logged with sufficient detail to reconstruct what happened and why. OWASP Agentic AI AA10 (Inadequate Audit Logging) ranks this in the top ten because without thorough logs, incident response and forensic analysis after a security event are nearly impossible.

A complete audit log entry for an agent tool call should include: a timestamp in UTC (Coordinated Universal Time), the session identifier, the agent name and version, the tool called, a summary of the tool inputs (but not secrets or personal data in plaintext), the result status (success or error), and the identity of any human approver. This provides the information needed to answer: what did the agent do, when, on whose behalf, using what inputs, and who authorised it?

Audit logs must be tamper-evident. If an attacker can delete or modify log entries, they can cover the evidence of what the agent was manipulated into doing. In practice, tamper-evidence is achieved by writing logs to append-only storage (such as an immutable log service or a write-once-read-many storage bucket) and using cryptographic chaining (each log entry includes a hash of the previous entry) so that any deletion or modification breaks the chain.

Audit logs should not contain secrets, full personal data, or complete tool result payloads. Logging these creates a high-value target: a single log file would contain all the sensitive data the agent processed. Instead, log summaries and truncated representations, and ensure the log retention and access policy complies with applicable data protection law (such as the UK GDPR or EU GDPR).

“Organisations should implement logging and monitoring of AI system activity to detect anomalous behaviour, support incident response, and enable post-incident forensic analysis.”
NIST AI Risk Management Framework 1.0 - Manage 2.4: Risk treatments are applied, monitored, and adjusted
The NIST AI RMF treats logging not as a compliance checkbox but as an operational necessity for the Manage function. Without logs, you cannot monitor whether risk treatments are working, you cannot detect when an agent is behaving anomalously, and you cannot conduct the post-incident analysis needed to improve your defences. The monitoring requirement also implies real-time alerting: a spike in tool call frequency or a pattern of approval rejections should trigger an automated alert.

17.8 Check your understanding

Your customer support agent reads customer emails to understand their issues and can call these tools: search_knowledge_base, get_customer_account, send_email, process_refund, delete_account. A security review identifies issues. Which two tools should require human approval before execution, and why?

Your agent reads customer emails to understand their issues. What sanitisation step is most important to apply to the email content before it enters the agent's context window?

A security audit of your agent's audit logs reveals that successful tool calls are not logged, only errors. What is the most significant risk this creates?

You are designing an agent that analyses data uploaded by users and can execute Python code to perform calculations. What is the most important sandboxing requirement?

Key takeaways

Security must be designed in from the start. Retrofitting input validation, permission controls, and approval gates onto a deployed agent is expensive and incomplete.
Validate all inputs: user messages for injection patterns, retrieved content for HTML tags, zero-width characters, and excessive length before any of it enters the agent's context window.
Validate all outputs: agent-generated JSON should pass schema validation before being used as tool call parameters. Tool call allow-lists prevent calls to tools not on the permitted list.
Apply least privilege at three levels: tool scope (only the tools the agent needs), credential scope (read-only where possible, scoped API keys), and data scope (only the data relevant to the task).
Require human approval for irreversible or financially consequential actions: payments, deletions, bulk communications, and any external transmission of sensitive data.
Log every tool call, including successful ones, with session ID, tool name, input summary, result status, and approver identity. Write to tamper-evident append-only storage.

Standards and sources cited in this module

OWASP Top 10 for Large Language Model Applications 2025
LLM01: Prompt Injection; LLM06: Sensitive Information Disclosure
The foundational risk list for LLM-based systems. LLM01 and LLM06 underpin the input validation and output filtering controls in Sections 17.2 and 17.3.
OWASP Top 10 for Agentic AI Applications (2025)
AA02: Excessive Agency; AA03: Unsafe Agent Output; AA04: Overreliance on Agents; AA10: Inadequate Audit Logging
Agent-specific risks cited throughout Sections 17.3 to 17.7. AA02 motivates least privilege; AA04 motivates human approval gates; AA10 motivates thorough audit logging.
UK AI Cyber Security Code of Practice
Principle 3: Model the threats to your AI system; Principle 5: Secure the data and model supply chain
UK government voluntary framework for securing AI systems in production. Quoted in Sections 17.2 and 17.4 to show how input validation and least privilege are framed as regulatory expectations, not optional best practices.
NIST AI Risk Management Framework (AI RMF 1.0), January 2023
Manage 2.4: Risk treatments are applied, monitored, and adjusted
US government framework for AI risk governance. Manage 2.4 is cited in Sections 17.1 and 17.7 for the monitoring and logging requirements and the principle that security controls must be applied and monitored, not just documented.
Greshake, K. et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
arXiv:2302.12173
The research paper behind the opening case study. Demonstrates the LangChain email exfiltration attack and establishes that the vulnerability lies in design choices (no content sanitisation, no approval gates), not in any single component.
Docker Security Best Practices
Run containers with minimal privileges; use read-only filesystems; disable network access for untrusted code
Primary reference for the sandboxing pattern in Section 17.6. Docker's documentation specifies the flags and configuration options used to create isolated, resource-limited containers for untrusted code execution.

Previous: The Threat Landscape Next: Ethics and Responsible AI

Module 17 of 25 in Security and Ethics

Loading lesson...

Module 17 of 25 · Security and Ethics

Secure Implementation

45 min read 4 outcomes Interactive quiz

By the end of this module you will be able to:

Apply input validation and sanitisation patterns to block prompt injection attempts before they reach the agent
Implement output filtering so that agent-generated results are validated before affecting downstream systems
Design a permission model that enforces least privilege across agent tools, credentials, and approval gates
Specify audit logging requirements for agent systems and explain why tamper-evidence matters

Real-world research · 2023

Researchers weaponised a LangChain email agent via a single malicious email.

The agent was behaving exactly as designed. It read the email, processed the content, and called the tools it was given. What part of the design was the actual vulnerability?

With the learning outcomes established, this module begins by examining security is a design constraint, not a feature in depth.

17.1 Security is a design constraint, not a feature

With an understanding of security is a design constraint, not a feature in place, the discussion can now turn to input validation and sanitisation, which builds directly on these foundations.

17.2 Input validation and sanitisation

“Developers of AI systems should implement measures to validate and sanitise user inputs, and should clearly separate instructions from data in the model context.”
UK AI Cyber Security Code of Practice - Principle 5: Secure the data and model supply chain, 2024
The UK Code of Practice explicitly names input validation and instruction-data separation as required security measures. Instruction-data separation means marking retrieved content as data (not instructions) in the model context, for instance by wrapping it in XML tags and instructing the model in the system prompt that content inside those tags should be treated as data to process, not instructions to follow.

Common misconception

“Prompt injection is a model problem. When the model is smart enough, it will know the difference between instructions and data.”

With an understanding of input validation and sanitisation in place, the discussion can now turn to output filtering and schema validation, which builds directly on these foundations.

17.3 Output filtering and schema validation

With an understanding of output filtering and schema validation in place, the discussion can now turn to the principle of least privilege, which builds directly on these foundations.

17.4 The principle of least privilege

In practice, least privilege for agents covers three dimensions:

Tool scope: each agent instance should be initialised with only the tools required for its task. Do not create a single general-purpose agent with all available tools and trust the model to use them appropriately.
Credential scope: create separate API credentials for each agent role, with only the permissions that role requires. A customer support agent's database credentials should connect to a read-only replica, not the primary read-write database.
Data scope: if the agent reads documents, limit it to documents within the scope of the task. Do not give a summarisation agent access to the entire document store.

“AI system components, including models and tools, should be assigned only the minimum level of privilege required to perform their intended function. The scope of agent actions should be limited by design.”
UK AI Cyber Security Code of Practice - Principle 3: Model the threats to your AI system, 2024
The UK Code of Practice connects threat modelling directly to privilege minimisation. The reasoning is straightforward: the threat model for an agent with fifteen tools includes all fifteen tools as potential misuse vectors. Reducing the tool set to three tools reduces the threat surface by eighty percent before any other control is applied.

Common misconception

“Least privilege reduces agent capability too much. A useful agent needs access to many tools.”

With an understanding of the principle of least privilege in place, the discussion can now turn to human approval gates, which builds directly on these foundations.

17.5 Human approval gates

With an understanding of human approval gates in place, the discussion can now turn to sandboxing code execution, which builds directly on these foundations.

17.6 Sandboxing code execution

With an understanding of sandboxing code execution in place, the discussion can now turn to audit logging, which builds directly on these foundations.

17.7 Audit logging

“Organisations should implement logging and monitoring of AI system activity to detect anomalous behaviour, support incident response, and enable post-incident forensic analysis.”
NIST AI Risk Management Framework 1.0 - Manage 2.4: Risk treatments are applied, monitored, and adjusted
The NIST AI RMF treats logging not as a compliance checkbox but as an operational necessity for the Manage function. Without logs, you cannot monitor whether risk treatments are working, you cannot detect when an agent is behaving anomalously, and you cannot conduct the post-incident analysis needed to improve your defences. The monitoring requirement also implies real-time alerting: a spike in tool call frequency or a pattern of approval rejections should trigger an automated alert.

17.8 Check your understanding

Your agent reads customer emails to understand their issues. What sanitisation step is most important to apply to the email content before it enters the agent's context window?

A security audit of your agent's audit logs reveals that successful tool calls are not logged, only errors. What is the most significant risk this creates?

You are designing an agent that analyses data uploaded by users and can execute Python code to perform calculations. What is the most important sandboxing requirement?

Key takeaways

Security must be designed in from the start. Retrofitting input validation, permission controls, and approval gates onto a deployed agent is expensive and incomplete.
Validate all inputs: user messages for injection patterns, retrieved content for HTML tags, zero-width characters, and excessive length before any of it enters the agent's context window.
Validate all outputs: agent-generated JSON should pass schema validation before being used as tool call parameters. Tool call allow-lists prevent calls to tools not on the permitted list.
Apply least privilege at three levels: tool scope (only the tools the agent needs), credential scope (read-only where possible, scoped API keys), and data scope (only the data relevant to the task).
Require human approval for irreversible or financially consequential actions: payments, deletions, bulk communications, and any external transmission of sensitive data.
Log every tool call, including successful ones, with session ID, tool name, input summary, result status, and approver identity. Write to tamper-evident append-only storage.

Standards and sources cited in this module

OWASP Top 10 for Large Language Model Applications 2025
LLM01: Prompt Injection; LLM06: Sensitive Information Disclosure
The foundational risk list for LLM-based systems. LLM01 and LLM06 underpin the input validation and output filtering controls in Sections 17.2 and 17.3.
OWASP Top 10 for Agentic AI Applications (2025)
AA02: Excessive Agency; AA03: Unsafe Agent Output; AA04: Overreliance on Agents; AA10: Inadequate Audit Logging
Agent-specific risks cited throughout Sections 17.3 to 17.7. AA02 motivates least privilege; AA04 motivates human approval gates; AA10 motivates thorough audit logging.
UK AI Cyber Security Code of Practice
Principle 3: Model the threats to your AI system; Principle 5: Secure the data and model supply chain
UK government voluntary framework for securing AI systems in production. Quoted in Sections 17.2 and 17.4 to show how input validation and least privilege are framed as regulatory expectations, not optional best practices.
NIST AI Risk Management Framework (AI RMF 1.0), January 2023
Manage 2.4: Risk treatments are applied, monitored, and adjusted
US government framework for AI risk governance. Manage 2.4 is cited in Sections 17.1 and 17.7 for the monitoring and logging requirements and the principle that security controls must be applied and monitored, not just documented.
Greshake, K. et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
arXiv:2302.12173
The research paper behind the opening case study. Demonstrates the LangChain email exfiltration attack and establishes that the vulnerability lies in design choices (no content sanitisation, no approval gates), not in any single component.
Docker Security Best Practices
Run containers with minimal privileges; use read-only filesystems; disable network access for untrusted code
Primary reference for the sandboxing pattern in Section 17.6. Docker's documentation specifies the flags and configuration options used to create isolated, resource-limited containers for untrusted code execution.

Previous: The Threat Landscape Next: Ethics and Responsible AI

Module 17 of 25 in Security and Ethics