
Real-world incident · February 2024
In February 2024, a British Columbia Civil Resolution Tribunal ruled against Air Canada in a dispute brought by passenger Jake Moffatt. Moffatt had asked the airline's AI chatbot about bereavement fares after a family member died. The chatbot told him he could book a full-price ticket and then apply for a bereavement discount within 90 days of travel.
Air Canada's actual policy did not permit retrospective applications. The policy information available to the chatbot, whether from training data or tool data, was incorrect or missing, and the chatbot had no tool call to a live policy database that could have retrieved the current, authoritative version. Moffatt booked his ticket and submitted the claim. Air Canada refused. The tribunal found Air Canada responsible for the chatbot's statements, ruling that the company could not escape liability by blaming a "separate legal entity."
The architectural failure was straightforward: the chatbot answered a policy question from parametric knowledge (its training data) rather than from a validated tool call to a live policy application programming interface (API). Tool design exists precisely to prevent this class of error. A well-designed tool schema for a refund policy query forces the agent to retrieve current policy data rather than guess from training.
The chatbot confidently stated a policy that did not exist. What should the tool execution layer have done differently, and at which point in the flow could the error have been caught?
Module 6 traced the agent's reasoning process. Reasoning alone produces text; tools produce action. This module covers the mechanism by which agents call external systems, validate inputs, handle errors, and return structured results to the reasoning loop.
With that scope established, this module begins by examining in depth what a tool schema contains.
A tool schema is a structured JSON (JavaScript Object Notation) object that describes a tool to the model. It contains three components: the tool's name, a natural-language description that guides when to use it, and an input schema defining the parameters the tool accepts. The model reads the schema at each reasoning step to decide which tool to call and with what arguments.
Anthropic and OpenAI use slightly different outer wrappers, but the core structure is identical. The Anthropic format uses an input_schema key that follows JSON Schema Draft 2020-12. The OpenAI format wraps the same content in a function key within a type: "function" object. Provider-agnostic frameworks such as LangChain translate between these formats transparently, so understanding either format is sufficient to work with both.
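To make the comparison concrete, the sketch below expresses one hypothetical policy-lookup tool, the kind of tool that could have grounded the Air Canada chatbot's answer, in both wrappers. The tool name get_refund_policy, its description, and its single topic parameter are invented for illustration.

```python
# Hypothetical policy-lookup tool, expressed in both provider wrappers.
# The name, description, and fields are invented for this example.

policy_tool_core = {
    "name": "get_refund_policy",
    "description": (
        "Returns the current, authoritative refund or fare policy for a given "
        "policy topic. Use when the user asks about refunds, discounts, or fare "
        "rules. Do not use for booking, pricing, or flight-status questions."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "topic": {
                "type": "string",
                "description": "Policy topic, e.g. 'bereavement_fares'",
            }
        },
        "required": ["topic"],
    },
}

# Anthropic format: the JSON Schema lives under an "input_schema" key.
anthropic_tool = {
    "name": policy_tool_core["name"],
    "description": policy_tool_core["description"],
    "input_schema": policy_tool_core["parameters"],
}

# OpenAI format: the same content wrapped in a "function" object.
openai_tool = {
    "type": "function",
    "function": policy_tool_core,
}
```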
The most consequential field is the description. It must answer three questions to be effective: what does this tool return, when should the model call it, and when should the model explicitly not call it. Without "do not use when" guidance, agents frequently call tools unnecessarily, adding latency and cost.
“The model uses the description field to decide whether to use the tool for a given situation. Poorly written descriptions are a common source of tool misuse.”
Anthropic Tool Use documentation, 2024 - docs.anthropic.com/en/docs/build-with-claude/tool-use, Tool design principles
This is the central principle of tool schema design. Description quality is more consequential than parameter completeness. A tool with five well-documented parameters and a vague description will be misused. A tool with two parameters and a precise description will be used correctly. Invest proportionally.
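As a concrete contrast, the two description strings below, both invented for this example, show the difference between a description that invites misuse and one that answers all three questions.

```python
# A description that will be misused: it says nothing about what is returned,
# when to call the tool, or when to leave it alone.
vague_description = "Gets policy information."

# A description that answers all three questions: what it returns, when to
# call it, and when explicitly not to call it.
precise_description = (
    "Returns the current refund or fare policy text for a given topic, "
    "together with the date it was last updated. "
    "Use when the user asks about refunds, discounts, or fare rules. "
    "Do not use for booking, payment, or flight-status questions, and do "
    "not use when the relevant policy has already been shown to the user."
)
```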
With an understanding of what a tool schema contains in place, the discussion can now turn to the tool execution flow, which builds directly on these foundations.
The model never directly calls your function. It requests a tool call by outputting structured JSON with the tool name and arguments. Your application code interprets that JSON, executes the actual function, and injects the result back into the context window as a tool result message. The model then reads the result in its next observe phase and decides whether to call another tool or generate a final response.
This separation is architecturally important. The model controls what to call and with what arguments. Your application controls whether the call is authorised, validates the arguments before executing, handles errors, and decides what to inject into the context. The boundary between model and application is the correct place to enforce security policies.
A tool result that includes a timestamp of retrieval allows the model to reason accurately about data freshness. Without it, the model cannot know whether a price or policy it retrieved is current. Including the retrieval timestamp as a standard field in every tool result is a small cost with significant accuracy benefits, particularly for agents that answer questions about time-sensitive data.
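The sketch below shows one plausible shape of that request/execute/return cycle using the Anthropic Python SDK. The get_stock_price tool, its stub implementation, and the model identifier are assumptions for the example; the structural point is that the application, not the model, executes the function and injects a timestamped result back into the context.

```python
import json
from datetime import datetime, timezone

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def get_stock_price(ticker: str) -> dict:
    """Hypothetical tool implementation; a real version would call a market-data API."""
    return {"ticker": ticker, "price": 187.42}


TOOLS = [{
    "name": "get_stock_price",
    "description": ("Returns the latest traded price for a stock ticker. "
                    "Use for current-price questions; do not use for historical data."),
    "input_schema": {
        "type": "object",
        "properties": {"ticker": {"type": "string"}},
        "required": ["ticker"],
    },
}]

messages = [{"role": "user", "content": "What is AAPL trading at right now?"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; substitute a current model id
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    # The model never executes anything itself: it only emits tool_use requests.
    tool_uses = [block for block in response.content if block.type == "tool_use"]
    if not tool_uses:
        break  # no tool requested: the text blocks are the final answer

    # Keep the model's request in the transcript, then execute in application code.
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in tool_uses:
        output = get_stock_price(**block.input)  # the application, not the model, runs this
        output["retrieved_at"] = datetime.now(timezone.utc).isoformat()  # freshness stamp
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(output),
        })
    # Inject the results back into the context as a user-role tool_result message.
    messages.append({"role": "user", "content": results})
```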
“The model outputs a tool_use content block when it decides to use a tool. Your application is responsible for executing the tool and returning the result in a tool_result block.”
Anthropic Tool Use documentation, 2024 - docs.anthropic.com/en/docs/build-with-claude/tool-use, Tool execution lifecycle
This describes the strict boundary in Anthropic's implementation: the model requests, the application executes, the application returns. This design means all validation, authorisation, and error handling live in application code, not in the model. Attempting to handle these concerns in the system prompt is fragile; handling them in the execution layer is robust.
With an understanding of the tool execution flow in place, the discussion can now turn to error handling strategies, which build directly on these foundations.
Tool calls fail in four distinct ways, each requiring a different response strategy. Invalid parameters, such as a ticker symbol that does not exist, should produce a structured error object that lets the model decide how to recover. External API failures, such as rate limits or timeouts, should trigger exponential backoff with a maximum of three retries before returning an error. Authentication failures should return an error immediately without retrying, as retrying with an expired credential wastes time. Unexpected output, such as an API returning HTML when JSON was expected, should be handled by catching the parse exception and returning a structured error to the model.
The key principle is to return structured error objects to the model rather than raising exceptions that crash the agent loop. An exception terminates the loop; a structured error message is injected into the context window as a tool result, allowing the model to read the error, understand what went wrong, and attempt recovery. The model may retry with corrected parameters, try an alternative tool, or inform the user that the requested action failed.
In production, error messages returned to the model should be sanitised. Internal stack traces may expose system internals or sensitive configuration. Log the full error server-side for debugging, but return a human-readable error message to the model that describes what failed without exposing implementation details.
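A minimal sketch of such an execution wrapper is shown below, using the requests library against a placeholder endpoint. The URL, the specific status codes handled, and the message wording are assumptions for the example; the pattern to note is that every failure path returns a structured dict, transient failures are retried with exponential backoff, and full detail is logged server-side while only a sanitised message goes back to the model.

```python
import logging
import time

import requests  # third-party HTTP client; the endpoint below is a placeholder

logger = logging.getLogger("tools")


def get_stock_price(ticker: str) -> dict:
    """Execute the tool call and always return a structured dict; never raise into the agent loop."""
    url = f"https://api.example.com/v1/quote/{ticker}"  # placeholder market-data endpoint
    for attempt in range(3):  # at most three attempts with exponential backoff
        try:
            response = requests.get(url, timeout=5)
            if response.status_code in (401, 403):
                # Authentication failure: retrying with the same credential is pointless.
                return {"error": "authentication_failed",
                        "message": "The market-data service rejected our credentials."}
            if response.status_code == 404:
                # Invalid parameter: a structured error lets the model correct the ticker.
                return {"error": "invalid_ticker",
                        "message": f"No listing found for ticker '{ticker}'."}
            if response.status_code == 429:
                time.sleep(2 ** attempt)  # rate limited: back off 1s, 2s, 4s and retry
                continue
            response.raise_for_status()
            return response.json()  # raises a ValueError subclass if the body is not JSON
        except (requests.Timeout, requests.ConnectionError):
            time.sleep(2 ** attempt)  # transient network failure: back off and retry
        except ValueError:
            # Unexpected output, e.g. HTML where JSON was expected.
            logger.exception("Unparseable response for %s", ticker)  # full detail server-side
            return {"error": "unexpected_response",
                    "message": "The market-data service returned an unreadable response."}
        except requests.HTTPError:
            logger.exception("Upstream HTTP error for %s", ticker)  # full detail server-side
            return {"error": "upstream_error",
                    "message": "The market-data service returned an error."}
    return {"error": "service_unavailable",
            "message": "The market-data service did not respond after three attempts."}
```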
Common misconception
“The model will automatically retry a failed tool call with corrected arguments.”
The model will retry only if it receives a structured error that describes what went wrong. If a tool call throws an unhandled Python exception that crashes your agent loop, the model never sees the error and cannot respond to it. Structured error returns keep the loop running. The model reads the error, updates its reasoning trace, and decides how to proceed. This is why the error-as-tool-result pattern is fundamental to strong agent design, not an optional optimisation.
With an understanding of error handling strategies in place, the discussion can now turn to input validation at the boundary, which builds directly on these foundations.
The model generates tool arguments based on probability distributions, not logic. It may produce a date in the wrong format, a number outside the valid range, or a string that references a resource that does not exist. Validating all tool inputs at the application boundary before executing is not defensive programming: it is the correct architecture for agent tool calls.
Pydantic, the standard Python validation library, is the most common choice for this boundary. A Pydantic model defined for each tool's input schema validates types, formats, and custom constraints before the underlying function is called. If validation fails, the Pydantic error object is serialised and returned as a structured tool result, giving the model precise information about what was wrong with its argument: which field, what constraint was violated, and what the model provided.
For tools that write data or trigger irreversible external actions, such as sending emails, submitting payments, or deleting records, input validation should be paired with an authorisation check. Confirm that the current user session is permitted to invoke this tool with these arguments before executing. The OWASP (Open Worldwide Application Security Project) Top 10 for Large Language Model Applications (2025) lists improper output handling (LLM05), which covers passing model-generated content such as tool arguments to downstream systems without validation, and excessive agency (LLM06) among the top risks in agentic applications.
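A minimal sketch of that boundary is shown below, assuming Pydantic v2 (the EmailStr type requires the optional email-validator dependency). The send_email_tool wrapper, the user_may_send_email stub, and the field constraints are invented for illustration.

```python
from pydantic import BaseModel, EmailStr, Field, ValidationError

# Hypothetical permission check: a real system would consult the session or a
# policy store; this stub exists only so the example is self-contained.
def user_may_send_email(user_id: str) -> bool:
    return user_id == "support-agent"

class SendEmailInput(BaseModel):
    to: EmailStr                                    # rejects bare names such as "John Smith"
    subject: str = Field(min_length=1, max_length=200)
    body: str = Field(min_length=1)

def send_email_tool(raw_args: dict, user_id: str) -> dict:
    """Authorise and validate at the boundary; always return a structured result."""
    if not user_may_send_email(user_id):
        return {"error": "not_authorised",
                "message": "This session is not permitted to send email."}
    try:
        args = SendEmailInput(**raw_args)
    except ValidationError as exc:
        # Serialise the Pydantic errors so the model sees which field failed and why.
        return {"error": "invalid_arguments", "details": exc.errors()}
    # ... hand the validated arguments to the real email service here ...
    return {"status": "sent", "to": args.to}
```

Called with a to value of "John Smith", this wrapper returns an invalid_arguments error naming the to field, rather than letting the send fail silently.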
Common misconception
“If the JSON schema defines parameter types, no additional validation is needed.”
JSON schemas describe the expected structure; they do not enforce it at execution time. A tool schema can specify that a field is a string, but the model can still generate a string containing a SQL injection fragment, an email field containing only a name, or a date string in an unrecognised format. JSON schema validation catches structural mismatches. Business logic validation, using Pydantic or equivalent, catches semantic errors. Both layers are necessary.
Your agent has a send_email tool. During testing, the agent calls it with a 'to' field containing only a name ('John Smith') rather than an email address. The email fails silently. At which point in the tool execution flow should you catch this, and what should you return?
Why is it better to return a structured error object to the model rather than raising a Python exception when a tool call fails?
The Air Canada chatbot incident involved an agent answering a policy question from training data rather than from a live tool call. Which tool design principle, if applied, would most directly have prevented this?
Anthropic Tool Use documentation, 2024
docs.anthropic.com/en/docs/build-with-claude/tool-use
Complete reference for Anthropic's tool call format, result injection, and the model-application execution boundary. Cited throughout Sections 7.1 and 7.2.
OpenAI Function Calling Guide, 2024
platform.openai.com/docs/guides/function-calling
Reference implementation for the OpenAI tool format. Patterns and principles translate across providers. Cited in Section 7.1 for format comparison.
Pydantic Documentation
docs.pydantic.dev
The standard Python library for runtime data validation. Cited in Section 7.4 as the recommended tool boundary validation approach.
JSON Schema Specification, Draft 2020-12
json-schema.org/specification
Authoritative reference for writing input_schema property definitions, including pattern constraints and enumeration validation. Cited in Section 7.1.
OWASP Top 10 for Large Language Model Applications 2025
LLM05: Improper Output Handling and LLM06: Excessive Agency
The industry security standard for LLM applications. LLM05 covers validating model-generated output, including tool arguments, before it reaches downstream systems; LLM06 covers authorisation before irreversible actions. Cited in Section 7.4.
Module 7 of 25 · Core Concepts