
Real-world incident · May 2023
In May 2023, a New York law firm submitted a legal brief in a personal injury case against Avianca airline. The brief cited six precedent cases to support its arguments. When Avianca's lawyers checked the citations, they could not find the cases. They asked the court to compel the plaintiffs to produce the decisions. The plaintiffs' lawyers returned with an extraordinary admission: the cases had been generated by ChatGPT and did not exist. The AI had fabricated case names, docket numbers, courts, and procedural histories in convincing detail.
The lawyers involved had asked ChatGPT to confirm the cases were real. It confirmed they were. This is a direct consequence of how LLMs (large language models) work: they predict probable text continuations, and a plausible-sounding confirmation is statistically likely given the question. Federal Judge P. Kevin Castel sanctioned the lawyers and the firm in June 2023, ordering them to pay over USD 5,000 in fines and to notify the judges named in the fabricated decisions.
The root cause was not a missing feature: it was a misunderstanding of what an LLM actually does. An LLM does not look up cases; it generates text that looks like a case citation. Correct usage requires using temperature 0 to reduce variation, grounding responses in retrieved documents using a tool call, and validating output against a verified source before acting on it. This module teaches you how API calls work, which is the prerequisite for understanding how to design that validation.
The model generated citations that were plausible, formatted correctly, and completely fabricated. What parameter or design pattern would have caught this before it reached a federal judge?
With your environment configured and your API key securely stored, you are ready to make your first call to a language model. This module teaches you the mechanics of that call - request structure, token economics, temperature, and multi-turn conversations - so that every agent you build later rests on solid operational understanding.
With the learning outcomes established, this module begins by examining the API request structure in depth.
Every API call to a modern LLM follows the same basic structure: send a request containing a model name, a list of messages, and optional parameters; the model generates a response; you receive the response and extract the text. Understanding each component of this request is the foundation for building agents that behave reliably in production.
The Anthropic SDK for Python wraps this structure in client.messages.create(). The required parameters are model (which specific model to call), max_tokens (the maximum number of tokens to generate), and messages (a list of dictionaries with role and content keys). The response object contains the generated text at message.content[0].text and token usage information at message.usage.
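The shape of that call in code, as a minimal sketch: it assumes the anthropic Python package is installed and ANTHROPIC_API_KEY is set in the environment, and the model name is a placeholder to be replaced with whichever model you are targeting.

```python
# Minimal sketch of the request structure described above.
# Assumes: `pip install anthropic` and ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-sonnet-latest",   # placeholder: which specific model to call
    max_tokens=300,                     # maximum number of tokens to generate
    messages=[
        {"role": "user", "content": "Summarise what a context window is."},
    ],
)

print(message.content[0].text)   # the generated text
print(message.usage)             # input_tokens and output_tokens
```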
The equivalent call structure in the OpenAI SDK uses client.chat.completions.create() with the same conceptual parameters. Most agent frameworks abstract over both APIs using a common interface, which is why understanding the underlying request structure matters more than memorising any one SDK's syntax.
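For comparison, a sketch of the same request with the OpenAI SDK, assuming the openai package is installed and OPENAI_API_KEY is set; the model name is again a placeholder. The usage field names differ (prompt_tokens and completion_tokens) but the concepts are identical.

```python
# The conceptually equivalent call with the OpenAI SDK.
# Assumes: `pip install openai` and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",     # placeholder model name
    max_tokens=300,
    messages=[
        {"role": "user", "content": "Summarise what a context window is."},
    ],
)

print(completion.choices[0].message.content)  # the generated text
print(completion.usage)                       # prompt_tokens and completion_tokens
```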
“The messages API is designed to support multi-turn conversation as a first-class concern. Each message has a role, either user or assistant, and a content field containing the message text. The model generates the next assistant message given the provided history.”
Anthropic Messages API reference - docs.anthropic.com/en/api/messages
The design decision to centre the API on messages rather than a single prompt reflects how agents actually work: multi-turn interactions where each step builds on the previous. Designing your code around the messages list from the start, rather than retrofitting it later, is the correct approach for agent development.
With an understanding of the API request structure in place, the discussion can now turn to understanding tokens and cost, which builds directly on these foundations.
A token is the basic unit of text that a language model processes. In English, a token is roughly 4 characters or 0.75 words. "Unbelievable" is about 3 tokens; "AI" is 1 token. Pricing, context limits, and rate limits are all measured in tokens. Developers who do not track token usage in development are routinely surprised by production costs.
A single LLM call costs a small fraction of a cent. An agent that makes 20 calls in a loop, each with a long context window (system prompt plus conversation history plus tool results), can accumulate costs quickly. The Anthropic SDK returns token usage with every response: message.usage.input_tokens for the tokens you sent and message.usage.output_tokens for the tokens generated. Log both in development to calibrate your expectations before scaling.
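A sketch of that logging habit, combining the usage field with an estimated cost per call. The per-million-token prices below are the figures used in this module's quiz and should be checked against anthropic.com/pricing before relying on them; the model name is a placeholder.

```python
# Log token usage and estimate cost per call during development.
# Prices are assumed figures -- verify against anthropic.com/pricing.
import anthropic

INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens (assumed)
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens (assumed)

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-latest",   # placeholder model name
    max_tokens=300,
    messages=[{"role": "user", "content": "Explain tokens in one paragraph."}],
)

usage = response.usage
cost = (usage.input_tokens * INPUT_PRICE_PER_MTOK
        + usage.output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000
print(f"input={usage.input_tokens} output={usage.output_tokens} cost=${cost:.6f}")
```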
The context window is the maximum number of tokens a model can process in a single call, including both input and output. Claude 3.5 Sonnet supports 200,000 tokens. This is generous, but an agent that accumulates long tool results over many steps can fill it. Strategies for managing context length, such as summarising older turns, are covered in later modules. For now, track token counts and set conservative max_tokens limits while experimenting.
“Every token in the input and output contributes to cost. Agents that loop over multiple steps accumulate token costs across each turn. Production agents should include token budgeting logic to prevent unexpected charges.”
Anthropic Documentation - docs.anthropic.com/en/docs/build-with-claude/agents, Token management
Token budgeting is not optional in production. An agent given a complex task with no step limit or token budget can run for hundreds of turns before failing or producing output. The Avianca case involved a single call; production agents with loops can generate orders of magnitude more tokens. Measure first, then set limits.
Common misconception
“You only pay for the text the model generates. The prompt you send is free.”
Both input tokens and output tokens incur cost. Input costs are typically lower per token than output costs, but the system prompt, conversation history, and tool results in a long agent interaction can represent thousands of input tokens per call. Track both input_tokens and output_tokens from the usage field in every response to understand the true cost of your agent's operation.
With an understanding of tokens and cost in place, the discussion can now turn to temperature and sampling parameters, which builds directly on these foundations.
Temperature is a parameter between 0 and 1 that controls the randomness of the model's output. Temperature 0 makes the model near-deterministic, always favouring the most probable next token. Temperature 1 introduces significant variation. The same prompt at temperature 0 produces almost identical output on repeated calls; at temperature 1, outputs can differ substantially.
For agent tasks, including any scenario where the model must produce structured JSON for a tool call, use temperature 0. High temperature increases the chance of the model generating malformed JSON, deviating from an expected format, or including text before or after the JSON object that breaks parsing. The Avianca case involved a model generating confident, plausible-sounding fabrications: a property of statistical next-token prediction that temperature 0 does not eliminate but does reduce significantly.
Temperature 0.3 to 0.5 suits summarisation and analysis tasks where some variation in phrasing is acceptable but consistency matters. Temperature 0.7 to 1.0 is appropriate for creative writing and brainstorming where varied outputs are desirable. Set temperature explicitly in every API call rather than relying on the provider's default, which may differ between models.
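A sketch of setting temperature explicitly for the two ends of the range described above; the model name and prompts are illustrative.

```python
# Set temperature explicitly in every call rather than relying on defaults.
import anthropic

client = anthropic.Anthropic()

# Structured output that drives an agent loop: temperature 0 for near-determinism.
structured = client.messages.create(
    model="claude-3-5-sonnet-latest",   # placeholder model name
    max_tokens=200,
    temperature=0,
    messages=[{"role": "user",
               "content": "Return JSON with keys 'category' and 'urgency' for: "
                          "'My invoice is wrong and I need it fixed today.'"}],
)

# Creative output for a human reader: higher temperature for varied phrasing.
creative = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=200,
    temperature=0.9,
    messages=[{"role": "user",
               "content": "Brainstorm five taglines for a neighbourhood bakery."}],
)
```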
Common misconception
“Higher temperature makes agents more creative and better at handling edge cases.”
Higher temperature increases statistical variance in token selection. For agent tool calls that require structured output, variance is a defect, not a feature. A malformed JSON tool call causes an error; the agent must handle the exception, retry, or fail. Use temperature 0 for all agent tasks. Creativity is appropriate for the content the agent produces for humans; it is not appropriate for the machine-readable output that drives the agent loop.
With an understanding of temperature and sampling parameters in place, the discussion can now turn to system prompts, which builds directly on these foundations.
The system prompt sets the context and persona for the model before any user message. It is the mechanism by which agents receive their instructions, tool descriptions, and behavioural constraints. In the Anthropic SDK, pass it as the system parameter to client.messages.create(). It is not part of the messages list; it occupies a separate, privileged position in the request.
An effective system prompt defines four things: scope (what the agent will and will not help with), tone (formal, concise, friendly), format (maximum response length, use of lists or prose), and fallback behaviour (what the agent says when a request is outside its scope). A weak system prompt such as "You are a helpful assistant" provides none of these constraints and produces inconsistent behaviour at scale.
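A sketch of a system prompt covering all four elements, passed through the separate system parameter rather than the messages list. The persona and wording are illustrative, not a recommended production prompt.

```python
# Example system prompt defining scope, tone, format, and fallback behaviour.
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = (
    "You are a billing support assistant for Acme Co.\n"               # scope
    "Answer only questions about invoices, payments, and refunds.\n"   # scope
    "Be concise and professional; reply in at most three sentences.\n" # tone + format
    "If a request is outside billing, say you cannot help with that "
    "and direct the user to general support."                          # fallback
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",   # placeholder model name
    max_tokens=300,
    system=SYSTEM_PROMPT,               # separate parameter, not a message
    messages=[{"role": "user", "content": "Why was I charged twice this month?"}],
)
print(response.content[0].text)
```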
System prompts are counted as input tokens on every call. A 500-token system prompt on an agent that makes 50 calls in a session contributes 25,000 input tokens before any user message or tool result is counted. This is expected and acceptable; just include it in your token budget calculations.
With an understanding of system prompts in place, the discussion can now turn to multi-turn conversations, which builds directly on these foundations.
A conversation is represented as a list of messages with alternating user and assistant roles. The API is stateless: the model has no memory between calls. To continue a conversation, you must append both the model's previous response and the next user message to the history, then send the full list with every call. The model reads the entire history each time and generates the next assistant message.
This design choice is intentional. Stateless APIs are simpler to scale, easier to audit, and give the application full control over what context the model receives. The trade-off is that the application must manage conversation state. This responsibility sits in your code, not in the provider's infrastructure.
The practical implementation is a list called conversation_history that starts empty. Each turn appends a user message, calls the API with the full list, receives a response, and appends the assistant message. The function accepts the current history and a new user message, returns the assistant response and the updated history. Keeping the history management explicit and visible makes debugging straightforward: you can print the full history at any point to see exactly what the model received.
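A sketch of that function using the Anthropic SDK; the names send_message and conversation_history are illustrative, not part of any SDK.

```python
# Explicit conversation-state management: the caller owns the history
# and passes the full list on every call, because the API is stateless.
import anthropic

client = anthropic.Anthropic()

def send_message(conversation_history, user_text,
                 model="claude-3-5-sonnet-latest", max_tokens=500):
    """Append the user turn, call the API with the full history,
    append the assistant turn, and return (reply_text, updated_history)."""
    history = conversation_history + [{"role": "user", "content": user_text}]
    response = client.messages.create(
        model=model,          # placeholder model name
        max_tokens=max_tokens,
        messages=history,     # send the entire history every call
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply, history

# Usage: the history starts empty and is threaded through each turn.
conversation_history = []
reply, conversation_history = send_message(conversation_history, "Hello!")
reply, conversation_history = send_message(conversation_history, "What did I just say?")
```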
“The Chat Completions API is designed to be stateless: every request must include the full conversation history. This gives the caller complete control over what context the model has access to, at the cost of managing that state in the application layer.”
OpenAI API Reference - platform.openai.com/docs/api-reference/chat, Overview
Both Anthropic and OpenAI made the same design choice: stateless message lists. This pattern is now standard across all major LLM providers. Building your conversation management code to this pattern means it transfers across providers with minimal changes, which matters when building multi-provider agent systems.
In the Avianca case, lawyers used ChatGPT to research legal precedents and received fabricated case citations. Which combination of design decisions would have prevented this failure?
Your agent processes 1,000 customer queries per day. Each query uses 800 input tokens and 200 output tokens on average with Claude Sonnet 4.6 at USD 3 per million input tokens and USD 15 per million output tokens. What is the approximate daily cost?
The following Python function loses all conversation memory after each message. What is the specific missing element?
You are building a customer support agent that must return structured JSON containing the customer's issue category, urgency level, and suggested resolution. A colleague recommends temperature 0.7 to 'give the agent flexibility.' What is the risk?
Anthropic Messages API reference
docs.anthropic.com/en/api/messages
Complete parameter reference for client.messages.create, including model, max_tokens, system, messages, temperature, and the usage object. Referenced in Sections 5.1 and 5.4 as the authoritative specification.
Anthropic Python SDK repository
github.com/anthropics/anthropic-sdk-python
Source code and usage examples for the Python client used in this module. Referenced in Section 5.1 to show the canonical code pattern for calling client.messages.create with proper error handling.
OpenAI Chat Completions API reference
platform.openai.com/docs/api-reference/chat
OpenAI's equivalent API. Quoted in Section 5.5 to establish that the stateless multi-turn message design is consistent across major providers, making the pattern transferable.
Anthropic pricing page
anthropic.com/pricing
Current token pricing for all Claude models. Referenced in Section 5.2 and the quiz to show that both input and output tokens incur costs and that output tokens are priced higher per million.
Mata v. Avianca (2023), US District Court for the Southern District of New York
Case 1:22-cv-01461-PKC, Order of June 22, 2023
The documented case in which lawyers were sanctioned for submitting AI-fabricated legal citations. Used as the opening case study to illustrate the consequences of using an LLM without retrieval grounding or output validation.