
Real-world failure · 2023
Throughout 2023, teams using LangChain in production reported a recurring class of failures: agents that worked correctly in development began failing in production when conversation histories grew long. The failures were not consistent errors; they were degraded behaviour, incorrect tool calls, and outputs that ignored prior context. The root cause, in most documented cases, was memory management.
LangChain's ConversationBufferMemory and ConversationSummaryMemory classes had subtle bugs in how they handled token counting and history truncation. In development, with short conversations, the bugs were invisible. In production, with conversations running to hundreds of messages, the memory layer would either overflow the context window without warning, truncate history in ways that removed critical earlier context, or produce malformed message lists that caused the model to misinterpret the conversation structure.
The episode demonstrated two architectural lessons. First, framework abstractions over memory management must be treated as critical infrastructure, not as convenience utilities, and must be tested explicitly at the token and message level. Second, agent state (what the orchestration code tracks) and conversation history (what the model reads) are distinct concerns that should be managed by separate, independently testable components. Conflating them in a single framework class produces a single point of failure for both.
If the memory management layer in your framework silently corrupts the agent state at scale, how would you detect this, and what does it reveal about the relationship between framework abstraction and production reliability?
Design patterns tell you how an agent reasons; architecture tells you how the system around it is built. This module covers state management, control flow, error handling, and the graph-based execution model used by LangGraph and similar frameworks.
With the learning outcomes established, this module begins by examining the framework landscape in depth.
An agent framework provides abstractions for building AI agents: state management, tool integration, agent-to-agent communication, and orchestration. Frameworks reduce boilerplate but introduce opinions about how agents should be structured. Choosing a framework before understanding which pattern you need is a common mistake. Framework choice should follow pattern choice.
LangGraph, released in 2024, models agent workflows as directed graphs of nodes. Each node is a Python function or LLM call. Edges define which node runs next, with conditional edges enabling branching. State is shared across all nodes using a typed dictionary. LangGraph is production-ready and is the best-supported choice for complex, stateful workflows, human-in-the-loop approval gates, and workflows that require checkpointing (saving state to resume after interruption).
CrewAI models agents as a team of roles: a researcher, a writer, a reviewer. Each agent has a role description, a goal, and a set of tools. The framework manages task assignment and inter-agent communication. CrewAI suits tasks that map naturally to human team collaboration, where the role metaphor helps structure the system prompt and tool assignment. It is simpler to set up than LangGraph for team-style tasks but less expressive for arbitrary workflow topologies.
The Anthropic Agent SDK is a lightweight Python library for Claude-native agents. It provides minimal abstraction over the Anthropic API and is suitable for teams that want direct control over the agent loop without framework overhead. Recommended for projects built entirely on Claude where simplicity and auditability are priorities.
“LangGraph is a library for building stateful, multi-actor applications with LLMs, used to create agent and multi-agent workflows. LangGraph models agent workflows as graphs with nodes for actions and edges for transitions.”
LangGraph documentation, 2024 - langchain-ai.github.io/langgraph, Introduction
The graph model is the key architectural insight. By representing a workflow as a directed graph, LangGraph makes the control flow explicit and auditable. Any edge can be made conditional. Any node can read and write shared state. Cycles are supported for iterative loops. This gives engineers the same tools for reasoning about agent control flow that they have for reasoning about data pipelines.
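As a rough illustration, the graph execution model can be sketched in a few lines of plain Python. This is a simplified stand-in, not LangGraph's actual API: nodes are functions over shared state, edges name the next node (a callable edge is conditional), and a sentinel value ends the run.

```python
# Minimal sketch of graph execution: nodes are functions over shared state,
# edges name the next node (a callable makes the edge conditional), "END" stops.
def run_graph(nodes, edges, state, start, max_steps=20):
    current = start
    for _ in range(max_steps):  # guard against runaway cycles
        state = {**state, **nodes[current](state)}
        nxt = edges[current]
        current = nxt(state) if callable(nxt) else nxt
        if current == "END":
            break
    return state

# A one-node cycle that increments a counter until it reaches 3.
nodes = {"work": lambda s: {"count": s["count"] + 1}}
edges = {"work": lambda s: "END" if s["count"] >= 3 else "work"}
final = run_graph(nodes, edges, {"count": 0}, "work")
```

Even this toy version shows why the model is auditable: the full control flow is visible in the `edges` mapping, and the cycle is bounded by an explicit step limit.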
With an understanding of the framework landscape in place, the discussion can now turn to state management in LangGraph, which builds directly on these foundations.
In LangGraph, all nodes share a typed state object defined as a TypedDict or Pydantic model. Each node receives the current state, performs its work, and returns a partial update to the state. The framework merges the update into the shared state before passing it to the next node. Fields annotated with operator.add are append-only lists: each node's output is appended to the existing list rather than replacing it. This is the standard pattern for conversation history within a LangGraph workflow.
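A minimal sketch of such a state schema, with a hand-rolled merge to make the append-only behaviour concrete. The field names are illustrative, and in real LangGraph the framework performs this merge internally:

```python
import operator
from typing import Annotated, TypedDict

class AgentState(TypedDict):
    # Annotated with operator.add: node updates are appended, not replaced.
    messages: Annotated[list, operator.add]
    # Plain field: each node update simply overwrites the previous value.
    step_count: int

# Simplified stand-in for the framework's merge of a node's partial update.
def merge_update(state: AgentState, update: dict) -> AgentState:
    merged = dict(state)
    for key, value in update.items():
        if key == "messages":
            merged[key] = operator.add(state["messages"], value)  # append
        else:
            merged[key] = value
    return merged  # type: ignore[return-value]

state: AgentState = {"messages": [{"role": "user", "content": "hi"}], "step_count": 0}
state = merge_update(
    state, {"messages": [{"role": "assistant", "content": "hello"}], "step_count": 1}
)
```

After the merge, `messages` holds both entries while `step_count` has been replaced, which is exactly the distinction the annotation expresses.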
Conditional edges route to different nodes based on a function that reads the current state. This is how branching logic is implemented: a routing function checks whether the last message contains a tool call, whether the step count has exceeded a limit, or whether a flag in the state indicates a specific branch. The routing function returns a string key; the conditional edge map translates that key to the next node name.
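A routing function of this kind might look as follows. This is a hedged sketch with illustrative names, not code from the LangGraph documentation:

```python
# Hypothetical routing function for a conditional edge (names are illustrative).
def route_next(state: dict) -> str:
    if state["step_count"] >= state["max_steps"]:
        return "finish"                    # step budget exhausted
    last_message = state["messages"][-1]
    if last_message.get("tool_calls"):
        return "call_tools"                # model requested a tool
    return "finish"                        # plain answer: end the workflow

# The conditional edge map translates the returned key into a node name.
edge_map = {"call_tools": "tool_node", "finish": "final_node"}
```

Keeping the routing logic in a small pure function makes branching behaviour unit-testable without running the model at all.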
LangGraph supports checkpointing through a checkpointer object attached at compilation time. A checkpointer saves the full state at each node transition to a persistence layer, such as an in-memory store for development or a PostgreSQL database for production. This enables two important features: resuming a workflow after an interruption, and human-in-the-loop approval where the workflow pauses at a defined node and waits for operator input before continuing.
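The core idea of a checkpointer can be sketched in plain Python: snapshot the full state at each transition, keyed by a thread identifier, so the latest snapshot can be reloaded after an interruption. This is an illustrative in-memory stand-in, not LangGraph's checkpointer interface:

```python
# Sketch of an in-memory checkpointer: save a full state snapshot at each
# node transition, keyed by thread id, so a workflow can resume later.
class InMemoryCheckpointer:
    def __init__(self):
        self._store: dict = {}

    def save(self, thread_id: str, step: int, state: dict) -> None:
        self._store[(thread_id, step)] = dict(state)  # snapshot, not a reference

    def load_latest(self, thread_id: str):
        steps = [s for t, s in self._store if t == thread_id]
        return self._store[(thread_id, max(steps))] if steps else None

ckpt = InMemoryCheckpointer()
ckpt.save("thread-1", 0, {"phase": "draft", "messages": []})
ckpt.save("thread-1", 1, {"phase": "review", "messages": ["draft ready"]})
resumed = ckpt.load_latest("thread-1")
```

Swapping the dictionary for a PostgreSQL table is what turns this development pattern into the production persistence layer described above.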
“The interrupt primitive pauses graph execution at a specific node and surfaces the current state to an external actor. When the actor provides input, execution resumes from the same point with the updated state.”
LangGraph documentation, 2024 - langchain-ai.github.io/langgraph, Human-in-the-loop patterns
This is a critical production feature. Agents that can send emails, make purchases, delete records, or trigger external processes must have a human approval gate before irreversible actions. LangGraph's interrupt primitive provides this at the architectural level: the workflow pauses, the operator reviews the proposed action, and execution resumes only when the operator approves, rejects, or modifies the plan. This is the correct architecture for safe autonomous agents in production.
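The pause-review-resume flow can be sketched as follows. This is a simplified stand-in for the interrupt primitive, with hypothetical names; in real LangGraph the pause is handled by the framework at a graph node rather than by a blocking call:

```python
# Sketch of a human approval gate: the workflow surfaces the proposed action,
# blocks on an operator decision, and executes only on explicit approval.
def approval_gate(proposed_action: dict, get_decision) -> dict:
    decision = get_decision(proposed_action)  # execution pauses here
    if decision == "approve":
        return {"status": "executed", "action": proposed_action}
    if decision == "reject":
        return {"status": "aborted", "action": proposed_action}
    raise ValueError(f"unknown decision: {decision!r}")

result = approval_gate(
    {"type": "send_email", "to": "customer@example.com"},
    lambda action: "approve",  # stands in for a real operator review UI
)
```

The essential property is that the irreversible action sits strictly after the decision point, so no code path can reach it without an explicit approval.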
With an understanding of state management in LangGraph in place, the discussion can now turn to agent state versus conversation history, which builds directly on these foundations.
Agent state and conversation history serve different purposes and should be managed separately. Conflating them, as the 2023 LangChain memory management incidents demonstrated, produces a single point of failure for both concerns.
Conversation history is the list of user and assistant messages that the LLM reads in the context window. Its purpose is to give the model the information it needs to decide what to do next. It grows with the conversation, must be managed to stay within the context window limit, and is consumed by the model on every LLM call.
Agent state is the full set of variables that represent where the agent is in executing a task. It may include the conversation history, but also progress flags, step counters, extracted data, intermediate results, error counts, and retry limits. The orchestration code reads agent state to determine which node runs next. The LLM reads state only when the orchestration code explicitly injects relevant parts into the context window.
A document processing agent might track pages processed, extraction errors, summary chunks generated so far, and a retry counter. None of this needs to appear in the LLM's context window; it is orchestration state. The LLM only needs the current page content and the task instruction. Keeping orchestration state out of the context window reduces token cost, reduces context window pressure, and prevents the model from being confused by state metadata it does not need to reason about.
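The boundary can be made explicit in code: keep orchestration state in one structure and derive the model's context from it with a narrow projection function. The field names here are illustrative:

```python
# Hypothetical document-processing agent state (field names are illustrative).
agent_state = {
    "pages_processed": 12,      # orchestration state: never sent to the model
    "extraction_errors": 1,
    "retry_count": 0,
    "current_page_text": "Quarterly revenue rose 8% on higher subscriptions.",
}

# Only the task-relevant slice is injected into the model's context window.
def build_model_messages(state: dict) -> list[dict]:
    return [{
        "role": "user",
        "content": f"Summarise this page:\n\n{state['current_page_text']}",
    }]

messages = build_model_messages(agent_state)
```

Because the projection is a single function, it can be tested to guarantee that counters and error logs never leak into the prompt.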
Common misconception
“Agent state and conversation history are the same thing stored in different formats.”
Conversation history is what the LLM reads; agent state is what the orchestration code uses to control flow. A step counter, an error log, and a progress flag are agent state. The LLM does not need to read them to decide what to do next; the orchestration code reads them to decide which node runs next. Injecting all agent state into the context window wastes tokens and can confuse the model. Define a clear boundary between orchestration state and model context from the start of the project.
Common misconception
“LangGraph is the correct framework for every production agent.”
LangGraph is the most expressive production framework for stateful workflows, but it introduces real complexity: a graph mental model, typed state schemas, conditional edge routing, and checkpoint management. For agents that are genuinely simple, such as a customer support bot with five tools and no complex branching, LangGraph adds overhead with no benefit. The Anthropic SDK or a direct API loop is often faster to build, easier to reason about, and more maintainable. Choose complexity proportional to the problem.
With an understanding of agent state versus conversation history in place, the discussion can now turn to async execution patterns, which builds directly on these foundations.
Synchronous agents execute tool calls one at a time. If an agent needs to search three news sources, look up financial data, and retrieve documents from a knowledge base, synchronous execution takes the sum of all five operation latencies. If each operation takes one second, the total is five seconds. Async execution runs independent operations concurrently: all five complete in roughly one second, the time of the slowest individual operation.
Python's asyncio library provides the primitives for async tool execution. asyncio.gather accepts a list of coroutines and runs them concurrently, returning their results in order. LangGraph supports async natively: nodes can be defined as async functions, and the framework manages the event loop.
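A minimal sketch of the fan-out, with placeholder source names and a sleep standing in for network latency:

```python
import asyncio

# Sketch: three independent lookups run concurrently (sources are placeholders).
async def search(source: str, query: str) -> str:
    await asyncio.sleep(0.01)  # stands in for network latency
    return f"{source}: results for {query!r}"

async def research(query: str) -> list[str]:
    sources = ["news_a", "news_b", "news_c"]
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(search(s, query) for s in sources))

results = asyncio.run(research("interest rates"))
```

The total wall-clock time is roughly one sleep interval rather than three, and the ordered results make it straightforward to map each answer back to its source.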
Not all operations can be parallelised. Operations with dependencies must remain sequential: if step two needs the output of step one to determine its parameters, it must wait. Misapplying parallelism to dependent operations produces race conditions and incorrect results. The correct approach is to identify the dependency graph of the operations before deciding which can run concurrently. Independent lookups, such as searching multiple sources for the same query, are always candidates for async execution.
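The mixed pattern looks like this in practice: the dependent step is awaited sequentially, and only the independent fan-out is gathered. All names and values here are illustrative:

```python
import asyncio

# Sketch: step two depends on step one's output, so it stays sequential,
# while the independent metric lookups at the end run concurrently.
async def resolve_ticker(company: str) -> str:
    await asyncio.sleep(0.01)
    return f"TICK-{company[:3].upper()}"

async def fetch_metric(ticker: str, metric: str) -> str:
    await asyncio.sleep(0.01)
    return f"{ticker} {metric}: 42"

async def pipeline(company: str) -> list[str]:
    ticker = await resolve_ticker(company)   # dependency: must run first
    metrics = ["revenue", "margin", "headcount"]
    return await asyncio.gather(             # independent: run concurrently
        *(fetch_metric(ticker, m) for m in metrics)
    )

report = asyncio.run(pipeline("Acme Corp"))
```

Drawing the dependency graph first (ticker resolution feeds every metric lookup; the lookups do not feed each other) is what justifies this split between `await` and `gather`.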
The human-in-the-loop approval pattern is not optional for production agents that trigger irreversible external actions. An agent that sends emails, submits payments, or modifies production data should always surface its proposed action to a human operator before executing. LangGraph's interrupt primitive is the standard implementation.
You are building a content marketing agent: it researches a topic, drafts an article, presents it to a human editor, revises based on feedback, then publishes via a CMS API. Which framework feature is essential for the editor approval step?
The research phase involves searching three different news sources. Which execution pattern should you use for those searches?
Which statement best describes the difference between agent state and conversation history?
A colleague proposes using LangGraph for a simple FAQ chatbot that answers questions from a fixed knowledge base using five tools. What is your assessment?
LangGraph documentation
langchain-ai.github.io/langgraph
Official reference for graph model, state schema, conditional routing, checkpointing, and human-in-the-loop patterns. Quoted in Sections 10.1 and 10.2.
CrewAI documentation
docs.crewai.com
Framework for role-based multi-agent teams. Compared to LangGraph in Section 10.1 for use case differentiation.
Anthropic Agent SDK documentation
docs.anthropic.com/en/docs/build-with-claude/agents
Lightweight SDK for Claude-native agents. Referenced in Section 10.1 as the recommended choice for simple, auditable Claude projects.
Python asyncio documentation
docs.python.org/3/library/asyncio.html
Official reference for async/await patterns, asyncio.gather, and concurrent task execution. Cited in Section 10.4 for async tool execution implementation.
Model Context Protocol Specification
modelcontextprotocol.io/specification
The open protocol for standardised tool and resource exposure to LLM agents. Referenced in Section 10.1 as the emerging standard for framework-agnostic tool integration.
Module 10 of 25 · Core Concepts