
Real-world demonstration · March 2024
In March 2024, Cognition AI demonstrated Devin, described as an autonomous AI software engineer. In the demonstration, Devin was given a GitHub issue: a bug report with a description of unexpected behaviour in an open-source repository. Devin read the issue, explored the codebase, formed a plan, wrote and ran tests, edited the code, verified the fix passed the tests, and submitted a pull request. The entire sequence ran without human intervention at any individual step.
The underlying architecture was Plan-and-Execute. Devin's planning model generated a structured task list from the issue description. A separate executor model worked through each step: reading files, running terminal commands, editing code, and interpreting test output. The planner could revise the plan when a step produced unexpected results. The executor did not need to reason about the overall goal; it only needed to execute the current step correctly.
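The planner/executor separation can be made concrete with a minimal sketch. The planner and executor below are hypothetical stubs standing in for LLM calls; in a real system both would be model invocations, and the plan-revision hook would be driven by the planner model.

```python
# Hypothetical stand-ins for the planner and executor models; a real
# implementation would make LLM calls at both points.

def stub_planner(goal: str) -> list[str]:
    # A real planner decomposes the goal into a structured task list;
    # this stub returns a fixed plan for illustration.
    return ["read issue", "locate bug", "write failing test",
            "patch code", "run tests"]

def stub_executor(step: str, context: dict) -> str:
    # A real executor runs tools (shell, editor, test runner). It only
    # needs the current step, not the overall goal.
    return f"done: {step}"

def plan_and_execute(goal: str) -> list[str]:
    plan = stub_planner(goal)
    results: list[str] = []
    for step in plan:
        outcome = stub_executor(step, {"so_far": results})
        results.append(outcome)
        # A production planner could revise the remaining plan here
        # when `outcome` signals an unexpected result.
    return results
```

The structural point: the executor's context holds only the current step and its local history, so the overall goal never competes for the executor's attention.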
Cognition reported that Devin resolved 13.86% of issues unassisted on a subset of SWE-bench (a standardised benchmark of GitHub issues). Independent evaluations subsequently found that performance on uncurated real-world tasks fell well short of what the polished demonstration implied. The pattern worked; autonomous task completion at the complexity of real software engineering issues nonetheless remained challenging. The demonstration illustrated both the power of the Plan-and-Execute pattern and the importance of distinguishing curated demonstrations from benchmark performance.
The demo showed an agent completing a multi-step software task without human intervention at each step. What structural property of the Plan-and-Execute pattern made this level of autonomous execution possible, and what were its limits?
You now understand reasoning, tools, and memory as individual components. This module introduces the patterns that compose them: ReAct, Plan-and-Execute, and reflexion. Choosing the right pattern for a task is the first real architectural decision you will make.
With the learning outcomes established, the module begins by examining why patterns exist.
When building agents, the first instinct is usually a single agent with all available tools. This works well for simple, focused tasks. It fails predictably for complex ones: the context window fills with irrelevant tool history, the agent confuses tools meant for different sub-tasks, and failure in one part blocks everything else. Patterns are the vocabulary for recognising these failure modes early and choosing an architecture that avoids them.
Six patterns cover the vast majority of agent use cases. Each pattern has a specific failure mode it avoids, a latency and cost profile, and a set of conditions under which it outperforms the alternatives. Selecting the right pattern for a task is a consequential design decision made at the start of a project, not an optimisation applied after the agent is already in production.
With the motivation in place, the discussion turns to the six patterns themselves.
Single agent. One LLM, a set of tools, a loop. The simplest and most underused pattern. Appropriate for well-defined tasks with fewer than ten tools, short to medium conversation length, and a single domain. Avoid when the tool set is large enough to cause selection confusion or when tasks can be parallelised for significant latency savings.
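The single-agent loop is small enough to sketch in full. The model below is a hypothetical stub that picks one tool call and then finishes; a real agent would replace it with an LLM that selects tools from the task and conversation history.

```python
# Minimal single-agent loop: one model, a small tool set, iterate until
# the model signals it is done. `fake_model` is a hypothetical stand-in.

TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def fake_model(task: str, history: list) -> dict:
    # A real LLM would choose the next tool call; this stub makes one
    # deterministic call, then returns a final answer.
    if not history:
        return {"tool": "upper", "args": ("hello",)}
    return {"final": history[-1]}

def run_agent(task: str, max_steps: int = 5) -> str:
    history: list = []
    for _ in range(max_steps):
        decision = fake_model(task, history)
        if "final" in decision:
            return decision["final"]
        result = TOOLS[decision["tool"]](*decision["args"])
        history.append(result)
    return history[-1]  # step budget exhausted; return best effort
```

The `max_steps` cap is the one piece of non-negotiable scaffolding: without it, a confused model can loop on tool calls indefinitely.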
Router. A lightweight routing agent reads each incoming request and delegates it to the most appropriate specialised agent. Use when there are multiple distinct task types with different tool sets, and when using cheaper or faster models for simple routing decisions saves cost. Keep the routing categories simple: a router that classifies into ten categories is likely solving the wrong problem.
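A router can be sketched as a cheap classifier in front of a dictionary of specialised agents. The categories and agents below are hypothetical; in practice the classifier would be a small, fast model rather than keyword rules.

```python
# Router sketch: a lightweight classifier delegates each request to one
# of a few specialised agents, each with its own focused tool set.

def classify(request: str) -> str:
    # Stand-in for a cheap routing model; keyword rules for illustration.
    if "refund" in request:
        return "billing"
    if "password" in request:
        return "account"
    return "general"

AGENTS = {
    "billing": lambda req: f"[billing agent] handling: {req}",
    "account": lambda req: f"[account agent] handling: {req}",
    "general": lambda req: f"[general agent] handling: {req}",
}

def route(request: str) -> str:
    return AGENTS[classify(request)](request)
```

Note that the category set is deliberately tiny: three routes, each mapping to one agent with its own tools.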
Supervisor. A supervisor agent orchestrates multiple sub-agents, assigning tasks, collecting results, and synthesising a final output. Use when a task can be broken into parallel sub-tasks and a synthesised output is needed. Note that a supervisor synthesises what its workers report. If a worker hallucinates, the supervisor may incorporate that error into the final output. Validation must be built into worker tool calls, not only into the supervisor's synthesis step.
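The point about validating at the worker boundary can be shown in a sketch. The worker and validation logic below are hypothetical; the structure to note is that `validate` runs on each worker report before synthesis, not on the synthesised output.

```python
# Supervisor sketch: assign sub-tasks to workers, validate each worker's
# report at the worker boundary, then synthesise only the trusted results.

def worker(task: str) -> dict:
    # Hypothetical worker agent: returns a result plus the evidence
    # (e.g. tool outputs) that a validator can check.
    return {"task": task, "result": f"summary of {task}", "evidence": [task]}

def validate(report: dict) -> bool:
    # Validation belongs here: reject reports with no supporting
    # evidence instead of letting a hallucination reach synthesis.
    return bool(report.get("evidence"))

def supervise(tasks: list[str]) -> str:
    reports = [worker(t) for t in tasks]
    trusted = [r for r in reports if validate(r)]
    # A real supervisor LLM would synthesise; joining stands in here.
    return " | ".join(r["result"] for r in trusted)
```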
Map-Reduce. A large collection of inputs is processed by mapping an operation across each item (potentially in parallel using async execution), then reducing the results into a single output. Use for processing many similar items such as documents, reviews, or records. Cheap map-phase models and async execution make this pattern highly cost-effective for batch workloads.
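The map phase fans out concurrently and the reduce phase aggregates; a minimal async sketch, with `classify_item` standing in for a cheap map-phase model call:

```python
import asyncio

# Map-reduce sketch: fan the map operation out across all items with
# asyncio, then reduce the labels into a single tally.

async def classify_item(item: str) -> str:
    # Hypothetical map-phase model call; the sleep marks where network
    # latency would be overlapped across items.
    await asyncio.sleep(0)
    return "long" if len(item) > 5 else "short"

async def map_reduce(items: list[str]) -> dict[str, int]:
    # Map: all classifications run concurrently.
    labels = await asyncio.gather(*(classify_item(i) for i in items))
    # Reduce: tally the labels into one output.
    counts: dict[str, int] = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Because the items are independent, total wall-clock time approaches the latency of a single call rather than the sum over all items.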
Chain. Output from one step is the input to the next. Steps run sequentially; each transforms the data before passing it forward. Use for pipeline tasks with distinct transformations and quality gates between steps. Each step can be independently tested and optimised. Failures are localised to a specific step, making debugging straightforward.
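A chain is just function composition with gates between stages. The stages below are hypothetical text-pipeline steps; the gate shows how a failure surfaces with a step-specific error rather than a vague end-of-pipeline one.

```python
# Chain sketch: each step transforms the data and passes it forward;
# a quality gate between steps localises failures.

def extract(text: str) -> str:
    return text.strip()

def normalise(text: str) -> str:
    return text.lower()

def gate(text: str) -> str:
    # Quality gate: fail fast with an error naming the failing stage.
    if not text:
        raise ValueError("empty output after normalise step")
    return text

PIPELINE = [extract, normalise, gate]

def run_chain(data: str) -> str:
    for step in PIPELINE:
        data = step(data)  # output of one step is input to the next
    return data
```

Each function in `PIPELINE` can be unit-tested and swapped independently, which is the debugging advantage the pattern buys.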
Reflection. The agent generates an initial response, critiques its own output against the goal, and revises before returning the final answer. Use for quality-critical tasks where accuracy matters more than speed. The risk: reflection adds latency and cost, and the agent may miss the same errors it made initially since the same reasoning patterns are applied to both generation and critique.
“AutoGen enables building the next generation of LLM applications using multi-agent conversations. Agents can converse with each other in forms that involve human participation, LLM-generated code execution, and multi-step workflows.”
Wu, Q. et al., 2023 - AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, arXiv:2308.08155
AutoGen introduced the multi-agent conversation model, a generalisation of the supervisor pattern. The key insight is that structuring a conversation between agents, rather than designing a monolithic agent, enables specialisation: each agent can have a focused system prompt, a restricted tool set, and a clear role. This reduces context pollution and tool confusion while enabling complex task decomposition.
With the six patterns in place, the discussion turns to how to select between them.
The pattern selection decision follows a consistent logic. Start with the simplest option that satisfies the requirements. Add complexity only when there is a specific, measurable problem: context overflow, tool confusion, the need for parallelism, or quality improvement that justifies added latency.
For tasks that are simple and focused on one domain, single agent is almost always the right starting point. For tasks with multiple distinct types coming through a single entry point, router pattern keeps tool sets clean per agent. For tasks that decompose into parallel sub-tasks with a synthesised output, supervisor pattern enables concurrent work. For processing large batches of similar items, map-reduce with async execution processes them concurrently at low per-item cost. For sequential transformation pipelines with discrete stages, chain pattern localises failures and enables per-step optimisation. For quality-critical output where a second pass consistently improves results, reflection pattern adds a self-review cycle.
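The selection logic above can be written as a single decision function. The task descriptors below are illustrative, not a fixed taxonomy; the one load-bearing choice is the fall-through default to the simplest pattern.

```python
# Pattern selection sketch: check for the specific, measurable problem
# each pattern solves, and default to the simplest option otherwise.
# The task-descriptor keys are hypothetical.

def choose_pattern(task: dict) -> str:
    if task.get("batch_of_similar_items"):
        return "map-reduce"       # many independent items, cheap map phase
    if task.get("distinct_request_types"):
        return "router"           # keep tool sets clean per agent
    if task.get("parallel_subtasks"):
        return "supervisor"       # concurrent work, synthesised output
    if task.get("sequential_stages"):
        return "chain"            # localised failures, per-step tuning
    if task.get("quality_critical"):
        return "reflection"       # second pass justifies the latency
    return "single agent"         # simplest option that satisfies requirements
```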
“Reflexion: an alternative approach to reinforcement learning for language agents. Rather than updating weights, the agent stores verbal feedback as episodic memory to guide future behaviour.”
Shinn, N. et al., 2023 - Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS 2023
This paper formalises the reflection pattern and shows that verbal self-critique stored as memory produces performance improvements comparable to traditional reinforcement learning methods, without the need for model retraining. The practical implication: a reflection step that stores its critique as context for the next iteration is more powerful than a one-shot reflection with no memory of previous attempts.
Common misconception
“Multi-agent architectures are always more capable than single-agent architectures.”
Multi-agent architectures add orchestration complexity, increased latency from additional LLM calls, and higher cost per task. A single agent with a well-designed system prompt and a focused tool set frequently outperforms a multi-agent system on the same task, because it avoids the coordination overhead and the risk of errors compounding across agent boundaries. Start with a single agent and measure performance. Add agents only when a specific, measurable limitation requires it.
Common misconception
“The reflection pattern always improves output quality.”
Reflection improves output quality when the agent's self-critique is reliable. It can reduce quality when the agent applies the same reasoning patterns to both generation and critique, meaning it cannot see the errors it made initially. This is particularly common for factual errors embedded in fluent, plausible-sounding text. Measure the improvement before committing to reflection as an architectural decision. In some domains, an independent verification tool call produces better accuracy improvement than self-reflection at lower added cost.
A content moderation team needs to classify 10,000 user posts per day as safe, review required, or remove. Each classification must be independent. Which pattern handles this most efficiently?
You are building a travel booking agent. A user asks: 'I want to book a flight to Paris.' Which pattern is most appropriate?
For posts flagged for human review in the moderation system, you want to generate a detailed explanation for the human moderator. Which pattern applies to this second step?
Anthropic Multi-agent patterns documentation
docs.anthropic.com/en/docs/build-with-claude/agents
Anthropic's official guidance on agent patterns with Claude, including orchestration and subagent delegation. Referenced throughout Section 9.2.
LangGraph documentation
langchain-ai.github.io/langgraph
Framework implementing most patterns as composable graph nodes, including the supervisor and map-reduce patterns with async execution. Referenced in Sections 9.2 and 9.3.
Wu, Q. et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
arXiv:2308.08155
Research paper introducing the supervisor and multi-agent conversation patterns. Quoted in Section 9.2 for the multi-agent conversation model.
Shinn, N. et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning
NeurIPS 2023
Formal analysis of the reflection pattern's effectiveness versus traditional reinforcement learning. Quoted in Section 9.3 for the episodic memory finding.
OWASP Top 10 for Large Language Model Applications 2025
LLM06:2025 Excessive Agency
Industry security standard covering risk from multi-agent architectures with unconstrained delegated authority. Referenced in Section 9.2 for supervisor pattern risks.
Module 9 of 25 · Core Concepts