
Real-world incident · Summer 2023
In mid-2023, AutoGPT and BabyAGI attracted wide attention as some of the first publicly accessible autonomous agent frameworks. Users assigned complex, open-ended goals, such as "research and write a market analysis report." What researchers began documenting almost immediately was a recurring failure mode: agents would enter reasoning loops, repeatedly reformulating the same sub-goal without making progress.
One documented pattern involved an agent tasked with finding the best Python library for a given purpose. The agent searched for options, found three candidates, searched for comparisons between them, found that each source recommended a different one, searched for more comparisons, and continued this cycle until the token budget was exhausted. The agent had no mechanism to detect stagnation, no step limit, and no plan that could distinguish "I need more information" from "I am going in circles."
The root cause was architectural: the agent loop had no explicit planning phase separating goal decomposition from action selection. Understanding how agents think, and specifically how reasoning is made explicit and bounded, is the starting point for building systems that terminate reliably.
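The safeguards those early frameworks lacked can be sketched in a few lines. The following is a hypothetical illustration, not code from AutoGPT or BabyAGI: a hard step limit bounds the loop, and a set of previously seen actions detects stagnation when the agent re-issues an identical tool call.

```python
# Hypothetical sketch of two missing safeguards: a step limit and
# stagnation detection via a set of previously issued (tool, args) pairs.
def run_agent(choose_action, max_steps=10):
    """choose_action(step) returns a (tool_name, args) tuple, or None when done."""
    seen = set()
    for step in range(max_steps):
        action = choose_action(step)
        if action is None:
            return "done"
        if action in seen:          # identical tool call again: going in circles
            return "stagnated"
        seen.add(action)
    return "step_limit_reached"    # budget exhausted without completion

# An agent that keeps re-issuing the same search is caught on its second step:
looping = lambda step: ("search", "best python http library")
print(run_agent(looping))  # → stagnated
```

Exact-match detection is deliberately crude; real systems also compare near-duplicate queries, but even this minimal check would have terminated the documented loop.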
If an AI agent has access to search, code execution, and file tools, what stops it reasoning in circles indefinitely? And how would you know when it had gone wrong?
The Foundations stage gave you the vocabulary, the environment, and your first API call. This stage goes deeper: you will learn how agents reason, what tools they use, how they remember, and which architectural patterns govern their behaviour. This module starts with the reasoning step itself.
With the learning outcomes established, this module begins by examining the observe-think-act loop in detail.
The agent loop introduced in the Foundations modules has a precise internal structure. At the start of each iteration, the model reads its full context: the system prompt, all prior conversation messages, any tool results from the previous step, and any content retrieved from memory. The model has no internal state between iterations. Every observation is a fresh reading of a growing context window.
In the think phase, the model reasons about the current state of the task. With chain-of-thought (CoT) prompting, this reasoning is made explicit as text before any action is chosen. Writing a plan makes better action choices more likely because intermediate reasoning tokens improve the probability distribution for subsequent tokens. This is not metaphorical: Wei et al. (2022) measured this effect empirically across arithmetic, commonsense, and multi-step tasks.
In the act phase, the agent either calls a tool by outputting structured JSON, or generates a final response. Tool calls are not executed by the model; the application layer executes them and injects the result back into the context for the next observe phase. This cycle continues until the task is complete or a safety limit is reached.
The observe-think-act loop is stateless. Every context window is a complete picture of what the agent knows. If information is not in the context, the agent cannot act on it.
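The loop described above can be sketched as a short function. This is a minimal illustration under stated assumptions: `call_model` and `run_tool` are hypothetical stand-ins for an LLM API client and the application-layer tool executor, and the message format is simplified.

```python
# Minimal observe-think-act skeleton. call_model and run_tool are
# hypothetical: the first wraps an LLM API, the second executes tools.
def agent_loop(system_prompt, user_goal, call_model, run_tool, max_steps=8):
    context = [{"role": "system", "content": system_prompt},
               {"role": "user", "content": user_goal}]
    for _ in range(max_steps):
        # Observe + think: the model re-reads the FULL context every
        # iteration; there is no hidden state carried between steps.
        reply = call_model(context)
        context.append({"role": "assistant",
                        "content": reply["content"],
                        "tool_call": reply.get("tool_call")})
        if reply.get("tool_call") is None:   # act: final response, loop ends
            return reply["content"]
        # Act: the application layer, not the model, executes the tool
        # and injects the result for the next observe phase.
        result = run_tool(reply["tool_call"])
        context.append({"role": "tool", "content": result})
    return "step limit reached"
```

Note that `context` only ever grows: each iteration's "observation" is a fresh reading of this list, which is why anything absent from it is invisible to the agent.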
With the mechanics of the observe-think-act loop in place, the discussion turns to chain-of-thought reasoning, which makes the think phase explicit.
Chain-of-thought (CoT) reasoning is a prompting technique that elicits step-by-step reasoning from a large language model (LLM) before it produces a final answer. Introduced by Wei et al. at Google in 2022, it significantly improves performance on tasks requiring arithmetic, commonsense reasoning, and multi-step problem solving. Zero-shot CoT, the version that appends "Let's think step by step" to a prompt, works because modern LLMs have encountered enough reasoning patterns in training data to activate this behaviour without examples.
For agents, CoT is typically embedded in the system prompt or elicited via instructions that require the agent to state what it currently knows, what it still needs, and which tool it will call next. This written reasoning trace serves two purposes: it improves the action that follows, and it provides an audit trail for debugging when something goes wrong.
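A system-prompt fragment that elicits this kind of trace might look like the following. The wording is illustrative, not a fixed standard; any phrasing that forces the agent to state knowns, unknowns, and the next tool call serves the same purpose.

```python
# Illustrative system-prompt fragment embedding CoT instructions.
# The exact wording is an example, not a required format.
COT_INSTRUCTIONS = """\
Before every tool call, write a short REASONING block:
1. What do I know so far?
2. What do I still need to find out?
3. Which tool will I call next, and why?
Then emit the tool call. If nothing is missing, write the final answer."""
```

Because the trace is plain text in the context window, it doubles as the audit trail mentioned above: when a run goes wrong, the reasoning blocks show where the agent's model of the task diverged from reality.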
“We explore chain-of-thought prompting, a simple yet effective technique for eliciting chain of thought reasoning via a few chain of thought demonstrations as exemplars in prompting.”
Wei et al., 2022 - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, NeurIPS 2022
This is the foundational paper establishing that CoT measurably improves model performance. The key insight is that intermediate steps in a reasoning chain improve the probability distribution for subsequent tokens, making better final answers more likely. This is why agents are designed to reason before acting.
Consider a financial query: "Compare the Q3 2024 revenue of Apple and Microsoft." Without CoT, an agent might immediately search and return the first numbers it finds. With CoT, the agent would first note that the two companies have different fiscal year calendars, plan separate searches with appropriate date qualifiers, and then note in its response that the figures represent different calendar periods. The reasoning trace catches the ambiguity before it becomes an error in the output.
“ReAct synergizes reasoning and acting in language models, generating verbal reasoning traces and text actions in an interleaved manner.”
Yao et al., 2022 - ReAct: Synergizing Reasoning and Acting in Language Models, arXiv:2210.03629
ReAct (Reasoning plus Acting) is the dominant agent loop architecture in production systems. By interleaving reasoning tokens with action tokens in a single context, the model can adapt its reasoning based on tool results in real time, rather than committing to a complete plan before any action.
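The interleaving is easiest to see in a trace. The sketch below follows the Thought/Action/Observation convention from Yao et al., applied to the fiscal-calendar example above; the dollar figures and search syntax are illustrative.

```python
# Example of the interleaved trace a ReAct-style agent accumulates,
# in the Thought/Action/Observation format of Yao et al. (2022).
# Figures and tool syntax are illustrative.
REACT_TRACE = """\
Thought: I need Apple's Q3 2024 revenue; Apple's fiscal Q3 ends in June.
Action: search[Apple fiscal Q3 2024 revenue]
Observation: Apple reported $85.8B revenue for fiscal Q3 2024.
Thought: Now Microsoft; its fiscal year ends in June, so the calendars differ.
Action: search[Microsoft revenue quarter ending June 2024]
Observation: Microsoft reported $64.7B for the quarter ending June 2024.
Thought: I have both figures and should flag the fiscal-calendar mismatch.
Action: finish[Both figures cover the quarter ending June 2024: Apple $85.8B, Microsoft $64.7B.]"""
```

Each Observation is injected by the application layer after executing the preceding Action, so the next Thought can react to real results rather than a pre-committed plan.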
With chain-of-thought reasoning established, the discussion turns to a comparison of the main planning strategies built on it.
Three planning strategies cover the majority of agent use cases. Choosing the right one for a task is a consequential design decision.
ReAct (Reasoning plus Acting) interleaves reasoning and action on every step. The agent writes a reasoning trace, calls a tool, observes the result, writes another trace, and continues. This is appropriate when the plan cannot be determined upfront because each step reveals information needed for the next, as with research tasks or exploratory queries. The risk is verbosity: reasoning tokens accumulate on every step, and cost grows with them.
Plan-and-Execute generates a full plan first, then executes each step with a separate, often lighter-weight, executor. Wang et al. (2023) showed this improves performance on long-horizon tasks by separating planning from execution, allowing a capable model to plan while a faster, cheaper model executes. The risk is that the plan may become invalid if early steps produce unexpected results, and the executor may not know to update the plan.
Reflection generates an initial response, then has the agent review its own output against the original goal and correct errors before returning the final answer. Appropriate for quality-critical tasks such as writing and analysis. The risk is doubled latency and the possibility that the agent's self-critique misses the same errors it made initially.
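The structural difference between ReAct and Plan-and-Execute comes down to when planning happens. A minimal sketch of the latter, with hypothetical `planner` and `executor` callables standing in for a capable planning model and a cheaper execution model:

```python
# Sketch of Plan-and-Execute: the planner writes the full step list once,
# then a (typically cheaper) executor runs each step in order.
# planner and executor are hypothetical stand-ins for two model calls.
def plan_and_execute(goal, planner, executor, max_steps=10):
    steps = planner(goal)[:max_steps]        # plan once, bounded up front
    results = []
    for step in steps:
        # The executor sees prior results but does not revise the plan;
        # that rigidity is the strategy's main risk.
        results.append(executor(step, results))
    return results

plan = lambda goal: ["search Apple fiscal Q3 2024 revenue",
                     "search Microsoft revenue for quarter ending June 2024",
                     "compare the two figures"]
execute = lambda step, prior: f"done: {step}"
print(plan_and_execute("Compare Q3 revenue", plan, execute)[-1])
# → done: compare the two figures
```

The step list doubles as a natural bound on the loop, which is why this strategy terminates more predictably than open-ended ReAct on well-defined tasks.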
Common misconception
“An agent that reasons more always produces better outputs.”
Reasoning quality matters more than reasoning quantity. An agent can produce long chains of plausible-sounding reasoning that still arrive at wrong conclusions, a failure mode sometimes called reasoning theatre. Plan-and-Execute with a step limit often outperforms open-ended ReAct on well-defined tasks because it forces the agent to commit to a plan rather than accumulate unbounded reasoning. Measure task completion rate and accuracy, not reasoning length.
Common misconception
“The observe-think-act loop is unique to AI agents.”
The observe-orient-decide-act (OODA) loop, developed by military strategist John Boyd in the 1970s for jet fighter tactics, describes the same pattern. ReAct is a specific implementation of this loop for LLMs. Understanding OODA helps when designing the observe and act phases: observations must be timely and accurate, and actions must be reversible where possible. OODA also emphasises that faster loop cycles confer advantage, which maps directly to why async tool execution matters in agent design.
With the planning strategies compared, the discussion turns to tool selection and its failure modes.
When an agent has multiple tools available, it selects between them by reading each tool's description field in the JSON schema and choosing the most relevant one for the current situation. Description quality directly affects tool selection accuracy. A weak description such as "Gets data from various sources" gives the model nothing to discriminate on. A strong description specifies when to use the tool, what it returns, and when not to use it.
Four common tool selection failure modes have clear causes and fixes. Wrong tool chosen means the descriptions overlap or the relevant one is less specific. Tool called with wrong parameters means the parameter descriptions are incomplete. No tool called when one should be means the description does not match the user's phrasing and needs synonyms or example triggers. Tool called unnecessarily means there is no "do not use when" guidance in the description.
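The contrast between weak and strong descriptions is concrete in the schema itself. The sketch below uses the common OpenAI-style function schema; the `search_flights` tool and its fields are illustrative examples, not a fixed API.

```python
# Weak vs. strong tool descriptions (OpenAI-style function schema).
# The tool names and field contents are illustrative.

# Weak: nothing to discriminate on, no parameters, no exclusions.
WEAK_TOOL = {
    "name": "get_data",
    "description": "Gets data from various sources.",
}

# Strong: says when to use it, what it returns, and when NOT to use it,
# and documents each parameter so the model fills them correctly.
STRONG_TOOL = {
    "name": "search_flights",
    "description": (
        "Search commercial flight availability and prices for a specific "
        "route and date. Use when the user asks about flights, airfare, or "
        "flying between cities. Returns a list of flights with carrier, "
        "departure/arrival times, and price. Do NOT use for hotels, "
        "ground transport, or calendar checks."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string",
                       "description": "Departure city or IATA airport code, e.g. 'London' or 'LHR'"},
            "destination": {"type": "string",
                            "description": "Arrival city or IATA airport code"},
            "date": {"type": "string",
                     "description": "Departure date in ISO format, YYYY-MM-DD"},
        },
        "required": ["origin", "destination", "date"],
    },
}
```

Each of the four failure modes maps to a field here: overlapping descriptions and missing "do not use" guidance live in `description`, while wrong-parameter errors trace back to the `properties` entries.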
An agent with more than fifteen to twenty tools suffers from selection confusion. The model must read all descriptions and choose correctly under a fixed attention budget. Start with the minimum set of tools that can accomplish the task. Add tools incrementally and measure whether each addition improves or degrades task completion rate before proceeding.
Tool descriptions are effectively prompts. Every hour invested in writing clear, specific descriptions pays off in reduced misrouting errors across millions of agent invocations.
You are building a travel booking agent with search_flights, search_hotels, and check_calendar tools. A user asks: 'Find me flights and hotels for May 12th and make sure I have nothing in my calendar that week.' Which planning strategy fits best?
In the same travel agent, which of the three tools can run in parallel, and which must wait?
A colleague says the travel agent keeps choosing search_hotels when the user asks about flights. Given what you know about tool selection, what is the most likely cause?
Given the 2023 AutoGPT/BabyAGI infinite loop incident, which design safeguard would most directly have prevented the failure?
Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
NeurIPS 2022
Original CoT paper; establishes empirically that intermediate reasoning steps improve LLM performance on arithmetic, commonsense, and multi-step tasks. Cited in Section 6.2.
Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models
arXiv:2210.03629
Introduces the ReAct pattern, the dominant agent loop architecture in production frameworks. Cited in Section 6.2 and the planning strategies comparison.
Wang, L. et al. (2023). Plan-and-Solve Prompting
arXiv:2305.04091
Formal analysis of plan-then-execute strategies for long-horizon tasks. Cited in Section 6.3 as evidence for Plan-and-Execute performance benefits.
OpenAI Function Calling Best Practices
platform.openai.com/docs/guides/function-calling
Practical guidance on tool schema design, including description writing and parameter definitions. Applies across providers. Cited in Section 6.4.
OWASP Top 10 for Large Language Model Applications 2025
LLM04:2025 Insufficient Input Handling and LLM08:2025 Excessive Agency
The OWASP standard for LLM application security. Referenced for tool selection failure modes and the importance of bounding agent autonomy.
Module 6 of 25 · Core Concepts