Real-world deployment · March 2024
In March 2024, Cognition AI demonstrated Devin, an AI software engineering agent that could take an unresolved GitHub issue, explore a repository, write a fix, run the tests, and open a pull request. The demonstration used a real repository and a real open issue, not a prepared scenario.
What made this impressive was not the large language model (LLM) at the centre but the scaffolding around it. Devin had tools: a shell, a code editor, a browser, and a test runner. It had a loop that called those tools, observed the results, and decided what to do next. It had stop conditions and error handling for when a tool call failed.
That scaffolding, the agent loop, is what this module teaches you to build from scratch. The model provides reasoning. You provide the structure that turns reasoning into reliable, repeatable action.
What exactly happens inside the agent between receiving an issue description and submitting a pull request? And what would break first if you built something similar yourself?
The Core Concepts stage gave you the theory. This stage turns theory into running code. You will build a complete agent from scratch, implementing the ReAct pattern with real tools, real error handling, and real debugging.
This module begins with project structure, because the decisions you make before writing any code determine how easily you can test and debug the agent later.
Resist the temptation to put everything in one file. Separating tool implementations, tool schemas, and the agent loop pays dividends the moment you need to debug why the agent chose the wrong tool or passed the wrong argument. The structure below treats tools as ordinary Python functions that can be unit tested independently, before the agent loop ever runs.
Schemas are declared in their own file so you can update a description without touching the implementation. The agent loop in agent.py contains no business logic: it only drives the conversation between the model and the tools.
research-agent/
├── .env                # API keys (never commit)
├── .gitignore
├── requirements.txt
├── src/
│   ├── agent.py        # Agent loop only
│   ├── tools.py        # Tool implementations
│   ├── schemas.py      # Tool schemas for the API
│   └── utils.py        # Logging helpers
└── tests/
    ├── test_tools.py   # Unit tests per tool
    └── test_agent.py   # Integration test
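A plausible requirements.txt for this layout, based on the libraries used later in this module (versions are left unpinned here; pin whatever you actually install):

# requirements.txt
anthropic          # Messages API client used by the agent loop
python-dotenv      # Loads API keys from .env
simpleeval         # Restricted expression evaluator for the calculate tool
pytest             # Unit tests in tests/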
“The hardest part of building an agent is not the model call. It is the scaffolding that decides when to stop, what to do when a tool fails, and how to give the model enough context to make a good decision.”
Anthropic Engineering, 2024 - Building Effective Agents
This is not a problem the model solves for you. The developer controls the loop, the stop conditions, and the error boundaries. Getting these right is the practical work of this module.
With the project structure settled, the next step is to define and test the tools before touching the agent loop.
Tools are ordinary Python functions. Build and test them completely before connecting them to the agent loop. This matters because a broken tool produces a confusing mid-run result, whereas a failing unit test tells you immediately which function is wrong and why.
One safety rule applies to every tool that evaluates expressions: never pass model-generated strings to Python's built-in eval(). The model generates tool arguments. A prompt injection attack could cause the model to generate a malicious expression. Use a restricted evaluator such as simpleeval (available via pip) that allows only arithmetic operations, blocking file access and imports entirely.
“Never use eval() to execute model-generated code. If you pass model-generated arguments directly to Python's built-in eval(), a prompt injection attack could execute arbitrary code on your system.”
OWASP Top 10 for Large Language Model Applications, 2025 - LLM05: Improper Output Handling
This is not a theoretical risk. Any tool that accepts a string and evaluates it is a direct injection path. The fix is to use a restricted evaluator or a purpose-built calculation API that explicitly limits what operations are allowed.
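A minimal sketch of src/tools.py along these lines, assuming simpleeval is installed and using the dict-result convention the tests below expect ({"result": ...} on success, {"error": ...} on failure); search_web is omitted here:

# src/tools.py (excerpt) -- a minimal sketch; search_web is omitted.
from simpleeval import simple_eval

def calculate(expression: str) -> dict:
    """Safely evaluate an arithmetic expression without eval()."""
    try:
        return {"result": simple_eval(expression)}
    except Exception as exc:
        # Malformed or disallowed expressions become structured errors
        # the agent loop can pass back to the model.
        return {"error": f"{type(exc).__name__}: {exc}"}

def write_report(title: str, sections: list[dict]) -> dict:
    """Assemble a Markdown report from {heading, content} sections."""
    lines = [f"# {title}"]
    for section in sections:
        lines.append(f"## {section['heading']}")
        lines.append(section["content"])
    return {"report": "\n\n".join(lines), "sections": len(sections)}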
Write unit tests for each tool independently. Run them before you start the agent loop. This isolates failures and shortens feedback cycles dramatically.
# tests/test_tools.py
from src.tools import search_web, calculate, write_report

def test_calculate_addition():
    result = calculate("2 + 2")
    assert result["result"] == 4

def test_calculate_malformed_expression():
    result = calculate("not_a_number + 5")
    assert "error" in result

def test_write_report_structure():
    result = write_report(
        title="Test Report",
        sections=[{"heading": "Introduction", "content": "Content here."}],
    )
    assert "# Test Report" in result["report"]
    assert result["sections"] == 1

With tools defined and tested, attention turns to how the model decides to call them: schema quality is the primary lever for agent behaviour.
The tool schema is what the model reads to decide whether and how to call a tool. Vague descriptions produce wrong tool selection. Missing parameter descriptions produce wrong arguments. Schema quality determines agent behaviour more directly than any system prompt trick.
Two description patterns matter most. First, say what the tool is for and what it is not for: "Search the web for current information. Do not use for calculations." Second, include a sequencing hint on completion tools: "Use this as the final step after all searches are complete. Do not call until research is finished." Without that signal, the agent has no clear completion condition.
# src/schemas.py
TOOL_SCHEMAS = [
    {
        "name": "search_web",
        "description": "Search the web for current information. Use this when you need facts or recent news. Do not use for calculations.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "A specific, focused search query. Use precise terms."
                }
            },
            "required": ["query"]
        }
    },
    {
        # calculate is wired into TOOL_REGISTRY in agent.py, so it needs a
        # schema here too, or the model can never call it.
        "name": "calculate",
        "description": "Evaluate an arithmetic expression. Use this for numeric calculations instead of working them out in prose.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "A plain arithmetic expression, e.g. '(12 * 7) / 3'."
                }
            },
            "required": ["expression"]
        }
    },
    {
        "name": "write_report",
        "description": "Compile research into a structured report. Use this as the FINAL step after all searches are done. Do not call until research is complete.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "sections": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "heading": {"type": "string"},
                            "content": {"type": "string"}
                        }
                    }
                }
            },
            "required": ["title", "sections"]
        }
    }
]

Common misconception
“Better system prompts always fix incorrect tool selection.”
System prompts are read once at the top of the conversation. Tool descriptions are read every time the model considers calling a tool. Fix incorrect tool selection by rewriting the tool description first. Add 'Do not use this tool for X' when the model calls the wrong tool for a job. The system prompt is a secondary lever.
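A concrete before-and-after, using a notification tool as the example (the exact wording here is illustrative, not canonical):

# Before: vague -- the model has no signal about when NOT to call it.
{"name": "send_notification", "description": "Sends a notification."}

# After: scoped, with an explicit exclusion.
{
    "name": "send_notification",
    "description": (
        "Send a notification to the user. Use only when the user explicitly "
        "asks to be notified. Do not use during research or report writing."
    ),
}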
With well-described schemas in place, everything is ready for the agent loop itself: a complete implementation.
The agent loop has three states: the model generates a final response (stop_reason == "end_turn"), the model requests tool calls (stop_reason == "tool_use"), or the safety step limit is reached. Every production agent needs all three states handled explicitly.
Set MAX_STEPS as a hard limit before you begin. Without it, a misbehaving agent can loop indefinitely, consuming tokens and incurring costs until you manually terminate the process.
# src/agent.py
import json
import logging

import anthropic
from dotenv import load_dotenv

from src.tools import search_web, calculate, write_report
from src.schemas import TOOL_SCHEMAS

load_dotenv()
logger = logging.getLogger(__name__)
client = anthropic.Anthropic()

TOOL_REGISTRY = {
    "search_web": search_web,
    "calculate": calculate,
    "write_report": write_report,
}
MAX_STEPS = 15

def run_agent(user_request: str) -> str:
    messages = [{"role": "user", "content": user_request}]
    step = 0
    while step < MAX_STEPS:
        step += 1
        logger.info(f"Step {step}/{MAX_STEPS}")
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            system="You are a research assistant...",
            tools=TOOL_SCHEMAS,
            messages=messages,
        )
        logger.info(f"Stop reason: {response.stop_reason}")
        if response.stop_reason == "end_turn":
            text_blocks = [b for b in response.content if hasattr(b, "text")]
            return text_blocks[0].text if text_blocks else "Task completed."
        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type != "tool_use":
                    continue
                logger.info(f"Tool call: {block.name}")
                fn = TOOL_REGISTRY.get(block.name)
                if fn:
                    try:
                        result = fn(**block.input)
                    except Exception as exc:
                        result = {"error": f"{type(exc).__name__}: {exc}"}
                else:
                    result = {"error": f"Unknown tool: {block.name}"}
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result),
                })
            messages.append({"role": "user", "content": tool_results})
return "Unable to complete the task within the step limit."}The step limit is a safety belt, not a design goal. If your agent routinely hits the limit, the problem is in the system prompt or tool descriptions, not the limit itself. Raise the limit only after fixing the underlying cause.
With a complete agent loop in hand, the remaining skill is debugging it with structured logs.
Structured logging turns an opaque agent run into a readable decision trace. Log the step number, the stop reason, the tool name, and the result keys returned. Reading this sequence tells you whether the agent is choosing the right tools, passing sensible arguments, and making progress toward the goal.
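The project layout reserves src/utils.py for logging helpers. One possible shape, assuming you want one JSON line per loop iteration (the field names here are illustrative, and the logger calls in agent.py could be swapped for log_step):

# src/utils.py -- a sketch of the logging helpers; field names are illustrative.
import json
import logging

def setup_logging(level: int = logging.INFO) -> None:
    """Configure a single console handler with timestamps for the whole run."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )

def log_step(logger: logging.Logger, step: int, stop_reason: str,
             tool: str | None = None,
             result_keys: list[str] | None = None) -> None:
    """Emit one JSON line per iteration: step, stop reason, tool, result keys."""
    logger.info(json.dumps({
        "step": step,
        "stop_reason": stop_reason,
        "tool": tool,
        "result_keys": result_keys,
    }))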
When the agent loops to the step limit without producing a result, inspect the last three tool calls. The most common causes: a completion condition the model cannot satisfy (rewrite the system prompt to clarify what "done" means); a tool returning an error the model interprets as a reason to keep searching; two tools with overlapping descriptions causing the model to alternate between them.
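A small helper makes that inspection mechanical. This sketch is hypothetical: it assumes you expose the messages transcript from run_agent (it is a local variable in the version above), where assistant turns carry SDK content blocks:

# Hypothetical helper: pull the last n tool calls out of the transcript.
def last_tool_calls(messages: list[dict], n: int = 3) -> list[tuple[str, dict]]:
    calls = []
    for msg in messages:
        if msg["role"] != "assistant":
            continue
        for block in msg["content"]:
            # Assistant turns appended by run_agent hold SDK content blocks.
            if getattr(block, "type", None) == "tool_use":
                calls.append((block.name, block.input))
    return calls[-n:]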
Common misconception
“If the agent reaches the step limit, just raise MAX_STEPS.”
Reaching the step limit is a symptom, not the problem. Inspect the last few tool calls in the log. The agent is either missing a clear completion signal, receiving a tool error it cannot handle, or choosing the wrong tool repeatedly. Fix the description or system prompt that causes the loop before touching the limit.
Check your understanding

1. You are building a research agent. It reaches step 15 and returns "unable to complete." The logs show it calls search_web repeatedly but never calls write_report. What is the most likely cause?
2. You add a new send_notification tool with the description "Sends a notification." During a research task, the agent starts calling it unexpectedly. What is the best fix?
3. What is the correct testing order for a new agent project?
4. You are deciding whether to use Python's built-in eval() to execute mathematical expressions passed as tool arguments by the model. What should you do?
Anthropic, 'Building Effective Agents' (2024)
Agents section: tool use patterns and agent loop design
Primary reference for the agent loop structure, tool schema patterns, and stop condition handling used throughout this module.
Anthropic Tool Use Documentation
docs.anthropic.com/en/docs/build-with-claude/tool-use
Official format reference for tool schemas and result injection. Consulted for the TOOL_SCHEMAS and tool_results patterns in Section 11.4.
OWASP Top 10 for Large Language Model Applications 2025
LLM05: Improper Output Handling
Authoritative source for the eval() injection risk discussed in Section 11.2. Defines the category of improper output handling in agentic contexts.
simpleeval
github.com/danthedeckie/simpleeval
Recommended restricted expression evaluator for safe arithmetic in agent tools. Cited in Section 11.2 as the safe alternative to Python's eval().
pytest, 'Getting Started'
docs.pytest.org
Standard Python testing framework. Unit test patterns in Section 11.2 follow pytest conventions.
Module 11 of 25 · Practical Building