Core concepts · Module 3

Memory and context


45 min · 3 outcomes · Core concepts

Previously: Tools and actions. Tools are functions that agents can call to interact with the world.

This module: Memory and context.

Next: Design patterns. For complex tasks, planning before acting often works better than interleaved reasoning.

Progress

Mark this module complete when you can explain it without rereading every paragraph.

Why this matters

Context windows are limited, and most agent failures are now context failures rather than model failures. Getting short-term and long-term memory right determines whether an agent stays coherent within a session and across sessions.

What you will be able to do

  1. Distinguish between short-term and long-term memory and choose sensibly.
  2. Explain context windows and why long conversations degrade.
  3. Use external memory, such as a vector store, without treating it as truth.

Before you begin

  • Foundations-level understanding of this course
  • Confidence with key terms introduced in Stage 1

Common ways people get this wrong

  • Leaky recall. If retrieval is not filtered, the model can pull in private or unrelated information.
  • Stale assumptions. Old memory can override new instructions and create confusing behaviour.

Main idea at a glance

[Interactive diagram: Agent Memory Architecture. Stage 1, User Message: the current user query or command.]

2.3.1 Types of agent memory

Short-Term Memory

Information held during a single conversation. Includes the chat history, current task details, and temporary working state. Lost when the session ends.

Long-Term Memory

Persistent information that survives across sessions. User preferences, learned facts, and historical interactions. Stored in databases.

Context Window

The maximum amount of text an LLM can process at once. For a 128K context window, roughly 96,000 words. Anything beyond this limit is simply not seen by the model.
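A quick way to reason about these limits: English text averages roughly four characters (about three-quarters of a word) per token. This heuristic is approximate, not real tokenisation, but it explains the 128K-tokens-to-96,000-words figure above:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return len(text) // 4

# A 128K-token window at ~0.75 words per token holds roughly 96,000 words.
context_tokens = 128_000
approx_words = context_tokens * 3 // 4
print(approx_words)  # 96000
```

For precise counts in production, use the tokeniser that matches your model rather than a character heuristic.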

🎯 Interactive. Memory architecture explorer

Explore the different types of agent memory through this interactive tool. Test your understanding of when to use short-term, long-term, and external memory for different queries.


2.3.2 Managing Conversation History

As conversations grow, they eventually exceed the context window. You need strategies to handle this.

"""
Conversation Memory Management
==============================
Strategies for maintaining conversation context.
"""

from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime


@dataclass
class Message:
    """A single message in conversation history."""
    role: str  # "user", "assistant", or "system"
    content: str
    timestamp: datetime = field(default_factory=datetime.now)
    token_count: int = 0


class ConversationMemory:
    """
    Manages conversation history with context window awareness.
    """
    
    def __init__(self, max_tokens: int = 4000, preserve_system: bool = True):
        """
        Initialise conversation memory.
        
        Args:
            max_tokens: Maximum tokens to keep in history
            preserve_system: Always keep the system message
        """
        self.max_tokens = max_tokens
        self.preserve_system = preserve_system
        self.messages: List[Message] = []
        self.system_message: Optional[Message] = None
    
    def add_message(self, role: str, content: str) -> None:
        """Add a message to history."""
        # Estimate token count (roughly 4 chars per token)
        token_count = len(content) // 4
        
        message = Message(
            role=role,
            content=content,
            token_count=token_count
        )
        
        if role == "system":
            self.system_message = message
        else:
            self.messages.append(message)
        
        # Prune if needed
        self._prune_to_fit()
    
    def _prune_to_fit(self) -> None:
        """Remove old messages to fit within token limit."""
        current_tokens = sum(m.token_count for m in self.messages)
        
        if self.system_message:
            current_tokens += self.system_message.token_count
        
        # Remove oldest messages until we fit
        while current_tokens > self.max_tokens and len(self.messages) > 2:
            removed = self.messages.pop(0)
            current_tokens -= removed.token_count
    
    def get_messages(self) -> List[dict]:
        """Get messages in format suitable for LLM."""
        result = []
        
        if self.system_message:
            result.append({
                "role": "system",
                "content": self.system_message.content
            })
        
        for msg in self.messages:
            result.append({
                "role": msg.role,
                "content": msg.content
            })
        
        return result
    
    def summarise_and_compress(self, summariser_fn) -> None:
        """
        Summarise older messages to save tokens.
        
        Args:
            summariser_fn: Function that takes a transcript string and returns a short summary
        """
        if len(self.messages) < 10:
            return
        
        # Take oldest 70% of messages
        split_point = int(len(self.messages) * 0.7)
        old_messages = self.messages[:split_point]
        recent_messages = self.messages[split_point:]
        
        # Summarise old messages
        old_content = "\n".join(
            f"{m.role}: {m.content}" 
            for m in old_messages
        )
        summary = summariser_fn(old_content)
        
        # Replace with summary
        summary_message = Message(
            role="system",
            content=f"[Summary of earlier conversation: {summary}]",
            token_count=len(summary) // 4
        )
        
        self.messages = [summary_message] + recent_messages
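The pruning idea in `_prune_to_fit` can also be expressed as a standalone function over plain message dicts. This is a minimal sketch; the function name `prune_history` and the 4-characters-per-token estimate are illustrative, mirroring the class above:

```python
def prune_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest non-system messages until the estimated total fits."""
    def est(msg: dict) -> int:
        return len(msg["content"]) // 4  # ~4 chars per token

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    total = sum(est(m) for m in system + rest)

    # Keep at least the two most recent messages, as the class above does
    while total > max_tokens and len(rest) > 2:
        total -= est(rest.pop(0))
    return system + rest
```

The system message survives pruning regardless of age, because it usually carries instructions the agent must never forget.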

2.3.3 Vector Databases for Semantic Memory

When you need to remember things across sessions or search through large amounts of information, vector databases are essential.

Vector Database

A database that stores information as numerical vectors (lists of numbers). Similar items have similar vectors. This allows semantic search, finding things by meaning, not just keywords.

How Vector Search Works

[Interactive diagram: How Vector Search Works. Stage 1, Document: raw text that needs to be stored and made searchable.]
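Under the hood, similarity is usually measured with cosine similarity between embedding vectors. Here is a toy in-memory sketch of the search step; real systems use an embedding model and an approximate nearest-neighbour index, and the two-dimensional vectors and store contents below are purely illustrative:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(store: list[dict], query_vec: list[float], k: int = 1) -> list[dict]:
    """Return the k items whose vectors are most similar to the query."""
    return sorted(
        store,
        key=lambda item: cosine_similarity(item["vector"], query_vec),
        reverse=True,
    )[:k]

store = [
    {"text": "resetting a password", "vector": [0.9, 0.1]},
    {"text": "quarterly sales report", "vector": [0.1, 0.9]},
]
top = search(store, query_vec=[1.0, 0.0])
print(top[0]["text"])  # resetting a password
```

The point of the sketch: nothing matched on keywords; the "nearest" item won because its vector points in a similar direction to the query's.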

2.3.4 Context engineering

Context engineering is the most important shift in how practitioners think about working with AI. It replaces the narrower idea of "prompt engineering" with something much broader and more useful.

Context engineering

The discipline of designing dynamic systems that provide the right information and tools, in the right format, at the right time, to give an AI everything it needs to accomplish a task. Most agent failures are now context failures, not model failures.

Andrej Karpathy put it well in June 2025 when he described context engineering as the art and science of filling the context window with just the right information for the next step. He proposed thinking of an LLM like a CPU and the context window as RAM. Your job as a builder is like an operating system. You load that working memory with the right code and data for the task at hand.

Context Engineering Framework

[Interactive diagram: Context Engineering Framework. Stage 1, Write: scratchpads, memory files, and persistent logs that store information outside the active context window.]

I think this is the most overlooked strategy. Writing context out and retrieving it later scales much better than trying to fit everything inline.

Four context strategies

Effective context engineering uses four complementary strategies.

Write. Save context outside the context window so it persists across turns. This includes scratchpads, memory files, persistent logs, and queryable storage. The key insight is that not everything needs to live in the active context. Write important information out, then bring it back when it is needed.

Select. Pull relevant context into the window when the agent needs it. This is what retrieval augmented generation does. Instead of hoping the model memorised the right facts, you search for relevant documents and inject them at the right moment.

Compress. Retain only the tokens required for the current task. Conversation histories grow quickly. Summarising older exchanges, discarding irrelevant tool outputs, and setting token budgets keeps the context window focused.

Isolate. Split context across multiple agents, each with a narrow sub-task focus. A research agent does not need access to the email sending tool. A code review agent does not need the deployment credentials. Isolation is both a security and a performance strategy.

2.3.5 The RAG decision hierarchy

Before reaching for fine-tuning or complex architectures, I follow a practical hierarchy that saves time and money.

The customisation hierarchy

  1. Prompt engineering and context engineering

     Start here. Most problems can be solved by providing better context, clearer instructions, or relevant examples in the prompt. This costs nothing extra and takes minutes to test.

  2. Retrieval augmented generation (RAG)

     If the model needs access to specific documents, data, or knowledge it was not trained on, add a retrieval layer. Hybrid search combining keyword matching (BM25) with dense retrieval gives the best results. Chunk documents at 200 to 500 words with overlap.

  3. LoRA fine-tuning

     If you need the model to adopt a specific style, follow domain conventions, or handle specialised terminology consistently, fine-tune with LoRA. This adjusts a small fraction of model parameters and can be done for 5 to 15 dollars on consumer hardware.

  4. Distillation

     Train a smaller, cheaper model to imitate a larger one on your specific task. Useful when you need to reduce costs at scale without losing too much quality.

  5. Full training

     Only consider this if you have a unique dataset, a clear competitive advantage from a custom model, and the budget to match. This is rarely the right answer.
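The chunking guidance in step 2 of the hierarchy can be sketched as a word-based splitter with overlap. The defaults here (300-word chunks, 50-word overlap) are illustrative values within the suggested 200 to 500 word range:

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-window chunks for retrieval."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already covers the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which matters because retrieval returns chunks, not documents.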

Common mistake

Jumping to fine-tuning too early

I see teams reach for fine-tuning before they have tried better prompts or RAG. Fine-tuning is a commitment. It requires dataset curation, evaluation, and ongoing maintenance. Always exhaust the cheaper options first.

Mental model

Context is a budget

Context windows are limited. Memory systems extend context, but they also introduce new risks.

  1. Conversation
  2. Context window
  3. Memory store
  4. Retrieval
  5. Safety policy

Assumptions to keep in mind

  • Memory is relevant. Storing everything is not memory, it is hoarding. Retrieval needs relevance, not volume.
  • Sensitive data is protected. Memory can become a shadow database. Treat it like one.


Key terms

Short-Term Memory
Information held during a single conversation. Includes the chat history, current task details, and temporary working state. Lost when the session ends.
Long-Term Memory
Persistent information that survives across sessions. User preferences, learned facts, and historical interactions. Stored in databases.
Context Window
The maximum amount of text an LLM can process at once. For a 128K context window, roughly 96,000 words. Anything beyond this limit is simply not seen by the model.
Vector Database
A database that stores information as numerical vectors (lists of numbers). Similar items have similar vectors. This allows semantic search, finding things by meaning, not just keywords.
Context engineering
The discipline of designing dynamic systems that provide the right information and tools, in the right format, at the right time, to give an AI everything it needs to accomplish a task. Most agent failures are now context failures, not model failures.

Check yourself

Quick check. Context engineering


What is the core difference between prompt engineering and context engineering?

Prompt engineering focuses on crafting individual prompts. Context engineering focuses on designing dynamic systems that provide the right information and tools at the right time.

Name two context engineering strategies used here

Any two of Write, Select, Compress, or Isolate.

Why does isolation improve both security and performance?

It limits each agent's context to only what it needs, reducing attack surface and keeping the context window focused on relevant information.

Quick check. Memory and context


What is the context window?

The maximum amount of text the model can consider at once. If you exceed it, earlier detail drops out.

Name three types of memory an agent might use

Short term conversation history, long term stored facts or preferences, and external memory such as a vector database or knowledge base.

Scenario. A conversation is too long. What is a sensible strategy?

Summarise the older messages into a short system note, keep the recent messages, and verify important facts before acting.

What does a vector database help with?

Semantic search. Finding similar content by meaning rather than exact keywords.

Artefact and reflection

Artefact

A memory strategy for one realistic agent you want to build.

Reflection

Where in your work would distinguishing between short-term and long-term memory change a decision, and what evidence would make you trust that change?

Optional practice

Assemble context from building blocks, watch token budget allocation, and try the Write, Select, Compress, and Isolate strategies.