Core concepts · Module 3
Memory and context
Previously
Tools and actions
Tools are functions that agents can call to interact with the world.
This module
Memory and context
How agents hold short-term state, persist long-term facts, and work within a limited context window.
Next
Design patterns
For complex tasks, planning before acting often works better than interleaved reasoning.
Progress
Mark this module complete when you can explain it without rereading every paragraph.
Why this matters
Most agent failures are now context failures, not model failures. How an agent stores, retrieves, and forgets information determines whether it behaves coherently across a long task or session.
What you will be able to do
1. Distinguish between short-term and long-term memory and choose sensibly.
2. Explain context windows and why long conversations degrade.
3. Use external memory, such as a vector store, without treating it as truth.
Before you begin
- Foundations-level understanding of this course
- Confidence with key terms introduced in Stage 1
Common ways people get this wrong
- Leaky recall. If retrieval is not filtered, the model can pull in private or unrelated information.
- Stale assumptions. Old memory can override new instructions and create confusing behaviour.
Main idea at a glance
[Interactive diagram: Agent Memory Architecture. Stage 1, User Message: the current user query or command.]
2.3.1 Types of agent memory
Short-Term Memory
Information held during a single conversation. Includes the chat history, current task details, and temporary working state. Lost when the session ends.
Long-Term Memory
Persistent information that survives across sessions. User preferences, learned facts, and historical interactions. Stored in databases.
Context Window
The maximum amount of text an LLM can process at once. For a 128K context window, roughly 96,000 words. Anything beyond this limit is simply not seen by the model.
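As a rough sanity check, the 4-characters-per-token heuristic used later in this module can be turned into a quick budget check. This is a sketch: real tokenisers are model-specific, so treat the numbers as estimates.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Real tokenisers differ by model, so this is a budget estimate,
    # not an exact count.
    return len(text) // 4

def fits_in_window(text: str, window_tokens: int = 128_000) -> bool:
    """Check whether text plausibly fits in the model's context window."""
    return estimate_tokens(text) <= window_tokens
```

Anything that fails this check needs pruning, summarising, or external storage before the model will ever see it.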
🎯 Interactive. Memory architecture explorer
Explore the different types of agent memory through this interactive tool. Test your understanding of when to use short-term, long-term, and external memory for different queries.
2.3.2 Managing conversation history
As conversations grow, they eventually exceed the context window. You need strategies to handle this.
"""
Conversation Memory Management
==============================
Strategies for maintaining conversation context.
"""
from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime
@dataclass
class Message:
"""A single message in conversation history."""
role: str # "user", "assistant", or "system"
content: str
timestamp: datetime = field(default_factory=datetime.now)
token_count: int = 0
class ConversationMemory:
"""
Manages conversation history with context window awareness.
"""
def __init__(self, max_tokens: int = 4000, preserve_system: bool = True):
"""
Initialise conversation memory.
Args:
max_tokens: Maximum tokens to keep in history
preserve_system: Always keep the system message
"""
self.max_tokens = max_tokens
self.preserve_system = preserve_system
self.messages: List[Message] = []
self.system_message: Optional[Message] = None
def add_message(self, role: str, content: str) -> None:
"""Add a message to history."""
# Estimate token count (roughly 4 chars per token)
token_count = len(content) // 4
message = Message(
role=role,
content=content,
token_count=token_count
)
if role == "system":
self.system_message = message
else:
self.messages.append(message)
# Prune if needed
self._prune_to_fit()
def _prune_to_fit(self) -> None:
"""Remove old messages to fit within token limit."""
current_tokens = sum(m.token_count for m in self.messages)
if self.system_message:
current_tokens += self.system_message.token_count
# Remove oldest messages until we fit
while current_tokens > self.max_tokens and len(self.messages) > 2:
removed = self.messages.pop(0)
current_tokens -= removed.token_count
def get_messages(self) -> List[dict]:
"""Get messages in format suitable for LLM."""
result = []
if self.system_message:
result.append({
"role": "system",
"content": self.system_message.content
})
for msg in self.messages:
result.append({
"role": msg.role,
"content": msg.content
})
return result
def summarise_and_compress(self, summariser_fn) -> None:
"""
Summarise older messages to save tokens.
Args:
summariser_fn: Function that takes messages and returns summary
"""
if len(self.messages) < 10:
return
# Take oldest 70% of messages
split_point = int(len(self.messages) * 0.7)
old_messages = self.messages[:split_point]
recent_messages = self.messages[split_point:]
# Summarise old messages
old_content = "\n".join(
f"{m.role}: {m.content}"
for m in old_messages
)
summary = summariser_fn(old_content)
# Replace with summary
summary_message = Message(
role="system",
content=f"[Summary of earlier conversation: {summary}]",
token_count=len(summary) // 4
)
        self.messages = [summary_message] + recent_messages

2.3.3 Vector databases for semantic memory
When you need to remember things across sessions or search through large amounts of information, vector databases are essential.
Vector Database
A database that stores information as numerical vectors (lists of numbers). Similar items have similar vectors. This allows semantic search, finding things by meaning, not just keywords.
[Interactive diagram: How Vector Search Works. Stage 1, Document: raw text that needs to be stored and made searchable.]
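To make the mechanics concrete, here is a minimal sketch of semantic search over toy vectors. The three-dimensional "embeddings" are made up for illustration; a real system would get vectors from an embedding model and use a proper vector database.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similar meanings -> similar vectors -> cosine close to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy store: text mapped to hand-made vectors. An embedding model
# would produce these in practice, with hundreds of dimensions.
store = {
    "cats are small pets": [0.9, 0.1, 0.0],
    "kittens like to play": [0.8, 0.2, 0.1],
    "the stock market fell": [0.0, 0.1, 0.9],
}

def search(query_vector: list[float], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are closest to the query."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# A query vector near the "cat" region retrieves cat-related text,
# even though the query shares no keywords with the stored strings.
results = search([0.85, 0.15, 0.05])
```

The point of the sketch is the ranking step: nearest vectors win, which is what "finding things by meaning" amounts to in practice.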
2.3.4 Context engineering
Context engineering is the most important shift in how practitioners think about working with AI. It replaces the narrower idea of "prompt engineering" with something much broader and more useful.
Context engineering
The discipline of designing dynamic systems that provide the right information and tools, in the right format, at the right time, to give an AI everything it needs to accomplish a task. Most agent failures are now context failures, not model failures.
Andrej Karpathy put it well in June 2025 when he described context engineering as the art and science of filling the context window with just the right information for the next step. He proposed thinking of an LLM like a CPU and the context window as RAM. Your job as a builder is like an operating system. You load that working memory with the right code and data for the task at hand.
[Interactive diagram: Context Engineering Framework. Stage 1, Write: scratchpads, memory files, and persistent logs that store information outside the active context window.]
I think this is the most overlooked strategy. Writing context out and retrieving it later scales much better than trying to fit everything inline.
Four context strategies
Effective context engineering uses four complementary strategies.
Write. Save context outside the context window so it persists across turns. This includes scratchpads, memory files, persistent logs, and queryable storage. The key insight is that not everything needs to live in the active context. Write important information out, then bring it back when it is needed.
Select. Pull relevant context into the window when the agent needs it. This is what retrieval augmented generation does. Instead of hoping the model memorised the right facts, you search for relevant documents and inject them at the right moment.
Compress. Retain only the tokens required for the current task. Conversation histories grow quickly. Summarising older exchanges, discarding irrelevant tool outputs, and setting token budgets keeps the context window focused.
Isolate. Split context across multiple agents, each with a narrow sub-task focus. A research agent does not need access to the email sending tool. A code review agent does not need the deployment credentials. Isolation is both a security and a performance strategy.
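A hypothetical scratchpad makes the Write and Select strategies concrete. The class name, keys, and matching logic here are invented for illustration, not taken from any particular framework.

```python
class Scratchpad:
    """Minimal sketch of Write and Select: persist facts outside the
    active context, then pull back only what the current step needs."""

    def __init__(self) -> None:
        self.notes: dict[str, str] = {}

    def write(self, key: str, fact: str) -> None:
        # Write: store the fact outside the context window.
        self.notes[key] = fact

    def select(self, keywords: list[str]) -> list[str]:
        # Select: naive keyword matching stands in for real retrieval.
        return [
            fact for key, fact in self.notes.items()
            if any(kw in key or kw in fact for kw in keywords)
        ]

pad = Scratchpad()
pad.write("user.timezone", "User is in UTC+1")
pad.write("deploy.cluster", "Production runs on cluster-7")
relevant = pad.select(["timezone"])  # only this note re-enters the context
```

Compression and isolation would sit on top of this: summarise old notes before writing them, and give each sub-agent its own scratchpad instead of one shared store.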
2.3.5 The RAG decision hierarchy
Before reaching for fine-tuning or complex architectures, I follow a practical hierarchy that saves time and money.
The customisation hierarchy
1. Prompt engineering and context engineering
Start here. Most problems can be solved by providing better context, clearer instructions, or relevant examples in the prompt. This costs nothing extra and takes minutes to test.
2. Retrieval augmented generation (RAG)
If the model needs access to specific documents, data, or knowledge it was not trained on, add a retrieval layer. Hybrid search combining keyword matching (BM25) with dense retrieval gives the best results. Chunk documents at 200 to 500 words with overlap.
3. LoRA fine-tuning
If you need the model to adopt a specific style, follow domain conventions, or handle specialised terminology consistently, fine-tune with LoRA. This adjusts a small fraction of model parameters and can be done for 5 to 15 dollars on consumer hardware.
4. Distillation
Train a smaller, cheaper model to imitate a larger one on your specific task. Useful when you need to reduce costs at scale without losing too much quality.
5. Full training
Only consider this if you have a unique dataset, a clear competitive advantage from a custom model, and the budget to match. This is rarely the right answer.
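The chunking guidance in step 2 can be sketched as a simple word-window splitter. The default sizes follow the 200-to-500-word suggestion above; the function itself is illustrative, not a production chunker.

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap`
    words with the previous chunk so ideas are not cut mid-thought."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

Each chunk would then be embedded and stored; the overlap means a sentence straddling a boundary still appears whole in at least one chunk.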
Common mistake
Jumping to fine-tuning too early
I see teams reach for fine-tuning before they have tried better prompts or RAG. Fine-tuning is a commitment. It requires dataset curation, evaluation, and ongoing maintenance. Always exhaust the cheaper options first.
Mental model
Context is a budget
Context windows are limited. Memory systems extend context, but they also introduce new risks.
1. Conversation
2. Context window
3. Memory store
4. Retrieval
5. Safety policy
Assumptions to keep in mind
- Memory is relevant. Storing everything is not memory, it is hoarding. Retrieval needs relevance, not volume.
- Sensitive data is protected. Memory can become a shadow database. Treat it like one.
Failure modes to notice
- Leaky recall. If retrieval is not filtered, the model can pull in private or unrelated information.
- Stale assumptions. Old memory can override new instructions and create confusing behaviour.
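A permission filter at retrieval time is one mitigation for leaky recall. This sketch uses invented record and user names purely for illustration: the point is that ownership is checked before relevance.

```python
def retrieve(records: list[dict], requesting_user: str, terms: list[str]) -> list[dict]:
    """Scope memory to its owner before matching on relevance, so one
    user's stored facts can never leak into another user's context."""
    visible = [r for r in records if r["owner"] == requesting_user]
    return [r for r in visible if any(t in r["text"] for t in terms)]

memory = [
    {"owner": "alice", "text": "alice prefers dark mode"},
    {"owner": "bob", "text": "bob rotates his api key monthly"},
]

# Alice's query cannot surface Bob's record, even on a matching term.
safe = retrieve(memory, "alice", ["api", "prefers"])
```

Filtering before matching, rather than after, is the design choice that matters: a relevance-first pipeline can still rank private records highly before the filter runs.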
Key terms
- Short-Term Memory
- Information held during a single conversation. Includes the chat history, current task details, and temporary working state. Lost when the session ends.
- Long-Term Memory
- Persistent information that survives across sessions. User preferences, learned facts, and historical interactions. Stored in databases.
- Context Window
- The maximum amount of text an LLM can process at once. For a 128K context window, roughly 96,000 words. Anything beyond this limit is simply not seen by the model.
- Vector Database
- A database that stores information as numerical vectors (lists of numbers). Similar items have similar vectors. This allows semantic search, finding things by meaning, not just keywords.
- Context engineering
- The discipline of designing dynamic systems that provide the right information and tools, in the right format, at the right time, to give an AI everything it needs to accomplish a task. Most agent failures are now context failures, not model failures.
Check yourself
Quick check. Context engineering
What is the core difference between prompt engineering and context engineering?
Prompt engineering focuses on crafting individual prompts. Context engineering focuses on designing dynamic systems that provide the right information and tools at the right time.
Name two context engineering strategies used here
Any two of Write, Select, Compress, or Isolate.
Why does isolation improve both security and performance?
It limits each agent's context to only what it needs, reducing attack surface and keeping the context window focused on relevant information.
Quick check. Memory and context
What is the context window?
The maximum amount of text the model can consider at once. If you exceed it, earlier detail drops out.
Name three types of memory an agent might use
Short term conversation history, long term stored facts or preferences, and external memory such as a vector database or knowledge base.
Scenario. A conversation is too long. What is a sensible strategy?
Summarise the older messages into a short system note, keep the recent messages, and verify important facts before acting.
What does a vector database help with?
Semantic search. Finding similar content by meaning rather than exact keywords.
Artefact and reflection
Artefact
A memory strategy for one realistic agent you want to build.
Reflection
Where in your work would distinguishing between short-term and long-term memory change a decision, and what evidence would make you trust that change?
Optional practice
Assemble context from building blocks, watch token budget allocation, and try the Write, Select, Compress, and Isolate strategies.