
Real-world incident · April 2023
In April 2023, Samsung Electronics reported that employees had accidentally leaked sensitive proprietary information via ChatGPT in three separate incidents within twenty days. In the first, an engineer pasted semiconductor equipment measurement data into ChatGPT to ask for help identifying a defect. In the second, an engineer submitted source code from a chip database programme and asked for optimisation suggestions. In the third, an engineer submitted a recording of an internal meeting and asked for a summary.
OpenAI's ChatGPT, at that time, used submitted conversations as training data by default. The proprietary source code, measurement data, and meeting contents entered OpenAI's servers and, depending on the data handling settings in effect, may have been incorporated into model training. Samsung subsequently banned internal use of generative AI tools on company devices while it developed internal guidelines.
The architectural lesson is precise. The engineers needed AI assistance with real proprietary data. The correct architecture for this use case is a retrieval-augmented generation (RAG) pipeline backed by an internally hosted vector database: proprietary documents are stored and searched within the organisation's own infrastructure, and only the retrieved relevant excerpts enter the context window sent to an external model. The proprietary code never leaves the building.
The engineers were trying to solve a legitimate problem. What does this incident reveal about the relationship between the context window and data governance, and what architectural choice would have kept the proprietary code inside the organisation?
Tools let agents act; memory lets them remember. This module covers the strategies agents use to maintain context across turns - from simple message history to vector-based semantic retrieval - and the security implications of each approach.
This module begins by examining three kinds of agent memory in depth.
An LLM's context window is its working memory. It holds the current conversation, tool results, and any documents injected directly. But it is finite, expensive per token, and cleared between sessions. A customer support agent that cannot recall a customer's issue from three days ago, or a research assistant that re-reads the same documents on every query, is not useful in production.
Agent memory divides into three types, each with different characteristics. In-context memory is everything currently in the context window: immediately accessible, no retrieval step needed, but limited by the context window size (8K to 1M tokens depending on model) and cleared between sessions. External memory is structured storage outside the model, such as relational databases or key-value stores, accessed via tool calls. Appropriate for customer records, conversation history between sessions, and configuration data queried predictably. Semantic memory is a vector database that stores and searches data by meaning rather than by exact match. Appropriate for large document corpora where the relevant content cannot be predicted from the query text alone.
Choosing the wrong memory type for a task is one of the most common architectural errors in agent design. Using in-context memory for a 50,000-document corpus is impossible. Using a vector database for a customer ID lookup is unnecessarily slow and expensive.
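To make the distinction concrete, the sketch below contrasts the two non-context memory types in Python. The table, column names, and function are illustrative rather than taken from any particular framework; what matters is the access pattern.

```python
# Illustrative sketch: matching the memory type to the access pattern.
import sqlite3

def lookup_customer(conn: sqlite3.Connection, customer_id: str):
    """External memory: a customer ID lookup is an exact-key query against a
    relational store, exposed to the agent as an ordinary tool call.
    Deterministic, cheap, and no embeddings involved."""
    return conn.execute(
        "SELECT id, name, plan FROM customers WHERE id = ?",
        (customer_id,),
    ).fetchone()

# Semantic memory is the opposite case: a 50,000-document corpus searched by
# meaning in a vector database, as covered in the RAG pipeline later in this
# module. Using a vector database for the lookup above adds latency and cost
# for no benefit.
```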
With the three kinds of agent memory in place, the discussion can now turn to embeddings and semantic search, which build directly on those foundations.
An embedding is a numerical vector representation of text. Texts with similar meanings have vectors that are close together in high-dimensional space. An embedding model, such as OpenAI's text-embedding-3-small or Voyage AI's voyage-3, converts text into these vectors. Given a query, a vector database returns the documents whose vectors are closest to the query vector, measured by cosine similarity or dot product.
This is what makes semantic search different from keyword search. The query "How do I cancel my subscription?" and the document section "Account Termination: To close your account, navigate to Settings" share no identical words, but their embedding vectors will be close in semantic space. A keyword search would find no match. A vector search will rank the document highly.
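A minimal sketch, assuming the openai and numpy packages and an API key in the environment, makes this concrete: it embeds the query and the document section with text-embedding-3-small and computes their cosine similarity.

```python
# Minimal sketch: semantic similarity between two texts that share no keywords.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with text-embedding-3-small (1,536 dimensions)."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

query = "How do I cancel my subscription?"
section = "Account Termination: To close your account, navigate to Settings"
q_vec, d_vec = embed([query, section])

# Cosine similarity is high for semantically related texts even without any
# shared keywords; a keyword match on these two strings finds nothing.
similarity = np.dot(q_vec, d_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec))
print(f"cosine similarity: {similarity:.3f}")
```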
Embedding models vary in dimension (the length of the vector), cost, and domain specialisation. OpenAI's text-embedding-3-small produces 1,536-dimension vectors at low cost and performs well across general use cases. Voyage AI's voyage-3 produces 1,024-dimension vectors, and the Voyage family includes code-specialised variants for source code and technical content. Cohere's embed-multilingual-v3.0 is strong for multilingual deployments. Matching the embedding model to the domain of your corpus meaningfully improves retrieval accuracy.
“We propose retrieval-augmented generation (RAG) for knowledge-intensive NLP tasks. RAG models combine pre-trained parametric and non-parametric memory for language generation.”
Lewis, P. et al., 2020 - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020
This is the foundational paper establishing RAG as an architecture. The key insight is that LLMs have two kinds of knowledge: parametric (baked into weights during training) and non-parametric (retrieved at inference time). RAG combines both, letting a model give accurate, current answers by retrieving specific documents rather than relying on potentially outdated training data.
With an understanding of embeddings and semantic search in place, the discussion can now turn to the stages of the RAG pipeline, which build directly on these foundations.
A retrieval-augmented generation (RAG) pipeline has two distinct phases: ingestion and retrieval. Ingestion runs once (and again whenever the document corpus changes). Retrieval runs on every query.
During ingestion, documents are split into chunks of roughly 200 to 500 tokens, with overlap between adjacent chunks to preserve context at boundaries. Each chunk is converted to an embedding vector using an embedding model. The vectors and their associated text chunks are stored in a vector database such as Pinecone (for production scale) or Chroma (for local development).
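A minimal ingestion sketch using Chroma for local development might look like the following. The chunk size, overlap, file name, and collection name are illustrative, and a word count stands in for a token count; Chroma embeds each chunk with its default embedding model unless another is configured.

```python
# Ingestion sketch: chunk documents with overlap, then store them in Chroma.
import chromadb

def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Fixed-size chunking by whitespace-separated words, with overlap so that
    content near a boundary appears in both neighbouring chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

client = chromadb.PersistentClient(path="./rag_store")
collection = client.get_or_create_collection("policies")

documents = {"refund-policy": open("refund_policy.txt").read()}  # illustrative corpus
for doc_id, text in documents.items():
    pieces = chunk(text)
    collection.add(
        documents=pieces,
        ids=[f"{doc_id}-{i}" for i in range(len(pieces))],
        metadatas=[{"source": doc_id} for _ in pieces],
    )
# Chroma computes an embedding vector for each chunk at insert time.
```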
During retrieval, the user query is embedded using the same model used during ingestion. The vector database performs an approximate nearest-neighbour search and returns the top-k most semantically similar chunks. These chunks are injected into the context window as supporting documents, and the LLM (large language model) generates a response grounded in that retrieved content.
The choice of top-k is a precision-recall trade-off. Top-1 retrieval maximises precision: the most relevant document is returned, but if the query is ambiguous or the most relevant document is not the top result, accuracy suffers. Top-5 or top-10 retrieval increases recall: more relevant documents are likely to be included, but the context window fills faster, cost increases, and the model must reason over more potentially irrelevant content.
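A matching retrieval sketch, reopening the Chroma collection built during ingestion, shows where top-k enters as the n_results parameter; the chat model and prompt wording are illustrative.

```python
# Retrieval sketch: embed the query, fetch the top-k chunks, ground the answer.
import chromadb
from openai import OpenAI

collection = chromadb.PersistentClient(path="./rag_store").get_or_create_collection("policies")
llm = OpenAI()

def answer(question: str, k: int = 5) -> str:
    # Approximate nearest-neighbour search; k is the precision-recall knob.
    results = collection.query(query_texts=[question], n_results=k)
    retrieved = results["documents"][0]  # top-k chunks for this single query

    # Inject the retrieved chunks into the context window as grounding material.
    context = "\n\n---\n\n".join(retrieved)
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Answer using only the provided excerpts."},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```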
“The key to effective RAG is not the retrieval itself but the quality of the chunks. Poor chunking strategies cause the most relevant information to span a chunk boundary and be split between two results.”
Pinecone documentation, 2024 - docs.pinecone.io, RAG best practices: chunking strategies
Chunking is where most RAG pipelines fail in practice. If a policy paragraph is split mid-sentence, neither chunk contains the complete relevant information. Overlapping chunks (where the end of one chunk repeats the beginning of the next) mitigate this. Semantic chunking, which splits at natural semantic boundaries rather than at fixed token counts, produces the best retrieval accuracy but is more expensive to implement.
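The ingestion sketch above used fixed-size chunks with overlap. A boundary-aware alternative, sketched below under the simplifying assumption that paragraphs are separated by blank lines, packs whole paragraphs into each chunk so no paragraph is cut mid-sentence; fuller semantic chunkers often use embedding similarity between adjacent sentences to choose split points.

```python
# Boundary-aware chunking sketch: split on paragraph breaks and pack whole
# paragraphs into chunks. A word budget stands in for a token budget.
def chunk_by_paragraph(text: str, size: int = 400) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and current_len + words > size:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```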
With the stages of the RAG pipeline in place, the discussion can now turn to conversation history strategies, which build directly on these foundations.
Long conversations exhaust context windows. Three strategies manage this, with different trade-offs. Full history appends every message. Simple to implement, but eventually hits the context limit. Appropriate only for short, task-focused conversations. Sliding window keeps only the last N messages, always including the system prompt. Fast and cheap, but loses early context. Appropriate when early context is not needed for later decisions. Summarisation compresses older messages into a summary paragraph when the conversation grows long, then keeps only the summary and the most recent N messages verbatim.
Summarisation preserves more information than a sliding window but introduces loss. Critical constraints stated early in a conversation, such as "use Go, not Python," may be omitted from the summary. Always keep the last N messages verbatim to preserve recent context, and summarise only older sections. Never summarise the system prompt.
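A minimal sketch of the sliding-window and summarisation strategies follows. The messages use the common role/content convention, and summarise stands in for an LLM call that compresses older messages into a paragraph; both are assumptions for illustration.

```python
# Conversation history sketches. messages[0] is assumed to be the system prompt.
def sliding_window(messages: list[dict], n: int = 20) -> list[dict]:
    """Keep the system prompt plus the last n messages verbatim; early context is lost."""
    system, rest = messages[:1], messages[1:]
    return system + rest[-n:]

def summarised_history(messages: list[dict], summarise, n: int = 10) -> list[dict]:
    """Compress older messages into one summary; keep the system prompt and the
    last n messages verbatim. The system prompt itself is never summarised."""
    system, rest = messages[:1], messages[1:]
    if len(rest) <= n:
        return messages
    older, recent = rest[:-n], rest[-n:]
    summary = summarise(older)  # e.g. one LLM call over the older messages
    return system + [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent
```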
Common misconception
“A larger context window eliminates the need for external memory.”
Context window size affects what can be held in one session. It does not eliminate the need for persistence between sessions, access control over sensitive documents, or the ability to search a corpus larger than any single context window. A 1-million-token context window can hold roughly 750,000 words, or about 1,500 pages. A legal firm with 50,000 case files representing millions of pages cannot fit that in a context window. RAG addresses scale; context window size addresses convenience for medium-length tasks.
Common misconception
“RAG is always more accurate than using the model's training knowledge.”
RAG retrieval accuracy depends on chunking quality, embedding model quality, and query clarity. An ambiguous query may retrieve irrelevant documents, causing the model to generate an incorrect answer grounded in wrong content. A confident, well-supported answer from the model's training knowledge may be more accurate than an answer grounded in a poorly retrieved chunk. RAG improves accuracy for specific, up-to-date domain knowledge; it does not universally outperform parametric knowledge. Evaluate retrieval quality separately from generation quality.
You are building a legal document assistant for a law firm with 50,000 case history documents. A user asks about a specific case from four years ago. Which memory type is most appropriate for retrieving the relevant documents?
A user asks 'what happened with the Johnson case?' and the agent retrieves a document about 'Johnson and Johnson v FDA'. This is not the right case. What is the most likely architectural cause?
Given the Samsung ChatGPT incident, which architectural change would have allowed engineers to get AI assistance with proprietary code while keeping that code inside the organisation?
Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
NeurIPS 2020
Foundational RAG paper establishing the architecture of combining parametric and non-parametric memory. Cited in Section 8.2.
Chroma documentation - docs.trychroma.com
Open-source vector database; the simplest starting point for local RAG development. Referenced in Section 8.3 as the recommended tool for local development.
Pinecone documentation - docs.pinecone.io, RAG best practices
Production-scale vector database. Quoted in Section 8.3 for its guidance on chunking strategy quality as the primary determinant of RAG accuracy.
OpenAI embeddings guide - platform.openai.com/docs/guides/embeddings
Practical reference for embedding models, dimension selection, and use cases. Cited in Section 8.2 for the embedding model comparison.
ISO/IEC 42001:2023, Artificial Intelligence Management Systems
Clause 8.4, Data lifecycle management
The ISO standard for AI management systems. Referenced in Section 8.4 for data governance and retention requirements for agents that store user data.
Module 8 of 25 · Core Concepts