Module 18 of 26 · Applied

Data modelling basics

15 min read 3 outcomes Interactive + drag challenge 4 standards cited

By the end of this module you will be able to:

Create conceptual and logical data models
Apply normalisation rules to reduce redundancy
Choose between relational and document models for different use cases

Whiteboard diagram showing entity relationships, the foundation of data modelling

Real-world problem · ongoing

A retailer stored customer addresses in 14 different tables. Updating one meant updating 14.

A UK retail chain stored customer address data in 14 separate tables across its order management, CRM, billing, loyalty, and marketing systems. Each system had its own copy of the address. When a customer moved house, each system needed a separate update.

Normalisation, the discipline of reducing redundancy in data models, would have stored the address once and linked it to each system by reference. The previous module covered experimental design. This module covers how to structure data so it stays consistent and efficient.

Every time a customer moved, 14 tables needed updating. Inevitably, some were missed. The result: conflicting addresses, returned mail, and compliance failures. How does data modelling prevent this?

Data modelling is the practice of designing how data is structured, stored, and related. A good model reduces redundancy, enforces consistency, and makes queries efficient. A bad model creates the 14-table address problem: data scattered across copies that inevitably diverge.

With the learning outcomes established, this module begins by examining three levels of data modelling in depth.

18.1 Three levels of data modelling

Data modelling operates at three levels, each serving a different audience:

Conceptual model: shows entities and their relationships in business terms. No technical detail. Audience: business stakeholders. Example: "A Customer places Orders. An Order contains Products."
Logical model: adds attributes, data types, and keys. Still database-independent. Audience: data architects. Example: Customer(customer_id PK, name, email), Order(order_id PK, customer_id FK, order_date).
Physical model: specifies the actual database implementation: table names, column types, indexes, partitioning, storage engine. Audience: database administrators.

“A data model is a communication tool. If it cannot be understood by both business users and engineers, it has failed its primary purpose.”
DAMA-DMBOK2 (2017) - Chapter 5, Data Modelling and Design
DAMA frames models as communication tools, not technical artefacts. The conceptual model must be readable by a finance director. The physical model must be implementable by a DBA. Both are data models; they serve different conversations.

Common misconception

“We use MongoDB so we do not need data modelling.”

Document databases still benefit from modelling. Deciding which data to embed in a document versus reference from another collection is a modelling decision. Without it, you end up with inconsistent nested structures, unbounded document growth, and queries that scan entire collections. The trade-offs differ from relational databases, but the need for design does not disappear.

With an understanding of three levels of data modelling in place, the discussion can now turn to normalisation, which builds directly on these foundations.

Entity relationship diagram printed on paper showing normalised database design — Data models structure how information is stored and related. Normalised models store each fact once; denormalised models optimise for read performance at the cost of redundancy.

18.2 Normalisation

Normalisation is the process of organising data to reduce redundancy and dependency. Three normal forms cover most practical needs:

First Normal Form (1NF): every cell contains a single value (no lists, no repeating groups). A cell containing "Red, Blue, Green" violates 1NF.
Second Normal Form (2NF): all non-key attributes depend on the entire primary key, not just part of it. In a table with a composite key (order_id, product_id), the product_name depends only on product_id, not on the full key. Move it to a separate Products table.
Third Normal Form (3NF): no non-key attribute depends on another non-key attribute. If postcode determines city, then city depends on postcode, not on the primary key. Move city to an Addresses table keyed by postcode.

“Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.”
Bill Kent, 'A Simple Guide to Five Normal Forms in Relational Database Theory' (1983) - Summary of 3NF
Kent's mnemonic (adapted to 'so help me Codd' in tribute to E.F. Codd, the inventor of relational databases) captures the essence of 3NF in one sentence. If an attribute provides a fact about something other than the primary key, it belongs in a different table.

Common misconception

“Normalisation should always be applied to the maximum degree.”

Over-normalisation creates excessive joins that slow query performance. Analytics workloads often benefit from deliberate denormalisation (star schemas, wide tables) that trades storage space for query speed. The right level of normalisation depends on the workload: transactional systems (many writes) benefit from higher normalisation; analytical systems (many reads, few writes) benefit from denormalisation.

Database table structures shown in a design tool, representing normalisation and schema design — Normalisation reduces redundancy in transactional databases. Analytics warehouses often deliberately denormalise for query performance, accepting storage cost for speed.

Loading interactive component...

18.3 Check your understanding

A table has columns: order_id, product_id, product_name, product_category, order_date. The primary key is (order_id, product_id). product_name depends only on product_id. Which normal form is violated?

An e-commerce company considers using MongoDB (document database) for its product catalogue. Products have varying attributes: electronics have voltage and warranty fields; clothing has size and material. Which model is more suitable?

A data warehouse team denormalises a star schema by embedding dimension attributes directly into the fact table, creating a wide table with 200 columns. What are the trade-offs?

Loading interactive component...

Key takeaways

Data modelling operates at three levels: conceptual (entities and relationships for business), logical (attributes and keys for architects), and physical (implementation details for DBAs).
Normalisation (1NF, 2NF, 3NF) reduces redundancy by ensuring each fact is stored once. The Kent mnemonic: every attribute describes 'the key, the whole key, and nothing but the key.'
Document databases (MongoDB, DynamoDB) suit data with varying attributes per record (product catalogues, user profiles). Relational databases suit data with consistent structure and complex relationships.
Analytics workloads benefit from deliberate denormalisation (star schemas, wide tables) that trades storage for query speed. The right normalisation level depends on the workload pattern.

Standards and sources cited in this module

DAMA-DMBOK2 (2017)
Chapter 5, Data Modelling and Design
Industry standard guidance on conceptual, logical, and physical data modelling.
Kent, W. (1983). 'A Simple Guide to Five Normal Forms in Relational Database Theory'
Full paper
Source for the 'key, whole key, nothing but the key' mnemonic and accessible normalisation explanation.
Codd, E.F. (1970). 'A Relational Model of Data for Large Shared Data Banks', Communications of the ACM
Full paper
Foundational paper establishing the relational model. Every normalisation concept traces back to Codd's original work.
MongoDB Documentation, 'Data Modeling Introduction' (2024)
Full guide
Document modelling patterns including embedding vs referencing decisions.

Back: Inference and experiments Next: Data as a product

Module 18 of 26 · Applied Data