Module 18 of 26 · Applied

Data modelling basics

15 min read 3 outcomes Interactive + drag challenge 4 standards cited

By the end of this module you will be able to:

  • Create conceptual and logical data models
  • Apply normalisation rules to reduce redundancy
  • Choose between relational and document models for different use cases

Conceptual, logical, and physical models for three audiences

Data modelling has three levels for three audiences: conceptual, logical, physical.

Conceptual, logical, physical models for three audiences Three cards left to right: Conceptual (entities + relationships, business audience), Logical (attributes + types, architect audience, emphasised), Physical (storage + indexes, DBA audience). Verb arrows decomposes into, implemented as. A red-accent callout names skipping the logical layer as the conceptual-to-physical gap defect. MODELLING LEVELS · CONCEPTUAL -> LOGICAL -> PHYSICAL 1DMBOK 2 §5ConceptualEntities + relationships2DMBOK 2 §5LogicalAttributes + types + keys3ISO 9075PhysicalStorage + indexes + partitions decomposes intoimplemented as Skipping logical leaves a translation gap Conceptual diagrams pasted into the database create overlapping tables and lost relationships. The logical layer is the bridge. ransfordsnotes.com

Data modelling is three levels with three audiences: conceptual (business reviewers), logical (architects), physical (DBA + platform). Each level adds detail at the cost of audience scope. DAMA-DMBOK 2 Chapter 5 names the layers; skipping a layer loses the bridge between business and storage.

Normalisation trade-off across forms and workloads

Normalisation level is a trade-off against workload; OLTP wins from 3NF and OLAP wins from star.

Normalisation trade-off across forms and workloads Five cards left to right: 1NF, 2NF, 3NF (emphasised), BCNF, Denormalised star. Verb arrows step through tightening then loosening. A red-accent callout names the workload-driven choice. NORMALISATION TRADE-OFF · CODD 1971 -> KIMBALL DENORMAL 1NFCodd 1971Atomic columnsNo repeating groups2NFCodd 1971No partial depsOn full key3NFCodd 1971No transitive depsAnomaly-freeBCNFCoddEvery determinantIs a candidate keySTARKimballDenormalised starWide facts + dims +++OR Workload picks the level, not aesthetics OLTP wins from 3NF; OLAP wins from star. Mixing is a database that performs neither well. ransfordsnotes.com

Normalisation is a trade-off: 3NF reduces anomalies but explodes joins for analytics; denormalised stars cut join cost at the price of update anomalies. The right level depends on the workload, not on aesthetics. Codd 1971 defined the normal forms; Kimball + Inmon named the analytics counter.

Deterministic Data course visual for Data modelling basics

Real-world problem · ongoing

A retailer stored customer addresses in 14 different tables. Updating one meant updating 14.

A UK retail chain stored customer address data in 14 separate tables across its order management, CRM, billing, loyalty, and marketing systems. Each system had its own copy of the address. When a customer moved house, each system needed a separate update.

Normalisation, the discipline of reducing redundancy in data models, would have stored the address once and linked it to each system by reference. The previous module covered experimental design. This module covers how to structure data so it stays consistent and efficient.

Every time a customer moved, 14 tables needed updating. Inevitably, some were missed. The result: conflicting addresses, returned mail, and compliance failures. How does data modelling prevent this?

Data modelling is the practice of designing how data is structured, stored, and related. A good model reduces redundancy, enforces consistency, and makes queries efficient. A bad model creates the 14-table address problem: data scattered across copies that inevitably diverge.

With the learning outcomes established, this module begins by examining three levels of data modelling in depth.

18.1 Three levels of data modelling

Data modelling operates at three levels, each serving a different audience:

  1. Conceptual model: shows entities and their relationships in business terms. No technical detail. Audience: business stakeholders. Example: "A Customer places Orders. An Order contains Products."
  2. Logical model: adds attributes, data types, and keys. Still database-independent. Audience: data architects. Example: Customer(customer_id PK, name, email), Order(order_id PK, customer_id FK, order_date).
  3. Physical model: specifies the actual database implementation: table names, column types, indexes, partitioning, storage engine. Audience: database administrators.

A data model is a communication tool. If it cannot be understood by both business users and engineers, it has failed its primary purpose.

DAMA-DMBOK2 (2017) - Chapter 5, Data Modelling and Design

DAMA frames models as communication tools, not technical artefacts. The conceptual model must be readable by a finance director. The physical model must be implementable by a DBA. Both are data models; they serve different conversations.

Common misconception

We use MongoDB so we do not need data modelling.

Document databases still benefit from modelling. Deciding which data to embed in a document versus reference from another collection is a modelling decision. Without it, you end up with inconsistent nested structures, unbounded document growth, and queries that scan entire collections. The trade-offs differ from relational databases, but the need for design does not disappear.

With an understanding of three levels of data modelling in place, the discussion can now turn to normalisation, which builds directly on these foundations.

18.2 Normalisation

Normalisation is the process of organising data to reduce redundancy and dependency. Three normal forms cover most practical needs:

  1. First Normal Form (1NF): every cell contains a single value (no lists, no repeating groups). A cell containing "Red, Blue, Green" violates 1NF.
  2. Second Normal Form (2NF): all non-key attributes depend on the entire primary key, not just part of it. In a table with a composite key (order_id, product_id), the product_name depends only on product_id, not on the full key. Move it to a separate Products table.
  3. Third Normal Form (3NF): no non-key attribute depends on another non-key attribute. If postcode determines city, then city depends on postcode, not on the primary key. Move city to an Addresses table keyed by postcode.

Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.

Bill Kent, 'A Simple Guide to Five Normal Forms in Relational Database Theory' (1983) - Summary of 3NF

Kent's mnemonic (adapted to 'so help me Codd' in tribute to E.F. Codd, the inventor of relational databases) captures the essence of 3NF in one sentence. If an attribute provides a fact about something other than the primary key, it belongs in a different table.

Common misconception

Normalisation should always be applied to the maximum degree.

Over-normalisation creates excessive joins that slow query performance. Analytics workloads often benefit from deliberate denormalisation (star schemas, wide tables) that trades storage space for query speed. The right level of normalisation depends on the workload: transactional systems (many writes) benefit from higher normalisation; analytical systems (many reads, few writes) benefit from denormalisation.

Loading interactive component...
18.3 Check your understanding

A table has columns: order_id, product_id, product_name, product_category, order_date. The primary key is (order_id, product_id). product_name depends only on product_id. Which normal form is violated?

An e-commerce company considers using MongoDB (document database) for its product catalogue. Products have varying attributes: electronics have voltage and warranty fields; clothing has size and material. Which model is more suitable?

A data warehouse team denormalises a star schema by embedding dimension attributes directly into the fact table, creating a wide table with 200 columns. What are the trade-offs?

Loading interactive component...

Key takeaways

  • Data modelling operates at three levels: conceptual (entities and relationships for business), logical (attributes and keys for architects), and physical (implementation details for DBAs).
  • Normalisation (1NF, 2NF, 3NF) reduces redundancy by ensuring each fact is stored once. The Kent mnemonic: every attribute describes 'the key, the whole key, and nothing but the key.'
  • Document databases (MongoDB, DynamoDB) suit data with varying attributes per record (product catalogues, user profiles). Relational databases suit data with consistent structure and complex relationships.
  • Analytics workloads benefit from deliberate denormalisation (star schemas, wide tables) that trades storage for query speed. The right normalisation level depends on the workload pattern.

Standards and sources cited in this module

  1. DAMA-DMBOK2 (2017)

    Chapter 5, Data Modelling and Design

    Industry standard guidance on conceptual, logical, and physical data modelling.

  2. Kent, W. (1983). 'A Simple Guide to Five Normal Forms in Relational Database Theory'

    Full paper

    Source for the 'key, whole key, nothing but the key' mnemonic and accessible normalisation explanation.

  3. Codd, E.F. (1970). 'A Relational Model of Data for Large Shared Data Banks', Communications of the ACM

    Full paper

    Foundational paper establishing the relational model. Every normalisation concept traces back to Codd's original work.

  4. MongoDB Documentation, 'Data Modeling Introduction' (2024)

    Full guide

    Document modelling patterns including embedding vs referencing decisions.

Module 18 of 26 · Applied Data