Data Foundations · Module 4

Data representation and formats

Computers store everything using bits (binary digits) because hardware can reliably tell two states apart.

22 min · 4 outcomes



Why this matters

If any layer is unclear, teams will disagree while using the same data.

What you will be able to do

  1. Explain data representation and formats in your own words and apply them to a realistic scenario.
  2. Separate meaning from storage so that data is easier to share and harder to misread.
  3. Check the assumption "Meaning is written down" and explain what changes if it is false.
  4. Check the assumption "Encoding and format are explicit" and explain what changes if it is false.

Before you begin

  • No previous technical background required
  • Read the section explanation before using tools

Common ways people get this wrong

  • Schema drift. A field changes meaning over time. Dashboards keep working but the story becomes wrong.
  • Exports that lose meaning. A spreadsheet export without definitions turns data back into guesswork. The numbers travel, the meaning does not.

Main idea at a glance

Diagram. Stage 1: Contextual

Defines why the data exists and what it is for. Who relies on it and what outcomes does it support?

I think this step gets skipped too often. If you cannot answer why this data exists, the rest is theatre.

Computers store everything using bits (binary digits) because hardware can reliably tell two states apart. A byte is eight bits, which can represent 256 distinct values. An encoding maps symbols to numbers, while a file format adds structure on top. CSV is plain text with commas between fields, JSON wraps name-value pairs, XML uses nested tags, images store grids of pixels, and audio stores samples of a waveform. The wrong format or encoding breaks systems because the receiver cannot recover what the sender intended.

Think of representation in four layers. Each layer must stay consistent or the meaning collapses.

Four representation layers

If any layer is unclear, teams will disagree while using the same data.

  1. Contextual layer

    Defines scope and purpose, including who relies on the data and why it matters.

  2. Conceptual or semantic layer

    Defines what the data represents, such as a temperature reading and its unit.

  3. Logical layer

    Defines structure and schema, including fields, types, and allowed ranges.

  4. Physical layer

    Defines storage form, such as JSON in files or rows in a database.

A byte can represent the values 0 to 255. Powers of two help size things: $2^3 = 8$ means three binary places can represent eight values. In plain English: 2 × 2 × 2 equals eight. Binary choices stack quickly.

| Item | Meaning |
| --- | --- |
| Bit | Smallest unit, either 0 or 1 |
| Byte | 8 bits, often one character in simple encodings |
| $2^n$ | Number of combinations with $n$ bits |
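To make the counting concrete, here is a minimal Python sketch. The function name `pattern_count` is made up for this illustration; it simply raises two to the number of bits.

```python
def pattern_count(n_bits: int) -> int:
    """Return how many distinct bit patterns n_bits can hold (2 to the power n_bits)."""
    if n_bits < 0:
        raise ValueError("n_bits must be non-negative")
    return 2 ** n_bits


# Each extra bit doubles the number of patterns.
for n in (1, 3, 8, 16):
    count = pattern_count(n)
    print(f"{n} bits -> {count} patterns (values 0 to {count - 1})")
```

Three bits give 8 patterns and eight bits give 256, which matches the byte range of 0 to 255.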

Worked example. The same text, different bytes


Here is a simple truth that causes surprising damage in real systems: the same characters can be stored as different bytes depending on encoding. If one system writes text as UTF-8 and another reads it as something else, the data is not “slightly wrong”. It is wrong.

My opinion: if your system depends on humans “remembering” encodings, it is already broken. It should be explicit in the interface contract and tested like any other behaviour.
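A small Python sketch of the worked example. The sample name `Zoë` and the choice of Latin-1 as the "something else" are assumptions made for illustration; any pair of mismatched encodings would show the same effect.

```python
text = "Zoë"  # hypothetical sample value

# The same three characters become different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(utf8_bytes.hex(" "))    # 5a 6f c3 ab
print(latin1_bytes.hex(" "))  # 5a 6f eb

# A reader that assumes Latin-1 for UTF-8 bytes gets different characters, with no error raised.
misread = utf8_bytes.decode("latin-1")
print(misread)                # ZoÃ«

# Going the other way fails loudly, which is the better failure to have.
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
print(strict_decode_failed)   # True
```

The silent case is the dangerous one: the program keeps running and the corrupted name is stored as if it were correct.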

Common mistakes (and how to avoid them)

Common mistake

Assuming text is text

Reality: Text without an explicit encoding is an accident waiting to happen. Always specify UTF-8 in file exports and API contracts.
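One way to follow this advice in Python is to pass `encoding` on every read and write instead of relying on the platform default. This is a minimal sketch; the file name `export.csv` and the sample rows are hypothetical.

```python
import tempfile
from pathlib import Path

rows = "name,city\nZoë,Kraków\n"  # hypothetical export content

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "export.csv"

    # State the encoding on the way out...
    path.write_text(rows, encoding="utf-8")

    # ...and again on the way in, so both sides agree.
    round_trip = path.read_text(encoding="utf-8")

print(round_trip == rows)
```

Leaving `encoding` out makes the result depend on the machine's settings, which is exactly the kind of unwritten agreement this module warns about.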

Common mistake

Commas inside CSV fields without quoting

Reality: A name like "Smith, John" will break your parser if fields are not properly quoted. Use a real CSV library, not string splitting.
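The difference between string splitting and a real CSV library can be sketched with Python's standard `csv` module. The second column and its value are made up for the example.

```python
import csv
import io

rows = [["name", "city"], ["Smith, John", "Leeds"]]  # "Leeds" is a hypothetical value

# The csv writer adds quotes around any field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

data_line = csv_text.splitlines()[1]

# Naive splitting cuts the name in half and yields three pieces.
naive = data_line.split(",")
print(len(naive), naive)

# The csv reader respects the quotes and yields two fields.
parsed = next(csv.reader([data_line]))
print(len(parsed), parsed)
```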

Common mistake

Treating JSON property order as meaningful

Reality: JSON objects are unordered by specification. If your parser depends on property order, it will break when a different serialiser produces the same valid JSON.
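A short Python sketch of why access by name is safe and access by position is not. The field names and values are hypothetical.

```python
import json

# Two serialisers emit the same record with properties in a different order.
a = '{"id": "007", "unit": "C", "value": 21.5}'
b = '{"value": 21.5, "id": "007", "unit": "C"}'

record_a = json.loads(a)
record_b = json.loads(b)

same_text = a == b                   # False: the strings differ
same_record = record_a == record_b   # True: the parsed records are equal

# Access by name works for either ordering.
value_by_name = record_b["value"]

# Access by position is fragile: the third item of b is the unit, not the value.
value_by_position = list(record_b.values())[2]

print(same_text, same_record, value_by_name, value_by_position)
```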

Common mistake

Storing identifiers as numbers

Reality: Postcodes, meter IDs, and account numbers often have leading zeros. Store them as strings. If it is an identifier and not a quantity you do maths on, it should be text.
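A minimal Python illustration of what is lost when an identifier passes through a numeric type. The meter ID below is invented for the example.

```python
meter_id = "00731"  # hypothetical identifier with leading zeros

# Converting to a number silently drops the leading zeros.
as_number = int(meter_id)
restored = str(as_number)

print(as_number)             # 731
print(restored == meter_id)  # False

# Padding can only repair the value if the original width is known,
# which is itself a piece of meaning that has to be written down.
padded = restored.zfill(5)
print(padded == meter_id)    # True
```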

Verification. How you know you understood it

Representation verification drill

Use this to confirm the concept is clear, not memorised.

  1. Explain symbol, number, and bits

    Use the tool to explain the difference between `"A"`, `65`, and `01000001`.

  2. Write one format contract

    Pick a real format and list delimiters, quoting rules, schema expectations, and metadata.

  3. Explain binary in plain English

    Write one paragraph on why binary is representation, not meaning.
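If the interactive tool is not to hand, the first step of the drill can be approximated in a few lines of Python. This is a sketch, not the course tool; the function name `describe` is an assumption, and UTF-8 is assumed as the encoding.

```python
def describe(text: str, encoding: str = "utf-8") -> list[tuple[str, int, list[str]]]:
    """For each character, return the symbol, its code point, and its bytes as bit strings."""
    rows = []
    for ch in text:
        code_point = ord(ch)  # symbol -> number
        bits = [format(byte, "08b") for byte in ch.encode(encoding)]  # number -> stored bits
        rows.append((ch, code_point, bits))
    return rows


for symbol, number, bits in describe("Aé"):
    print(repr(symbol), number, " ".join(bits))
```

For `"A"` this prints the symbol, 65, and `01000001`. For `"é"` the code point is 233 but UTF-8 stores it as two bytes, which is a useful reminder that the number and the stored bits are separate layers.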

Maths ladder (optional). From intuition to rigour


You can learn data without advanced maths, but you cannot become an expert without eventually becoming comfortable with symbols. The goal here is not to show off. It is to make the symbols friendly and precise.

Foundations. Powers of two and counting possibilities

If a system has $n$ bits, each bit has two possible states (0 or 1). The total number of possible bit patterns is:

$$N = 2^n$$

  • $n$: number of bits (an integer)

  • $N$: number of distinct patterns (how many different values you can represent)

Example: $n = 8$ (one byte). Then $N = 2^8 = 256$. So a byte can represent 256 distinct values, typically 0 to 255.

Next step. Base conversion and why it matters for data

A binary number is a sum of powers of two. If you see a pattern such as $01000001_2$, the 1s mark which powers are included:

$$01000001_2 = 2^6 + 2^0$$

That equals $64 + 1 = 65$. Why it matters: when data gets corrupted at the byte level (bad encoding, wrong parsing, truncation), the meaning built on those bytes is gone. You cannot “fix it later” reliably because you do not know what the original bits were meant to represent.
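The sum-of-powers idea can be checked with a short Python sketch. The function name `binary_to_int` is made up for this example, and the pattern `01000001` is the one used in the verification drill.

```python
def binary_to_int(bits: str) -> int:
    """Add up the powers of two marked by the 1s in a binary string."""
    total = 0
    # Walk from the rightmost digit, which stands for 2 to the power 0.
    for position, digit in enumerate(reversed(bits)):
        if digit not in ("0", "1"):
            raise ValueError(f"not a binary digit: {digit!r}")
        if digit == "1":
            total += 2 ** position
    return total


print(binary_to_int("01000001"))  # 65, from 2**6 + 2**0
print(int("01000001", 2))         # 65, the built-in conversion agrees
```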

Deeper. Information content (intuition)

The less predictable something is, the more information it carries. If a value is always the same, it carries no surprise. A common formal measure is entropy. In the simplest discrete case:

$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$

  • $X$: a random variable (the thing that can take different values)

  • $x$: a particular value of $X$

  • $p(x)$: probability that $X = x$

  • $H(X)$: entropy in bits

Example: a fair coin has $p(\text{heads}) = 0.5$, $p(\text{tails}) = 0.5$. Then $H(X) = 1$ bit. A biased coin has less. Why it matters in data: highly predictable fields can still be important (for joining and identifiers), but they often carry little information for modelling. This is one reason “more columns” is not the same as “more value”.
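The coin examples can be worked through in Python. This is a minimal sketch of discrete entropy in bits; the function name `entropy_bits` and the 0.9/0.1 bias are assumptions chosen for illustration.

```python
import math


def entropy_bits(probabilities: list[float]) -> float:
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    # p * log2(1 / p) is the same as -p * log2(p); outcomes with p = 0 contribute nothing.
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


fair = entropy_bits([0.5, 0.5])    # 1.0 bit
biased = entropy_bits([0.9, 0.1])  # roughly 0.47 bits
constant = entropy_bits([1.0])     # 0.0 bits: a value that never changes carries no surprise

print(fair, round(biased, 3), constant)
```

A column that holds the same value in every row behaves like the last case: it may be needed for joins, but it tells a model nothing.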

Mental model

Four layers that keep meaning intact

If you separate meaning from storage, you make data easier to share and harder to misread.

  1. Context (scope and purpose)

  2. Meaning (what it represents)

  3. Schema (fields and rules)

  4. Storage (files and tables)

Assumptions to keep in mind

  • Meaning is written down. Units, time zones, identifiers, and definitions should live with the data, not in somebody’s memory.
  • Encoding and format are explicit. If one system writes UTF‑8 and another assumes a legacy encoding, text corruption looks like a mystery but it is just missing agreement.


Check yourself

Quick check. Representation and formats


What is a bit?

The smallest unit of data: a single binary digit, either 0 or 1.

What does encoding do?

Maps symbols to numbers so systems can store and transmit meaning.

Scenario. A colleague opens a CSV and names look corrupted (odd symbols). What is the likely cause?

An encoding mismatch. The file was saved with one encoding but opened as another (for example UTF‑8 vs a legacy encoding).

What is CSV?

Plain text data separated by commas.

Scenario. When would you pick JSON over CSV?

When you need nested structure (objects inside objects) or explicit field names that travel with the data.

Scenario. A dataset has leading zeros in IDs but Excel keeps removing them. What should you do?

Treat IDs as text, not numbers, and use a schema or import settings that preserve formatting. This is a representation choice, not a maths problem.

Why does binary suit computers?

Hardware can reliably distinguish two states, which makes storage and error handling easier.

Artefact and reflection

Artefact

A short module note with one key definition and one practical example

Reflection

Where in your work would being able to explain data representation and formats in your own words, and apply it to a realistic scenario, change a decision, and what evidence would make you trust that change?

Optional practice

Type text and see characters turn into numbers and bits.

Source DAMA DMBOK 2 (Data Management Body of Knowledge, 2nd Edition)
Source ISO/IEC 11179 metadata registries
Source ISO/IEC 27701:2025 privacy information management
Source ICO data protection principles and UK GDPR guidance