Data Foundations · Module 4
Data representation and formats
Computers store everything using bits (binary digits) because hardware can reliably tell two states apart.
Why this matters
If any layer is unclear, teams will disagree while using the same data.
What you will be able to do
- Explain data representation and formats in your own words and apply them to a realistic scenario.
- Explain why separating meaning from storage makes data easier to share and harder to misread.
- Check the assumption "Meaning is written down" and explain what changes if it is false.
- Check the assumption "Encoding and format are explicit" and explain what changes if it is false.
Before you begin
- No previous technical background required
- Read the section explanation before using tools
Common ways people get this wrong
- Schema drift. A field changes meaning over time. Dashboards keep working but the story becomes wrong.
- Exports that lose meaning. A spreadsheet export without definitions turns data back into guesswork. The numbers travel, the meaning does not.
Main idea at a glance
[Diagram: Stage 1, Contextual]
Defines why the data exists and what it is for. Who relies on it and what outcomes does it support?
I think this step gets skipped too often. If you cannot answer why this data exists, the rest is theatre.
Computers store everything using bits (binary digits) because hardware can reliably tell two states apart. A byte is eight bits, which can represent 256 distinct values. Encoding maps symbols to numbers, while a file format adds structure on top. CSV is plain text with comma-separated fields, JSON holds name-value pairs, XML uses nested tags, images store grids of pixels, and audio stores wave samples. The wrong format or encoding breaks systems because the receiver cannot recover what the sender intended.
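To make the split between format and encoding concrete, here is a minimal sketch that writes one record as CSV and as JSON, then turns each into bytes. The record and its field names (`meter_id`, `reading_c`, `site`) are invented for illustration only.

```python
import csv
import io
import json

# Hypothetical record, used only for illustration.
record = {"meter_id": "00421", "reading_c": 21.5, "site": "Leeds"}

# CSV: plain text, a header row, then one row per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()

# JSON: name-value pairs, so field names travel with every record.
json_text = json.dumps(record)

# Format gives structure; encoding is the separate step that produces bytes.
csv_bytes = csv_text.encode("utf-8")
json_bytes = json_text.encode("utf-8")

# Reading back: JSON keeps the number as a number, CSV returns text for every field.
from_json = json.loads(json_bytes.decode("utf-8"))
from_csv = next(csv.DictReader(io.StringIO(csv_bytes.decode("utf-8"))))

print(csv_text)
print(json_text)
print(type(from_json["reading_c"]).__name__, type(from_csv["reading_c"]).__name__)
```

The round trip shows one practical difference: CSV carries no type information, so the receiver has to be told (or has to guess) that `reading_c` is a number.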
Think of representation in four layers. Each layer must stay consistent or the meaning collapses.
Four representation layers
If any layer is unclear, teams will disagree while using the same data.
- Contextual layer. Defines scope and purpose, including who relies on the data and why it matters.
- Conceptual or semantic layer. Defines what the data represents, such as a temperature reading and its unit.
- Logical layer. Defines structure and schema, including fields, types, and allowed ranges.
- Physical layer. Defines storage form, such as JSON in files or rows in a database.
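One way to see the four layers side by side is a small sketch for the temperature example. Everything named here is an assumption made for illustration: the `TemperatureReading` class, its fields, the heating-report purpose, and the allowed range are not taken from any real system.

```python
import json
from dataclasses import asdict, dataclass

# Contextual layer (hypothetical): readings feed a building heating report.
# Conceptual layer: one reading is an air temperature in degrees Celsius.


@dataclass(frozen=True)
class TemperatureReading:
    # Logical layer: fields, types, and an allowed range.
    sensor_id: str    # identifier, kept as text
    celsius: float    # the unit is part of the field name
    recorded_at: str  # ISO 8601 timestamp with an explicit offset

    def __post_init__(self) -> None:
        # Illustrative range check; the limits are an assumption.
        if not -90.0 <= self.celsius <= 60.0:
            raise ValueError(f"celsius out of range: {self.celsius}")


reading = TemperatureReading("0042", 21.5, "2024-01-15T09:30:00+00:00")

# Physical layer: JSON text stored as UTF-8 bytes.
stored = json.dumps(asdict(reading)).encode("utf-8")
print(stored)
```

The point of the sketch is that each layer is written down somewhere a reader can find it, rather than living in someone's head.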
A byte can represent 0 to 255. Powers of two help size things: $2^3 = 8$ means three binary places can represent eight values. Plain English: three twos multiplied together equal eight. Binary choices stack quickly.
| Item | Meaning |
| --- | --- |
| Bit | Smallest unit, either 0 or 1 |
| Byte | 8 bits, often one character in simple encodings |
| $2^n$ | Number of combinations with $n$ bits |
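A short sketch can confirm the counting. The helper name `distinct_values` is made up for this example.

```python
import itertools


def distinct_values(n_bits: int) -> int:
    """Number of distinct patterns that n_bits binary digits can hold."""
    return 2 ** n_bits


# Every pattern of three bits, listed explicitly.
three_bit_patterns = ["".join(bits) for bits in itertools.product("01", repeat=3)]

print(three_bit_patterns)       # eight patterns, from '000' to '111'
print(distinct_values(8))       # 256 patterns in one byte
print(distinct_values(8) - 1)   # 255, the largest value when counting from 0
```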
Worked example. The same text, different bytes
Here is a simple truth that causes surprising damage in real systems: the same characters can be stored as different bytes depending on encoding. If one system writes text as UTF-8 and another reads it as something else, the data is not “slightly wrong”. It is wrong.
My opinion: if your system depends on humans “remembering” encodings, it is already broken. It should be explicit in the interface contract and tested like any other behaviour.
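The worked example can be reproduced in a few lines. This is a sketch that assumes Latin-1 as the "something else"; the word `café` is just a convenient sample with one non-ASCII character.

```python
text = "café"

# The same four characters become different bytes under different encodings.
utf8_bytes = text.encode("utf-8")      # 5 bytes: é takes two bytes
latin1_bytes = text.encode("latin-1")  # 4 bytes: é takes one byte

# Reading UTF-8 bytes as Latin-1 raises no error, it just yields wrong text.
misread = utf8_bytes.decode("latin-1")

# Reading Latin-1 bytes as UTF-8 fails outright.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(utf8_bytes, latin1_bytes)
print(misread)        # cafÃ©
print(decode_failed)  # True
```

The silent case is the dangerous one: nothing crashes, so the corrupted text can travel a long way before anyone notices.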
Common mistakes (and how to avoid them)
Common mistake
Assuming text is text
Reality: Text without an explicit encoding is an accident waiting to happen. Always specify UTF-8 in file exports and API contracts.
Common mistake
Commas inside CSV fields without quoting
Reality: A name like "Smith, John" will break your parser if fields are not properly quoted. Use a real CSV library, not string splitting.
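A minimal sketch of the difference, using Python's standard `csv` module and a made-up two-field line:

```python
import csv
import io

line = '"Smith, John",42\n'

# String splitting treats the comma inside the quotes as a field separator.
naive_fields = line.strip().split(",")

# A CSV parser respects the quotes and returns two fields.
parsed_fields = next(csv.reader(io.StringIO(line)))

# Writing with the library adds the quotes where they are needed.
out = io.StringIO()
csv.writer(out).writerow(["Smith, John", 42])

print(naive_fields)   # ['"Smith', ' John"', '42']
print(parsed_fields)  # ['Smith, John', '42']
print(out.getvalue())
```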
Common mistake
Treating JSON property order as meaningful
Reality: JSON objects are unordered by specification. If your parser depends on property order, it will break when a different serialiser produces the same valid JSON.
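As a small illustration, the two documents below hold the same data in a different property order. The field names are invented for the example.

```python
import json

first = json.loads('{"id": "0042", "unit": "C"}')
second = json.loads('{"unit": "C", "id": "0042"}')

# As data, the two objects are equal.
same_data = first == second

# Fragile: reading "the first value" depends on how the sender ordered things.
fragile_first = list(first.values())[0]
fragile_second = list(second.values())[0]

# Robust: look values up by name.
robust_first = first["id"]
robust_second = second["id"]

print(same_data)                      # True
print(fragile_first, fragile_second)  # 0042 C
print(robust_first, robust_second)    # 0042 0042
```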
Common mistake
Storing identifiers as numbers
Reality: Postcodes, meter IDs, and account numbers often have leading zeros. Store them as strings. If it is an identifier and not a quantity you do maths on, it should be text.
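A quick sketch of what is lost when an identifier passes through a numeric type. The value `"01234"` is a made-up example.

```python
identifier = "01234"

# Converting to a number silently drops the leading zero.
as_number = int(identifier)
round_tripped = str(as_number)

print(as_number)                    # 1234
print(round_tripped == identifier)  # False: the original text is gone

# Kept as text, the identifier survives unchanged.
kept_as_text = identifier
print(kept_as_text == "01234")      # True
```

Padding the zeros back only works if you already know the intended width, which is exactly the kind of knowledge that tends to live in someone's memory.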
Verification. How you know you understood it
Representation verification drill
Use this to confirm the concept is clear, not memorised.
- Explain symbol, number, and bits. Use the tool to explain the difference between `"A"`, `65`, and `01000001`.
- Write one format contract. Pick a real format and list delimiters, quoting rules, schema expectations, and metadata.
- Explain binary in plain English. Write one paragraph on why binary is representation, not meaning.
Maths ladder (optional). From intuition to rigour
You can learn data without advanced maths, but you cannot become an expert without eventually becoming comfortable with symbols. The goal here is not to show off. It is to make the symbols friendly and precise.
Foundations. Powers of two and counting possibilities
If a system has $n$ bits, each bit has two possible states (0 or 1). The total number of possible bit patterns is:
$$N = 2^n$$
$n$: number of bits (an integer)
$N$: number of distinct patterns (how many different values you can represent)
Example: $n = 8$ (one byte). Then $N = 2^8 = 256$. So a byte can represent 256 distinct values, typically 0 to 255.
Next step. Base conversion and why it matters for data
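A binary number is a sum of powers of two. If you see $01000001_2$ (the bit pattern from the verification drill), the 1s mark which powers are included:
$$01000001_2 = 0 \cdot 2^7 + 1 \cdot 2^6 + 0 \cdot 2^5 + 0 \cdot 2^4 + 0 \cdot 2^3 + 0 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0$$
That equals $64 + 1 = 65$. Why it matters: when data gets corrupted at the byte level (bad encoding, wrong parsing, truncation), the original meaning is lost to everything that reads it afterwards. You cannot “fix it later” reliably because you do not know what the original bits were meant to represent.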
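The sum of powers of two can be spelled out in code. This sketch uses the bit pattern `01000001` from the verification drill; the function name `binary_to_int` is made up for the example, and Python's built-in `int(bits, 2)` does the same job.

```python
def binary_to_int(bits: str) -> int:
    """Add up the powers of two marked by each 1 in the bit string."""
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError(f"not a binary string: {bits!r}")
    total = 0
    # The rightmost bit is the 2**0 place, the next is 2**1, and so on.
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position
    return total


value = binary_to_int("01000001")
print(value)                  # 65
print(int("01000001", 2))     # 65, using the built-in conversion
print(format(value, "08b"))   # 01000001, back to eight bits

# Losing a single trailing bit changes the value completely.
truncated = binary_to_int("0100000")
print(truncated)              # 32
```

The truncated case shows why byte-level damage is hard to repair: 32 is a perfectly valid number, so nothing in the data itself tells you it used to be 65.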
Deeper. Information content (intuition)
The less predictable something is, the more information it carries. If a value is always the same, it carries no surprise. A common formal measure is entropy. In the simplest discrete case:
$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$
$X$: a random variable (the thing that can take different values)
$x$: a particular value of $X$
$p(x)$: probability that $X = x$
$H(X)$: entropy in bits
Example: a fair coin has $p(\text{heads}) = 0.5$, $p(\text{tails}) = 0.5$. Then $H(X) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$ bit. A biased coin has less. Why it matters in data: highly predictable fields can still be important (for joining and identifiers), but they often carry little information for modelling. This is one reason “more columns” is not the same as “more value”.
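The coin example can be checked numerically. This is a minimal sketch of the discrete entropy formula; the function name `entropy_bits` and the 90/10 bias are choices made for illustration.

```python
import math


def entropy_bits(probabilities: list[float]) -> float:
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


fair_coin = entropy_bits([0.5, 0.5])
biased_coin = entropy_bits([0.9, 0.1])
constant_field = entropy_bits([1.0])

print(fair_coin)              # 1.0
print(round(biased_coin, 3))  # 0.469
print(constant_field)         # 0.0
```

The last line is the "always the same" case from the text: a field that never changes carries zero bits of information, however important it may be for joins.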
Mental model
Four layers that keep meaning intact
If you separate meaning from storage, you make data easier to share and harder to misread.
- Layer 1: Context (scope and purpose)
- Layer 2: Meaning (what it represents)
- Layer 3: Schema (fields and rules)
- Layer 4: Storage (files and tables)
Assumptions to keep in mind
- Meaning is written down. Units, time zones, identifiers, and definitions should live with the data, not in somebody’s memory.
- Encoding and format are explicit. If one system writes UTF-8 and another assumes a legacy encoding, text corruption looks like a mystery but it is just missing agreement.
Check yourself
Quick check. Representation and formats
What is a bit?
A binary digit, the smallest unit of data: either 0 or 1.
What does encoding do?
Maps symbols to numbers so systems can store and transmit meaning.
Scenario. A colleague opens a CSV and names look corrupted (odd symbols). What is the likely cause?
An encoding mismatch. The file was saved with one encoding but opened as another (for example UTF-8 vs a legacy encoding).
What is CSV?
A plain-text format in which each line is a record and fields are separated by commas.
Scenario. When would you pick JSON over CSV?
When you need nested structure (objects inside objects) or explicit field names that travel with the data.
Scenario. A dataset has leading zeros in IDs but Excel keeps removing them. What should you do?
Treat IDs as text, not numbers, and use a schema or import settings that preserve formatting. This is a representation choice, not a maths problem.
Why does binary suit computers?
Hardware can reliably distinguish two states, which makes storage and error handling easier.
Artefact and reflection
Artefact
A short module note with one key definition and one practical example
Reflection
Where in your work would being able to explain data representation and formats in your own words, and apply them to a realistic scenario, change a decision, and what evidence would make you trust that change?
Optional practice
Type text and see characters turn into numbers and bits.
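If the interactive tool is not to hand, a few lines of code do the same job. This sketch assumes UTF-8 as the encoding; the function name `describe` is made up for the example.

```python
def describe(text: str) -> list[tuple[str, int, str]]:
    """For each character: the symbol, its code point, and its UTF-8 bits."""
    rows = []
    for char in text:
        code_point = ord(char)  # symbol -> number
        encoded = char.encode("utf-8")  # number -> bytes
        bits = " ".join(format(byte, "08b") for byte in encoded)
        rows.append((char, code_point, bits))
    return rows


for char, code_point, bits in describe("Aé"):
    print(char, code_point, bits)
# A 65 01000001
# é 233 11000011 10101001
```

`"A"` is the symbol, `65` is the number the encoding assigns to it, and `01000001` is how that number is stored. The second row shows that one character is not always one byte.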