Module 3 of 26

Units, notation, and binary basics

Bits, bytes, binary counting, hexadecimal, endianness, and why unit precision matters in every data system from storage allocation to network protocols.

By the end of this module you will be able to:

  • Convert between bits, bytes, and common storage prefixes correctly
  • Distinguish IEC binary prefixes (KiB, MiB, GiB) from SI decimal prefixes (KB, MB, GB)
  • Read a 4-bit binary number and express a byte value in hexadecimal
  • Explain why unit ambiguity causes real data engineering problems

Bits, bytes, and storage prefixes

A bit is the smallest unit of digital information, representing one of two possible states: 0 or 1. A byte is a group of 8 bits. It is the standard addressable unit of memory in most computer architectures. A single byte can represent 256 distinct values (2^8 = 256), sufficient to encode one character in legacy character sets such as ASCII.

Two incompatible prefix systems are in use:

  • SI (decimal) prefixes: 1 kilobyte (KB, strictly written kB in SI notation) = 1,000 bytes. Hard drive manufacturers use this system.
  • IEC (binary) prefixes, standardised in IEC 80000-13:2008: 1 kibibyte (KiB) = 1,024 bytes (2^10). Operating systems historically used this system while labelling units as "KB," which caused the confusion.
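The two definitions above can be written as a small lookup table. This is a minimal sketch for illustration; the names UNIT_BYTES and to_bytes are hypothetical and do not come from any standard library.

```python
# Bytes per unit for both prefix systems. SI uses powers of 10,
# IEC uses powers of 2. These names are illustrative, not a standard API.
UNIT_BYTES = {
    "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40,
}


def to_bytes(value: float, unit: str) -> int:
    """Convert a quantity in the given unit to a whole number of bytes."""
    if unit not in UNIT_BYTES:
        # Refuse to guess: "kb", "K", or "meg" could mean either system.
        raise ValueError(f"unknown or ambiguous unit: {unit!r}")
    return round(value * UNIT_BYTES[unit])


print(to_bytes(1, "KB"))   # 1000
print(to_bytes(1, "KiB"))  # 1024
print(to_bytes(64, "KiB") - to_bytes(64, "KB"))  # 1536
```

Rejecting unrecognised unit strings, rather than falling back to a default, is a deliberate choice in this sketch: a silent default is exactly how the two systems get mixed up.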

The gap widens at larger scales. A "1 TB" hard drive contains 1,000,000,000,000 bytes (SI). Windows, which uses binary calculations, reports this as approximately 931 GiB but labels the figure "GB," making the drive appear smaller than advertised. A data pipeline that treats MiB (IEC) values as MB (SI) will underestimate storage requirements by about 4.6%, because each MiB (1,048,576 bytes) is about 4.9% larger than an MB (1,000,000 bytes). Across petabyte-scale operations, this becomes significant.
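The arithmetic behind the widening gap can be checked directly. The sketch below is illustrative only; excess_percent is a hypothetical helper name.

```python
# Size in bytes of each decimal (SI) and binary (IEC) unit.
SI = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
IEC = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def excess_percent(iec_bytes: int, si_bytes: int) -> float:
    """How much larger the binary unit is than the decimal one, in percent."""
    return (iec_bytes / si_bytes - 1) * 100


# Dicts preserve insertion order, so zip pairs KB with KiB, MB with MiB, etc.
for (si_name, si_size), (iec_name, iec_size) in zip(SI.items(), IEC.items()):
    print(f"{iec_name} vs {si_name}: +{excess_percent(iec_size, si_size):.2f}%")
# KiB vs KB: +2.40%
# MiB vs MB: +4.86%
# GiB vs GB: +7.37%
# TiB vs TB: +9.95%

# The advertised "1 TB" drive, expressed in binary units.
drive_bytes = 1 * SI["TB"]
print(f"1 TB = {drive_bytes / IEC['GiB']:.2f} GiB")  # 1 TB = 931.32 GiB
print(f"1 TB = {drive_bytes / IEC['TiB']:.3f} TiB")  # 1 TB = 0.909 TiB
```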

With bits, bytes, and storage prefixes covered, the next section turns to binary, hexadecimal, and endianness, which build directly on these foundations.

Binary, hexadecimal, and endianness

Computers use base-2 (binary) arithmetic because electronic circuits reliably represent two states. In binary, digit positions represent powers of 2. Binary 0101 = 0 + 4 + 0 + 1 = 5 in decimal. Binary 1011 = 8 + 0 + 2 + 1 = 11. A 4-bit number can represent values from 0000 (0) to 1111 (15), giving 16 possible values (2^4). An 8-bit byte gives 256 values.
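The positional rule above can be spelled out in a few lines. This is a teaching sketch with a hypothetical function name, binary_to_int; in practice Python's built-in int(text, 2) does the same job.

```python
def binary_to_int(bits: str) -> int:
    """Sum the power of 2 for every position that holds a 1."""
    total = 0
    # reversed() makes position 0 the rightmost (least significant) digit.
    for position, digit in enumerate(reversed(bits)):
        if digit not in ("0", "1"):
            raise ValueError(f"not a binary digit: {digit!r}")
        if digit == "1":
            total += 2**position
    return total


print(binary_to_int("0101"))      # 5
print(binary_to_int("1011"))      # 11
print(binary_to_int("1111"))      # 15, the largest 4-bit value
print(binary_to_int("11111111"))  # 255, the largest 8-bit value

# The built-in parser agrees with the manual sum.
assert binary_to_int("1011") == int("1011", 2)
```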

Hexadecimal (base-16) uses digits 0-9 and letters A-F. Each hex digit represents exactly 4 bits; two hex digits represent one byte. The web colour #FF5733 encodes RGB (Red 255, Green 87, Blue 51). MAC addresses appear as six hex pairs such as A4:C3:F0:85:AC:2D. IPv6 addresses use eight groups of four hex digits. Memory addresses in debuggers are expressed in hex.
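The colour example can be decoded by reading two hex digits at a time. The sketch below is illustrative; hex_colour_to_rgb is a hypothetical helper, not part of any library.

```python
def hex_colour_to_rgb(colour: str) -> tuple[int, int, int]:
    """Split '#RRGGBB' into three byte values in the range 0-255."""
    digits = colour.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected six hex digits, got {colour!r}")
    # Each pair of hex digits is one byte: parse it in base 16.
    red, green, blue = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return red, green, blue


print(hex_colour_to_rgb("#FF5733"))  # (255, 87, 51)

# One hex digit is exactly 4 bits, so a byte is two hex digits.
print(f"{0xA4:08b}")  # 10100100  (A = 1010, 4 = 0100)
print(f"{255:02X}")   # FF
```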

Endianness is the order in which bytes are stored for multi-byte values. Big-endian stores the most significant byte first (used by network protocols, also called "network byte order"). Little-endian stores the least significant byte first (used by x86 processors: Intel, AMD). The 32-bit integer 0x12345678 is stored as "12 34 56 78" in big-endian and "78 56 34 12" in little-endian. Mismatches between systems exchanging binary data without agreeing on byte order cause silent data corruption.
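Both byte orders, and the effect of a mismatch, can be reproduced with Python's standard library. This is a minimal sketch; the variable names are illustrative.

```python
import struct
import sys

value = 0x12345678

# Serialise the same 32-bit integer in both byte orders.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")
print(big.hex(" "))     # 12 34 56 78
print(little.hex(" "))  # 78 56 34 12

# Reading big-endian bytes as little-endian raises no error;
# it just yields a different number.
misread = int.from_bytes(big, byteorder="little")
print(hex(misread))     # 0x78563412

# struct format strings make the byte order explicit:
# ">" is big-endian, "<" is little-endian, "I" is an unsigned 32-bit integer.
assert struct.pack(">I", value) == big
assert struct.pack("<I", value) == little

# Byte order of the machine running this script (host-dependent).
print(sys.byteorder)
```

Stating the byte order in every pack and unpack call, instead of relying on the host default, is what prevents the silent misread shown above.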

The names and symbols for binary multiples shall be formed by attaching the appropriate prefix symbol to the symbol 'B' for byte. The binary prefixes are: kibi (Ki), mebi (Mi), gibi (Gi), tebi (Ti). These are distinct from the SI prefixes kilo, mega, giga, tera.

IEC 80000-13:2008, Quantities and units for information science and technology

Common misconception

All systems agree on what 1 KB or 1 GB means, so I don't need to specify the unit system.

In practice, network equipment, storage hardware, operating systems, and programming languages can all have different defaults. Hard drive manufacturers use SI (1 GB = 1,000,000,000 bytes); operating systems historically used binary multiples (1 GB = 1,073,741,824 bytes, properly 1 GiB) while calling the result 'GB'. The difference at 1 GB is approximately 7.4%. Never assume which prefix system a system is using. Always document the unit system explicitly in data schemas and pipeline specifications. The NASA Mars Climate Orbiter was destroyed because one team's software produced thruster data in imperial units while another team's software expected SI; the same category of assumption error affects data pipelines every day.
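One way to document the unit system explicitly is to make the unit part of the data type, so a size cannot be created without one. The sketch below is an assumption about how that might look, not a prescribed schema; ByteUnit and Size are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class ByteUnit(Enum):
    """Each member's value is the number of bytes in one unit."""
    B = 1
    KB = 10**3
    MB = 10**6
    GB = 10**9
    KIB = 2**10
    MIB = 2**20
    GIB = 2**30


@dataclass(frozen=True)
class Size:
    """A quantity that always carries its declared unit."""
    amount: int
    unit: ByteUnit

    def in_bytes(self) -> int:
        return self.amount * self.unit.value


advertised = Size(1, ByteUnit.GB)
reported = Size(1, ByteUnit.GIB)
print(advertised.in_bytes())  # 1000000000
print(reported.in_bytes())    # 1073741824
print(f"{reported.in_bytes() / advertised.in_bytes() - 1:.1%}")  # 7.4%
```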

Check your understanding

A data pipeline receives a file with the header 'size: 512' and the sending system uses IEC binary units (KiB); the receiving system allocates buffer space using SI decimal units (KB). Which statement is correct?

A web designer specifies the background colour as #1A2B3C. What is the decimal value of the red channel?

Key takeaways

  • A bit is a single binary digit (0 or 1); a byte is 8 bits, capable of 256 distinct values. Data is ultimately stored and transmitted as sequences of bits.
  • SI prefixes (KB, MB, GB) use powers of 10; IEC prefixes (KiB, MiB, GiB) use powers of 2. The gap grows significantly at larger scales: 1 TB (SI) is only about 0.909 TiB, or 931 GiB (IEC). Always specify which system your pipeline uses.
  • Hexadecimal compresses binary into a readable form: each hex digit represents exactly 4 bits. Hex appears in colour codes, MAC addresses, memory addresses, and IPv6.
  • Endianness determines byte order in multi-byte values. Big-endian is used by network protocols; little-endian by x86 processors. Mismatches cause silent data corruption.
  • Unit ambiguity is an engineering risk at every scale. The NASA Mars Climate Orbiter loss (1999; about $327.6 million for the combined orbiter and lander missions) is the canonical example of what unit mismatches cost when not caught.

Now that you can work with bits, bytes, and unit systems, the next module builds on that foundation by examining how data is represented in practical formats: character encodings, floating-point numbers, JSON, CSV, and compression. Format choices made at ingestion time follow data through its entire lifecycle.

Standards and sources cited in this module

  1. IEC 80000-13:2008: Quantities and units for information science and technology

    Authoritative definition of KiB, MiB, GiB binary prefixes and their distinction from SI prefixes.

  2. BIPM SI Brochure 9th edition (2019): Prefixes

    SI prefix definitions: kilo (k), mega (M), giga (G) as powers of 10, the basis for decimal KB, MB, GB.

  3. NASA Mars Climate Orbiter Mishap Investigation Board Report, November 1999

    Root cause analysis: metric/imperial unit mismatch in thruster force data caused spacecraft destruction.