Module 4 of 26

Data representation and formats

How computers encode characters, numbers, images, and structured records, and why format and encoding choices have real consequences for data quality.

By the end of this module you will be able to:

  • Explain why UTF-8 dominates web encoding and how it differs from ASCII
  • Describe the IEEE 754 floating-point precision problem with a concrete financial example
  • Compare JSON, XML, and CSV for a given use case
  • Distinguish lossless from lossy compression and identify appropriate uses for each

Character encoding: ASCII, UTF-8, and Unicode

A character encoding is a mapping between characters (letters, digits, symbols) and their binary representations. ASCII (American Standard Code for Information Interchange, standardised 1963) encodes 128 characters using 7 bits: the 26 uppercase and 26 lowercase Latin letters, digits 0-9, punctuation, and 33 control characters. It cannot represent accented characters, non-Latin scripts, or even the pound sign (£).
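These limits are easy to check with Python's standard library (ord returns a character's Unicode code point; ASCII covers only 0 to 127):

```python
# Every ASCII character has a code point below 128 (7 bits)
print(ord("A"))                          # 65
print("Hello, World!".encode("ascii"))   # succeeds: all characters are ASCII

# The pound sign lies outside ASCII's 7-bit range
print(ord("£"))                          # 163
try:
    "£".encode("ascii")
except UnicodeEncodeError as exc:
    print("not representable in ASCII:", exc)
```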

UTF-8 (Unicode Transformation Format, 8-bit) is a variable-width encoding of the Unicode character set. It is backward-compatible with ASCII for the first 128 characters but uses 2 to 4 bytes for characters outside that range. The pound sign £ (Unicode U+00A3) encodes as two bytes in UTF-8 (0xC2 0xA3). In the legacy Windows-1252 encoding, it encodes as a single byte (0xA3). A file saved as Windows-1252 and read as UTF-8 produces garbled output because 0xA3 in UTF-8 is a continuation byte in a multi-byte sequence, not a standalone character.
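The byte-level difference, and both failure modes of a mismatched read, can be demonstrated with Python's built-in codecs (Python calls Windows-1252 "cp1252"):

```python
pound = "£"                               # U+00A3
print(pound.encode("utf-8"))              # b'\xc2\xa3' (two bytes)
print(pound.encode("cp1252"))             # b'\xa3'     (one byte)

# Reading Windows-1252 bytes as UTF-8 fails: 0xA3 is only valid as a
# continuation byte inside a multi-byte sequence, never standalone
try:
    b"\xa3".decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode error:", exc)

# The reverse mistake raises no error; it silently produces mojibake
print(b"\xc2\xa3".decode("cp1252"))       # 'Â£'
```

The silent second case is the more dangerous one: the garbled text flows downstream without any exception being raised.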

As of 2023, over 98% of web pages use UTF-8 (W3C web crawl data). It is the default for HTML5 and required by RFC 8259 for JSON. Its backward-compatibility with ASCII meant it could be adopted gradually without breaking legacy content.

With character encoding covered, the discussion can now turn to numeric representation and structured data formats, which build directly on these foundations.

Numbers: floating-point and structured data formats

Floating-point numbers in virtually all modern hardware follow the IEEE 754-2019 standard. A 32-bit single-precision float uses 1 sign bit, 8 exponent bits, and 23 significand bits (a 64-bit double uses 1, 11, and 52). Binary floating point cannot represent most decimal fractions exactly: 0.1 is a repeating fraction in binary, so 0.1 + 0.2 in floating-point arithmetic does not equal exactly 0.3. For financial calculations, always use fixed-point decimal types (SQL DECIMAL, Python's decimal.Decimal), never floating-point.
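Both effects can be reproduced in a few lines of standard-library Python (struct packs a value into the 32-bit single-precision format and unpacks it again):

```python
import struct
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly
print(0.1 + 0.2)           # 0.30000000000000004
print(0.1 + 0.2 == 0.3)    # False

# Round-tripping 99.99 through a 32-bit float shows what is actually stored
stored = struct.unpack("f", struct.pack("f", 99.99))[0]
print(stored)              # 99.98999786376953

# Fixed-point decimal arithmetic is exact for decimal fractions
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True
```

Note that Decimal should be constructed from strings, not floats: Decimal(0.1) would faithfully preserve the float's error.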

Three formats dominate structured data exchange:

  • CSV: rows of comma-separated values, human-readable, universally supported. Weaknesses: no type information, no nesting, ambiguous handling of commas within values.
  • JSON: supports nesting, arrays, and typed values (strings, numbers, booleans, null). The dominant REST API format. Required by RFC 8259 to be UTF-8 encoded.
  • XML: supports schemas (XSD), namespaces, and complex document structures. More verbose than JSON. Prevalent in legacy enterprise systems, financial messaging (SWIFT, ISO 20022), and healthcare (HL7 v2/v3).
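The CSV/JSON trade-off on types is easy to see by round-tripping the same record through both formats (the record's values here are made up for illustration):

```python
import csv
import io
import json

# A hypothetical record with mixed types and a non-ASCII name
record = {"city": "Nîmes", "population": 148561, "active": True}

# JSON preserves types and, per RFC 8259, the full Unicode range
payload = json.dumps(record, ensure_ascii=False)
print(payload)
print(json.loads(payload) == record)   # True: exact round-trip

# CSV flattens everything to strings; type information is lost
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record)
writer.writeheader()
writer.writerow(record)
buf.seek(0)
row = next(csv.DictReader(buf))
print(row)   # population and active come back as the strings '148561', 'True'
```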

Apache Parquet and Apache Avro are binary formats optimised for large-scale analytics. Not human-readable, but offer far better compression and query performance than text formats. Parquet is columnar, enabling efficient queries that access only specific columns in large datasets.

With numeric representation and data formats covered, the discussion can now turn to compression, which builds directly on these foundations.

"JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8. Implementations that generate JSON text are encouraged to use UTF-8 encoding."

RFC 8259 (IETF), Section 8.1: Character Encoding

Compression: lossless and lossy

Lossless compression allows perfect reconstruction of the original data from the compressed version. No information is discarded. Examples: DEFLATE (used in ZIP and PNG), LZ4, Zstandard. Appropriate for text, code, databases, and any data where integrity is essential.
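Python's zlib module exposes DEFLATE directly, making the lossless guarantee easy to verify (the sample payload is arbitrary repetitive text):

```python
import zlib

# Repetitive text compresses well under DEFLATE (the algorithm in ZIP and PNG)
original = b"timestamp,value\n" + b"2023-01-01,0.5\n" * 1000
compressed = zlib.compress(original)
print(len(original), "->", len(compressed), "bytes")

# Lossless: decompression reconstructs the input exactly, byte for byte
print(zlib.decompress(compressed) == original)   # True
```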

Lossy compression permanently discards some information to achieve higher compression ratios. The original data cannot be fully reconstructed. Examples: JPEG (images), MP3 (audio), H.264 (video). Appropriate for media where small quality reductions are imperceptible to human senses.

A 24-megapixel RAW photograph might be 24 MB. As a high-quality JPEG it might be 4 MB (6:1 ratio) with no visible quality loss. As a low-quality JPEG it might be 400 KB (60:1 ratio) with visible artefacts. PNG lossless would be approximately 18 MB, preserving every pixel.

Never apply lossy compression to structured data: lossy codecs are designed for perceptual media, and the information they discard from a data file is silent corruption, not a cosmetic quality reduction. Re-saving a JPEG introduces additional quality loss each time (generation loss); this is why document archives should use lossless formats.

Common misconception

"JSON is just text, so there are no encoding issues as long as the JSON syntax is valid."

RFC 8259 requires JSON to be encoded in UTF-8. Many real-world JSON producers emit Windows-1252 or ISO 8859-1 encoded files with a .json extension. Parsers that do not validate encoding before parsing will either fail on non-ASCII characters or produce incorrect character data silently. If a supplier sends JSON containing French accented characters (such as in city names) and it was saved in Windows-1252, your UTF-8 parser will raise an error or corrupt those characters. Always validate encoding before parsing, and require UTF-8 explicitly in API contracts.
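A minimal sketch of that validate-then-parse pattern, using a made-up supplier payload (valid JSON, but saved as Windows-1252):

```python
import json

# Hypothetical supplier file: syntactically valid JSON, wrong encoding
raw = '{"city": "Nîmes"}'.encode("cp1252")

# Validate the encoding explicitly instead of trusting the .json extension
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    # Reject or quarantine in production; fall back only if the
    # supplier contract explicitly permits a legacy encoding
    text = raw.decode("cp1252")

print(json.loads(text)["city"])   # Nîmes
```

In a strict pipeline the except branch would raise rather than fall back, forcing the supplier to resend UTF-8 as the contract requires.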

Check your understanding

A financial system stores transaction amounts as 32-bit single-precision floats. A transaction for £99.99 is stored and then retrieved. The system displays £99.98999786376953. What is the root cause?

A data engineer receives a JSON file from a third-party supplier containing French city names with accented characters such as Nîmes. The engineer's parser raises an error on several records. The file has a .json extension. What is the most likely cause?

Key takeaways

  • ASCII encodes 128 characters in 7 bits; UTF-8 extends this to the full Unicode character set using variable-width encoding and is backward-compatible with ASCII. Over 98% of web pages use UTF-8.
  • IEEE 754 floating-point cannot represent most decimal fractions exactly. Financial data must use fixed-point decimal types (SQL DECIMAL, Python's decimal.Decimal). The 0.1 + 0.2 != 0.3 problem is a direct consequence.
  • JSON, CSV, and XML have different trade-offs: CSV is compact and universal but lacks types and nesting; JSON supports nesting and is the dominant API format; XML is verbose but schema-capable, used in finance and healthcare.
  • Lossless compression (DEFLATE, LZ4) preserves data exactly. Lossy compression (JPEG, MP3) discards information permanently. Never apply lossy compression to structured data files.

Knowing how data is encoded and formatted prepares you for the next question: how do systems agree on what formats to use? The next module covers data standards and interoperability, from ISO 8601 dates to HL7 FHIR health records, and the real-world cost of getting it wrong.

Standards and sources cited in this module

  1. RFC 8259: The JavaScript Object Notation (JSON) Data Interchange Format

    Section 8.1: UTF-8 encoding requirement for JSON. The formal specification of JSON syntax.

  2. IEEE 754-2019: Standard for Floating-Point Arithmetic

    Floating-point representation format: single precision (32-bit) and double precision (64-bit) specifications.

  3. Unicode Standard 15.1 (2023)

    Chapter 2: General structure of Unicode encoding. UTF-8 encoding algorithm.

  4. W3C Web Almanac 2023: Encoding chapter

    UTF-8 adoption rate statistics across the web (98%+ coverage).