Data and Standards in Digital Systems
By the end of this module you will be able to:
- Explain why data standards are necessary for interoperability, and quantify the cost when they are absent
- Distinguish between open and proprietary formats, with UK government policy as context
- Select the correct API description standard for a given integration scenario: REST, event-driven, or graph

Open Banking UK
How standardised APIs transformed UK financial services
Before Open Banking, if a consumer wanted to see all their bank accounts in one app, each bank had to be approached individually for a bespoke data-sharing agreement. Some refused. Those that agreed used incompatible data formats. A fintech building a personal finance tool needed a different integration for each of the 50 largest UK banks.
The Payment Services Directive 2 (PSD2) and the Competition and Markets Authority Order in 2017 changed this by mandating a single standardised API specification that all large UK banks had to implement. The interoperability problem was solved not by the market but by regulatory standardisation. By 2023, 7 million UK consumers were using Open Banking-enabled services and over 60 regulated fintechs had built products on the standardised APIs.
The Open Banking case illustrates the core argument of this module: data standards are not a technical convenience. They are the precondition for digital ecosystems. Without a shared syntax and semantics, every organisation solving the same problem independently produces a fragmented, expensive, and incompatible landscape.
What role did regulatory standards play in making Open Banking work?
With the learning outcomes established, the module begins with a closer examination of why data standards matter.
4.1 Why data standards matter
Interoperability is the ability of two or more systems to exchange information and use it without bespoke transformation work. Achieving it requires agreement on syntax (how the data is structured) and semantics (what the data means). Standards codify both.
Without standards, every integration between two systems requires custom mapping. The mathematics compound quickly: n systems without a shared standard need n(n − 1)/2 bespoke pairwise integrations, so four systems require six, ten require 45, and twenty require 190. Each integration is also a maintenance liability. When either system changes, the mapping must be updated.
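The contrast between pairwise mapping and mapping once to a shared standard can be sketched in a few lines of Python:

```python
# Bespoke point-to-point integrations needed when n systems share no
# common standard: every pair of systems needs its own mapping.
def pairwise_integrations(n: int) -> int:
    return n * (n - 1) // 2

# With a shared standard, each system maps to the standard exactly once.
def standard_integrations(n: int) -> int:
    return n

for n in (4, 10, 20):
    print(n, "systems:", pairwise_integrations(n), "bespoke vs",
          standard_integrations(n), "standard-based")
```

The quadratic growth of the bespoke approach is why the cost compounds as the estate grows.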
The NHS illustrates this at scale. UK hospitals use more than 900 different clinical information systems. A patient referred from a GP surgery to a hospital outpatient department may have their record held in systems that cannot directly exchange data. Clinicians re-enter information manually at each handover. SNOMED CT (Systematised Nomenclature of Medicine Clinical Terms) was mandated for NHS clinical records in 2019 to begin addressing this fragmentation. The standard gives every clinical concept a unique identifier, so a diagnosis code in one system means the same thing in another.
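The unique-identifier idea can be sketched with a tiny illustrative subset of SNOMED CT; real systems query a terminology server rather than a hard-coded dictionary, and the two concepts below are chosen only as well-known examples:

```python
# Illustrative subset of SNOMED CT concepts -- not a real terminology store.
SNOMED_SUBSET = {
    "22298006": "Myocardial infarction (disorder)",
    "73211009": "Diabetes mellitus (disorder)",
}

def decode(concept_id: str) -> str:
    """Resolve a SNOMED CT concept identifier to its term."""
    return SNOMED_SUBSET.get(concept_id, "unknown concept")

# Two systems exchanging the identifier agree on meaning without
# comparing free-text diagnosis strings.
print(decode("22298006"))
```

Because both systems key on the concept identifier, the diagnosis carries the same meaning at every handover.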
“Data standards are the foundation of interoperability. They enable different systems, organisations, and individuals to share and use data in a consistent and reliable way.”
UK Government Data Standards Authority - Data Standards Guidance, Section 1: Why data standards matter
The Data Standards Authority was established within the Central Digital and Data Office to break the pattern of government departments solving the same interoperability problems independently. Its mandate is to publish approved standards and retire the bespoke-mapping approach that generated the NHS fragmentation described above.
Standards enable interoperability at the level of meaning. The next section examines a related choice: open formats that any developer can implement, versus proprietary formats controlled by a single vendor.
4.2 Open formats versus proprietary formats
An open format is publicly specified, available without licence fees, and implementable by any developer or organisation. CSV (Comma-Separated Values), JSON (JavaScript Object Notation), and XML (Extensible Markup Language) are open formats. Any system can read and write them without paying a vendor.
A proprietary format is controlled by a single vendor. Data stored in a proprietary format cannot be read without that vendor's product or an incomplete reverse-engineered implementation. The September 2020 Public Health England (PHE) Excel incident is also a proprietary format story: COVID-19 test results were exchanged in a format whose row limits and silent truncation behaviour were not documented in any open specification.
The Cabinet Office mandated open standards for UK government software, data, and document formats in 2012. GDS built the GOV.UK design system using open web standards throughout. When published in 2016, any government service could adopt it without licence fees or vendor dependency. Over 300 services adopted it in the first two years.
“UK government must use open standards for software interoperability, data, and document formats. Open standards must be used in all future government systems.”
UK Government Open Standards Principles, Cabinet Office, 2012 - Principle 4: Use open standards
The policy followed a review that found significant lock-in costs in public sector IT arising from proprietary formats. The OOXML (Office Open XML) dispute - where Microsoft's format was ratified as an ISO standard but only fully implemented by Microsoft - illustrated that formal standardisation does not guarantee practical openness. The Cabinet Office chose to mandate formats that any developer can implement fully.

Common misconception
“CSV is always the safe, universal format for data exchange.”
CSV has no schema, no data types, and no character encoding standard. A CSV column labelled 'date' may contain ISO 8601 strings, UK date formats (DD/MM/YYYY), US date formats (MM/DD/YYYY), or Excel serial numbers, depending on which tool exported it. Two teams exchanging CSV without an agreed schema are exchanging ambiguous text. JSON Schema or OpenAPI 3.1 components add the type definition that CSV lacks, catching format errors at ingest rather than when an analyst questions a result.
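The date ambiguity is easy to demonstrate: the same CSV cell yields two different dates depending on which convention the reader assumes, while an ISO 8601 value has exactly one reading.

```python
from datetime import datetime

# The same CSV cell parses to different dates depending on which
# convention the reading tool assumes -- nothing in the file says which.
cell = "03/04/2021"
uk = datetime.strptime(cell, "%d/%m/%Y")   # read as 3 April 2021
us = datetime.strptime(cell, "%m/%d/%Y")   # read as 4 March 2021
print(uk.date(), "vs", us.date())

# An ISO 8601 value, as a JSON Schema "format": "date" field requires,
# is unambiguous.
iso = datetime.strptime("2021-04-03", "%Y-%m-%d")
```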
Format governance covers how data is structured. API description standards govern how the interfaces that exchange that data are described and maintained.
4.3 API description standards
Three specifications cover the majority of API integration scenarios in UK organisations. The choice of standard is determined by the communication pattern (synchronous request-response, asynchronous event, or client-driven query), not by technical preference.
OpenAPI Specification 3.1 describes REST APIs. A REST API described in OAS 3.1 can generate client code, interactive documentation, and contract tests automatically. UK central government and the NHS API catalogue both require published APIs to provide an OAS description.
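A minimal sketch of what an OAS 3.1 description contains, written here as a Python dict; the /accounts path and Account schema are hypothetical, not taken from any real government API:

```python
# Skeleton of an OpenAPI 3.1 description. Hypothetical endpoint and schema.
spec = {
    "openapi": "3.1.0",
    "info": {"title": "Accounts API", "version": "1.0.0"},
    "paths": {
        "/accounts": {
            "get": {
                "summary": "List accounts",
                "responses": {
                    "200": {
                        "description": "Account list",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "type": "array",
                                    "items": {"$ref": "#/components/schemas/Account"},
                                }
                            }
                        },
                    }
                },
            }
        }
    },
    "components": {
        "schemas": {
            "Account": {
                "type": "object",
                "required": ["id", "currency"],
                "properties": {
                    "id": {"type": "string"},
                    "currency": {"type": "string"},  # e.g. an ISO 4217 code
                },
            }
        }
    },
}
```

Tooling consumes exactly this structure to generate client code, interactive documentation, and contract tests.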
AsyncAPI 2.6 serves the same function for event-driven APIs. Systems that publish events to a message broker (Apache Kafka, RabbitMQ) rather than responding to HTTP requests use AsyncAPI to describe their channels, message schemas, and bindings.
GraphQL is a query language that allows clients to specify exactly the fields they need. GitHub's API v4 is GraphQL-based. GraphQL suits developer tooling and dashboards with diverse data requirements; it is not a replacement for REST or AsyncAPI.
Adopting multiple competing standards within one organisation creates the same interoperability problems that standards were meant to solve. Define one standard for each communication pattern, then enforce it consistently.
API standards govern how systems exchange data. Master data management governs what that data represents - ensuring the same entity means the same thing everywhere.
4.4 Master data management
Standards govern how data is exchanged. Master data management (MDM) governs how key business entities are defined and maintained across an organisation. MDM creates a single authoritative record for customers, products, locations, and employees, ensuring that the same entity is represented consistently in every system.
The NHS patient matching problem illustrates MDM at the most consequential scale. A patient may be registered under different names, dates of birth, or NHS numbers across different hospital trusts. When a GP refers a patient to a specialist at a different trust, the two trusts may not automatically recognise the records as belonging to the same person. The NHS Master Patient Index (MPI) maintains a canonical patient identifier to address this, but adoption across the 900+ clinical systems remains incomplete as of 2024.
The UK government's GOV.UK One Login programme, launched by GDS in 2022, applies the same MDM principle to citizen identity. Its goal is to give every person who uses government services a single verified digital identity, replacing dozens of separate credentials across departments.

Common misconception
“If data exists in a database, it is well-structured.”
Unstructured and semi-structured data is the norm, not the exception. A database column typed as VARCHAR(255) may contain free-text clinical notes, serialised JSON, comma-separated codes, or encoded binary. The existence of a database schema does not imply data quality or consistency. Organisations often discover the actual structure of their data only when they try to integrate two systems and find that the same column means different things in each.
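A rough sketch of the discovery problem: classifying what a nominally free-text column actually holds. The heuristics and sample values are illustrative only, not a production data profiler.

```python
import json

# Crude classifier for the contents of a VARCHAR column.
# Heuristics and samples are illustrative, not exhaustive.
def classify(value: str) -> str:
    try:
        parsed = json.loads(value)
        if isinstance(parsed, (dict, list)):
            return "serialised JSON"
    except ValueError:
        pass
    if "," in value and all(part.strip().isalnum() for part in value.split(",")):
        return "comma-separated codes"
    return "free text"

samples = ['{"bp": "120/80"}', "E11,I10,J45", "patient reports mild dizziness"]
print([classify(s) for s in samples])
```

Three values that share one column type turn out to need three different parsers, which is exactly what integration projects discover late.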
Master data defines what key entities are. Reference data provides the controlled vocabularies that give those entities consistent meaning across systems.
4.5 Reference data management
Reference data provides the controlled vocabularies and classification schemes that give master data consistent meaning across systems. Country codes, currency codes, industry classification, and postcode formats are all reference data problems.
The ONS (Office for National Statistics) maintains UK authoritative reference data for statistics, including Standard Industrial Classification (SIC) 2007 codes used to categorise businesses. Companies House uses SIC codes when companies register. HMRC uses SIC codes for tax reporting. When a system uses SIC codes and another uses free-text industry descriptions, joining those datasets requires human judgement for every record. The ONS reference data eliminates that judgement by providing a shared vocabulary.
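The shared-vocabulary point can be sketched with a lookup; the two entries below are a tiny illustrative subset of the full ONS code list:

```python
# Illustrative subset of SIC 2007 codes (the authoritative list is
# maintained by the ONS).
SIC_2007 = {
    "62020": "Information technology consultancy activities",
    "47910": "Retail sale via mail order houses or via Internet",
}

def industry(code: str) -> str:
    return SIC_2007.get(code, "unclassified")

# Datasets keyed on the code join mechanically; free-text industry
# descriptions would need human judgement for every record.
print(industry("62020"))
```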
UK postcode validation is another reference data dependency. The ONS Postcode Directory (ONSPD), updated quarterly, maps every current and terminated UK postcode to its geographic coordinates, local authority, and electoral ward. Applications that validate postcodes against the ONSPD can detect invalid entries at input rather than during downstream processing.
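A format-level check can be sketched as below. This is deliberately a simplification: a well-formed postcode may never have been issued, so genuine validation means looking the value up in the current ONSPD release.

```python
import re

# Format-level check only, using the general UK postcode shape.
# Real validation requires a lookup against the quarterly ONSPD data.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$")

def looks_like_postcode(value: str) -> bool:
    return bool(POSTCODE_RE.match(value.strip().upper()))

print(looks_like_postcode("SW1A 1AA"), looks_like_postcode("12345"))
```

Catching malformed entries at input is cheap; catching them in downstream processing means re-contacting the data source.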
“Reference data provides the controlled vocabularies, code lists, and classification schemes that give master data consistent meaning across systems and organisations.”
UK Government Data Standards Authority - Reference Data Management Guidance
The GDPR data minimisation principle (Article 5(1)(c)) compounds the reference data problem: if an organisation collects only the minimum data necessary, it must be certain that the fields it does collect are unambiguous. A country field recorded as free text is not minimum viable data; it is maximum viable ambiguity. ISO 3166-1 alpha-2 codes (GB, FR, DE) are the open standard that removes that ambiguity.
Scenario questions
In September 2020, Public Health England lost 15,841 COVID-19 test results for eight days. The root cause was a file format limitation. A digital adviser reviewing the incident recommends switching all data transfers to a format with an explicit schema and no row limits. Which principle does this recommendation address?
A payment notification service publishes an event every time a transaction is authorised. Four downstream systems subscribe: a fraud scoring engine, a customer notification service, a loyalty programme, and an accounting system. A junior developer suggests documenting this API using OpenAPI Specification 3.1. A senior engineer disagrees. Who is correct and why?
A large insurer has three legacy platforms, each using a different customer identifier format. A new digital portal must show customers their complete history across all three. A developer estimates the integration requires 50 field mapping scenarios. A data architect argues that adding an API gateway will not reduce the mapping count. What is the architect's reasoning?
Key takeaways
- The PHE COVID-19 data loss in September 2020 was caused by a file format limitation: Excel 2003 silently truncated rows beyond 65,536. Data format governance prevents this class of failure.
- Data standards solve interoperability at the level of syntax and semantics. Without them, every system-to-system integration requires bespoke mapping that compounds in cost as the estate grows.
- OpenAPI 3.1 describes synchronous REST APIs. AsyncAPI 2.6 describes event-driven APIs. GraphQL handles client-driven queries. Using the wrong standard produces APIs that are difficult to tool and maintain.
- Master data management creates single authoritative records for key entities. The NHS Master Patient Index and GOV.UK One Login are national-scale MDM initiatives addressing decades of fragmentation.
- Reference data management provides controlled vocabularies. ISO 3166-1 country codes, ONS SIC 2007 codes, and the ONS Postcode Directory replace ambiguous free-text fields with unambiguous shared identifiers.
Standards and sources cited in this module
Public Health England COVID-19 data loss report, DHSC, October 2020
Technical investigation into the Excel row limit incident
Primary source for the PHE Excel row limit incident discussed in this module. The DHSC investigation confirmed 15,841 positive tests were lost for eight days due to the XLS format row limit.
OpenAPI Specification 3.1, OpenAPI Initiative
Full specification
The REST API description standard discussed in Section 4.3. Required for all UK government published APIs and used by the NHS API catalogue.
AsyncAPI Specification 2.6, AsyncAPI Initiative
Full specification
The event-driven API description standard discussed in Section 4.3 as the correct standard for asynchronous publish-subscribe architectures.
UK Government Open Standards Principles, Cabinet Office, 2012
Principle 4: Use open standards
The government policy that mandates open formats for software, data, and documents in UK public sector systems. Provides the context for the open-versus-proprietary analysis in Section 4.2.
UK Government Data Standards Authority, guidance register
Reference data management guidance
The central authority for government-wide data standards. Quoted in Section 4.5 for the definition of reference data and its role in semantic interoperability.
NHS Data Standards, NHS England
SNOMED CT mandate and Master Patient Index documentation
Provides the 900+ clinical systems figure cited in Section 4.1 and the context for NHS patient matching as a master data management problem.
Standards enable interoperability at the data level. The next module moves to the platform level: how APIs, customer journeys, and platform business models determine who captures value from digital systems.
Module 4 of 15 in Foundations