Module 13 of 26 · Applied

Governance and stewardship

30 min read · 4 outcomes · Interactive catalogue browser + drag challenge · 5 standards cited

By the end of this module you will be able to:

  • Compare DAMA and IBM governance maturity frameworks
  • Describe the purpose and components of a data catalogue
  • Explain the data contracts pattern and the problem it solves
  • Design a basic lineage tracking approach for a described pipeline

Real-world case · 2022

Schema changes at Uber silently broke downstream analytics for days.

In 2022, Uber's data engineering team described a persistent problem: schema changes in upstream data sources were silently breaking downstream analytics pipelines. A source team would rename a column or change a data type, and the first indication of a problem was a failed report days later.

The data contracts pattern emerged as the solution: a formal, machine-enforced agreement between producers and consumers. The previous module covered how data moves through pipelines. This module covers how governance ensures those pipelines remain trustworthy.

A source team renamed a column. Downstream reports broke. Nobody noticed for three days. Could a formal agreement between teams have prevented this?

Governance without tooling is just policy. Governance with tooling becomes operational. Data catalogues make data discoverable. Data contracts make interfaces reliable. Lineage makes failures traceable. Together they form the infrastructure of trust.

With the learning outcomes established, this module begins by examining data catalogues in depth.

13.1 Data catalogues

A data catalogue is an organised inventory of an organisation's data assets. It stores technical metadata (schema, column types, row counts), business metadata (descriptions, owners, stewards, quality scores), and operational metadata (freshness, lineage, access patterns). Modern catalogues (Collibra, Alation, DataHub, OpenMetadata) crawl data sources automatically and present a searchable interface.

Without a catalogue, analysts spend an estimated 30% of their time searching for data, verifying its meaning, and confirming whether they are allowed to use it. A well-maintained catalogue reduces this to minutes.
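The three metadata layers can be made concrete with a small sketch. This is a hypothetical, minimal catalogue model (the class, field names, and sample asset are illustrative, not from Collibra, Alation, DataHub, or OpenMetadata); real catalogues populate these fields by crawling sources automatically.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    """One data asset, holding the three metadata layers described above."""
    name: str
    # Technical metadata
    schema: dict               # column name -> type
    row_count: int
    # Business metadata
    description: str
    owner: str
    quality_score: float       # 0.0-1.0
    # Operational metadata
    last_updated: str          # ISO 8601 timestamp
    upstream: list = field(default_factory=list)

# Hypothetical sample asset.
catalogue = {
    "sales.orders": CatalogueEntry(
        name="sales.orders",
        schema={"order_id": "int", "amount": "decimal", "placed_at": "timestamp"},
        row_count=1_200_000,
        description="One row per customer order",
        owner="sales-data-team",
        quality_score=0.98,
        last_updated="2024-01-15T06:00:00Z",
        upstream=["crm.customers"],
    )
}

def search(entries, keyword):
    """The searchable interface: match a keyword against names and descriptions."""
    kw = keyword.lower()
    return [e for e in entries.values()
            if kw in e.name.lower() or kw in e.description.lower()]

print([e.name for e in search(catalogue, "order")])  # ['sales.orders']
```

The point of the sketch is that discovery ("what assets mention orders?"), verification (schema, quality score), and permission questions (owner to ask) are all answered from one place.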

Data governance requires that data be inventoried, classified, and assigned ownership before it can be effectively managed.

ISO/IEC 38505-1:2017 - Clause 5, Data governance principles

ISO 38505 establishes that governance starts with knowing what data you have. A catalogue is the operational implementation of this principle. Without an inventory, governance decisions are made in the dark.

With an understanding of data catalogues in place, the discussion can now turn to data contracts, which build directly on these foundations.

A data catalogue gives analysts a single place to discover, understand, and verify data assets before using them.

13.2 Data contracts

A data contract is a formal agreement between a data producer (the team that generates data) and a data consumer (the team that uses it). The contract specifies the schema (column names, types, constraints), freshness guarantees (data available within N minutes of source change), quality expectations (completeness above 99%, no nulls in key fields), and who to contact when something breaks.

Data contracts are enforced by automated checks that run when data is published. If the data does not meet the contract, the publish is blocked and the producer is notified. This shifts the cost of schema changes from consumers (who discover breakage days later) to producers (who must update the contract before changing their schema).
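A publish-time check can be sketched in a few lines. This is a hypothetical contract and validator (the field names and thresholds are illustrative, not from any specific tool); the key behaviour is that a schema change by the producer fails the check before consumers ever see broken data.

```python
# Hypothetical contract: schema, key-field nullability, completeness threshold.
CONTRACT = {
    "columns": {"order_id": int, "amount": float, "region": str},
    "not_null": ["order_id", "amount"],
    "min_completeness": 0.99,   # minimum share of non-null values
}

def validate(rows, contract):
    """Return a list of violations; an empty list means the publish may proceed."""
    violations = []
    for i, row in enumerate(rows):
        # Schema check: every contracted column must exist with the right type.
        for col, typ in contract["columns"].items():
            if col not in row:
                violations.append(f"row {i}: missing column '{col}'")
            elif row[col] is not None and not isinstance(row[col], typ):
                violations.append(f"row {i}: '{col}' is not {typ.__name__}")
    # Quality check: key fields must meet the completeness threshold.
    for col in contract["not_null"]:
        non_null = sum(1 for r in rows if r.get(col) is not None)
        if rows and non_null / len(rows) < contract["min_completeness"]:
            violations.append(f"column '{col}' below completeness threshold")
    return violations

good = [{"order_id": 1, "amount": 9.5, "region": "EU"}]
renamed = [{"id": 1, "amount": 9.5, "region": "EU"}]  # producer renamed order_id

assert validate(good, CONTRACT) == []
assert any("missing column 'order_id'" in v for v in validate(renamed, CONTRACT))
```

The second assertion is the Uber scenario in miniature: the rename is caught at publish time and the producer is notified, instead of a report failing days later.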

A data contract is an interface specification: it defines the shape, quality, and freshness of data that consumers can rely on.

Andrew Jones, 'Data Mesh in Practice' (2023) - Chapter 6, Data Contracts

Jones frames contracts as interface specifications, borrowing the concept from software engineering's API contracts. Just as an API consumer relies on a documented interface, a data consumer relies on a documented data contract.

Common misconception

Data contracts add bureaucracy that slows teams down.

Without contracts, schema changes silently break downstream pipelines. The cost of debugging silent failures (Uber reported days of broken analytics) far exceeds the cost of maintaining a contract. Contracts shift the cost of change from consumers (who suffer breakage) to producers (who must communicate changes). This is not bureaucracy; it is engineering discipline.

With an understanding of data contracts in place, the discussion can now turn to lineage in practice, which builds directly on these foundations.

Data contracts document the agreement between producers and consumers: schema, freshness, quality thresholds, and escalation contacts.

13.3 Lineage in practice

Module 9 introduced data lineage conceptually. In the Applied stage, lineage becomes operational. Modern lineage tools (OpenLineage, Marquez, Collibra Lineage) capture lineage automatically by instrumenting data processing frameworks (Spark, Airflow, dbt). Every time a transformation runs, the tool records what inputs were read, what transformations were applied, and what outputs were produced.

Column-level lineage traces individual fields through transformations. If a revenue figure in a board report is incorrect, column-level lineage shows exactly which source column and which transformation step contributed to that value.
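That trace can be sketched as a walk over a column-level lineage graph. The graph below is hypothetical (table and column names are illustrative, and real tools such as OpenLineage record richer events), but it shows the core idea: each output column maps back to the source columns and the transformation that produced it.

```python
# Hypothetical column-level lineage graph:
# (table, column) -> its input columns and the transformation applied.
LINEAGE = {
    ("report.board_kpis", "revenue"): {
        "inputs": [("warehouse.orders", "amount")],
        "transform": "sum_by_month",
    },
    ("warehouse.orders", "amount"): {
        "inputs": [("source.payments", "gross_amount")],
        "transform": "currency_normalise",
    },
}

def trace(table, column, lineage):
    """Walk lineage backwards from one column to its ultimate source."""
    path = [(table, column)]
    node = lineage.get((table, column))
    while node:
        # Follow the first input; real tools fan out across all inputs.
        table, column = node["inputs"][0]
        path.append((table, column))
        node = lineage.get((table, column))
    return path

print(trace("report.board_kpis", "revenue", LINEAGE))
# [('report.board_kpis', 'revenue'), ('warehouse.orders', 'amount'),
#  ('source.payments', 'gross_amount')]
```

Given an incorrect revenue figure in the board report, the trace names both the source column (`gross_amount`) and the transformation steps that touched it on the way.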

Common misconception

We track lineage at the table level. That is sufficient.

Table-level lineage shows that table A feeds table B. But when a specific column value is wrong, you need column-level lineage to identify which source field and which transformation produced it. Table-level lineage narrows the search; column-level lineage finds the answer.

13.4 Check your understanding

  1. Uber's data engineering team reported that schema changes in upstream sources silently broke downstream analytics for days. Which governance mechanism would most directly prevent this?

  2. An analyst spends two hours searching for a customer dataset, verifying its meaning, and confirming access permissions. Which tool addresses this problem?

  3. A data quality issue is reported in a monthly revenue figure. The data warehouse team uses table-level lineage to identify that the revenue table was populated from three source tables. They still cannot find which specific column introduced the error. What is missing?


Key takeaways

  • A data catalogue is an organised inventory storing technical, business, and operational metadata. It reduces the time analysts spend searching for data from hours to minutes.
  • Data contracts are formal, machine-enforced agreements between producers and consumers specifying schema, freshness, and quality thresholds. They prevent silent breakage from schema changes.
  • Column-level lineage traces individual fields through transformations to source columns. Table-level lineage narrows the search; column-level lineage finds the answer.
  • ISO/IEC 38505-1:2017 establishes that governance starts with inventory: you cannot govern what you have not catalogued.
  • Implement governance incrementally: inventory first, then ownership, then quality metrics, then contracts and lineage. Trying to do everything at once is the most common failure mode.

Standards and sources cited in this module

  1. ISO/IEC 38505-1:2017, Data governance

    Clause 5, Data governance principles

    Establishes that data must be inventoried and classified before governance can be effective.

  2. DAMA-DMBOK2 (2017)

    Chapter 3, Data Governance

    Industry standard guidance on governance structures, stewardship, and maturity models.

  3. UK Data Standards Authority guidance (2023)

    Full guidance

    UK government approach to data standards and interoperability governance.

  4. Andrew Jones, 'Data Mesh in Practice' (2023)

    Chapter 6, Data Contracts

    Source for the data contracts pattern and its relationship to data mesh architecture.

  5. Uber Engineering Blog, 'Data Quality at Uber' (2022)

    Full post

    Opening case study: schema changes silently breaking downstream analytics and the emergence of data contracts as a solution.
