Module 6 of 26

Open data and FAIR as a guiding lens

What open data means in practice, the FAIR principles for research data, Tim Berners-Lee's 5-star model, and the licensing frameworks that make data legally and technically reusable.

By the end of this module you will be able to:

  • Distinguish legally open, technically open, and practically open data
  • Apply the four FAIR principles to evaluate a dataset for reusability
  • Map a dataset to the correct star level in the 5-star open data model
  • Select the appropriate open licence for a given data sharing scenario

What open data means: three dimensions

The Open Knowledge Foundation's Open Definition v2.1 specifies that data is open if anyone is free to access, use, modify, and share it for any purpose, subject at most to requirements that preserve provenance and openness. This deceptively simple definition breaks into three dimensions that must all hold simultaneously.

  1. Legally open: released under a licence that grants permission to use, copy, redistribute, and adapt with no or minimal restrictions. Placing data on a website without a licence does not make it open: copyright applies by default, and no one has explicit permission to do anything with it. A dataset marked "for academic use only" is legally closed even if publicly accessible.
  2. Technically open: published in a format that can be processed without proprietary software. A PDF containing a table may be legally open under OGL, but it is technically closed: the data cannot be extracted and analysed programmatically without error-prone parsing. The same table published as CSV or JSON under the same licence is both legally and technically open.
  3. Practically open: discoverable, well-described, and usable without significant investigative effort. Data published on an unlisted FTP server with no metadata, no data dictionary, and no contact information is legally and technically open but practically closed: most potential users will never find it, and those who do cannot interpret it without extensive reverse-engineering.
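The three dimensions above can be sketched as a simple check. This is a minimal illustration, not a formal test: the `Dataset` fields, the licence identifiers, and the format list are all assumptions chosen for the example, and a real assessment would need broader lists and human judgement.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical, deliberately short lists used only for this illustration.
OPEN_LICENCES = {"OGL-UK-3.0", "CC-BY-4.0", "CC-BY-SA-4.0", "CC0-1.0"}
MACHINE_READABLE_FORMATS = {"csv", "json", "xml"}


@dataclass
class Dataset:
    licence: Optional[str]       # None means no licence statement was published
    file_format: str             # e.g. "csv", "pdf"
    listed_in_catalogue: bool    # can a potential user discover it?
    has_data_dictionary: bool    # can a user interpret it once found?


def openness(ds: Dataset) -> Dict[str, bool]:
    """Report which of the three dimensions a dataset satisfies."""
    return {
        "legally_open": ds.licence in OPEN_LICENCES,
        "technically_open": ds.file_format.lower() in MACHINE_READABLE_FORMATS,
        "practically_open": ds.listed_in_catalogue and ds.has_data_dictionary,
    }


def is_open(ds: Dataset) -> bool:
    """A dataset counts as open only if all three dimensions hold."""
    return all(openness(ds).values())


# A table locked in a PDF under OGL: legally open, technically closed.
pdf_table = Dataset("OGL-UK-3.0", "pdf", True, True)
# A CSV on an unlisted server with no licence statement and no documentation.
bare_csv = Dataset(None, "csv", False, False)
```

Here `is_open(pdf_table)` and `is_open(bare_csv)` are both false, for different reasons: the first fails only the technical dimension, the second fails the legal and practical ones.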

With the three dimensions of openness established, the discussion can now turn to the FAIR principles, which describe the properties that make data discoverable and reusable.

The FAIR principles

The FAIR principles were published by Wilkinson et al. in Scientific Data (2016) as a framework for maximising the reuse of research data. FAIR is not a licence framework or a file format requirement. It is a set of properties that data and its metadata should exhibit to support automated discovery and reuse by both humans and machines.

Findable: Data and metadata are assigned a globally unique, persistent identifier (DOI, ORCID for people, ROR for organisations). Metadata is indexed in a searchable resource. The identifier in the metadata resolves to the data. This addresses the primary reason research data goes unused: it cannot be found.

Accessible: The identifier resolves via a standardised, open, free, and universally implementable protocol (HTTP, HTTPS). Metadata remains accessible even if the data itself is embargoed or restricted. Accessibility does not mean the data must be free or publicly available: it means the access mechanism is open and standardised.

Interoperable: Data uses a formal, accessible, shared, and broadly applicable language for knowledge representation (controlled vocabularies, ontologies). Data includes qualified references to other data. This enables automated integration across datasets from different sources without manual mapping.

Reusable: Metadata richly describes the data with a plurality of accurate and relevant attributes. Data is released with a clear and accessible data usage licence. Provenance is documented. Data meets community standards for the domain. This is the terminal goal: data that can be reproduced and combined with confidence.
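One way to apply the four principles when evaluating a dataset is as a checklist over its metadata record. The sketch below is a rough approximation under stated assumptions: the field names (`identifier`, `indexed_in`, `protocol`, `vocabularies`, `licence`, `provenance`) are hypothetical, the record values are placeholders, and the FAIR paper defines more detailed sub-principles than these four boolean checks capture.

```python
def fair_report(meta: dict) -> dict:
    """Map each FAIR principle to True/False for one metadata record."""
    return {
        # Findable: a persistent identifier plus an index that lists the record.
        "findable": bool(meta.get("identifier")) and bool(meta.get("indexed_in")),
        # Accessible: an open, standard protocol; the data itself may be restricted.
        "accessible": str(meta.get("protocol", "")).lower() in {"http", "https"},
        # Interoperable: at least one shared vocabulary or ontology is declared.
        "interoperable": bool(meta.get("vocabularies")),
        # Reusable: a clear licence and documented provenance are both present.
        "reusable": all(meta.get(key) for key in ("licence", "provenance")),
    }


# Placeholder record: every value here is invented for illustration.
record = {
    "identifier": "doi:10.0000/placeholder",
    "indexed_in": ["institutional repository"],
    "protocol": "HTTPS",
    "access": "embargoed",   # restricted data can still be Accessible
    "vocabularies": [],      # no controlled vocabulary declared
    "licence": "CC-BY-4.0",
    "provenance": "collection and cleaning steps documented",
}
report = fair_report(record)
```

The embargoed record still passes the Accessible check because the access mechanism is open and standardised, while the empty `vocabularies` list causes it to fail Interoperable.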

Good data management is not a goal in itself, but rather the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the original data publication.

Wilkinson et al., Scientific Data 3, 160018 (2016): The FAIR Guiding Principles for scientific data management and stewardship - Table 1, FAIR Principles

With the FAIR principles in place, the discussion can now turn to the 5-star open data model, a complementary framework for grading how openly data is published.

The 5-star open data model

Tim Berners-Lee proposed the 5-star deployment scheme in 2010 to provide a progressive framework for publishers improving their open data quality. Each star level adds to all previous levels.

  • 1 star: Data is available on the web in any format under an open licence. A scanned PDF of a table under OGL qualifies.
  • 2 stars: Data is available in a machine-readable structured format (Excel spreadsheet rather than scanned image).
  • 3 stars: Data is in a non-proprietary format (CSV, JSON, XML rather than XLS). No specialist software is required to open it.
  • 4 stars: Data uses W3C standards (RDF, SPARQL) and identifies things with URIs so that others can point to it. Each data element has its own stable URI.
  • 5 stars: Data is linked to other people's data to provide context (linked open data). The dataset cites and connects to related datasets using shared identifiers, enabling automated graph traversal across data sources.

Most government open data in the UK operates at 3-star level (CSV under OGL). The UK National Biodiversity Network Atlas and some components of data.gov.uk achieve 4-star level. True 5-star linked open data remains rare outside academic and semantic web contexts.
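Because each star level adds to all previous levels, mapping a dataset to a star rating amounts to counting how many criteria are met in order before the first one fails. The function below is a minimal sketch of that reading; the parameter names are assumptions introduced for this example.

```python
def star_level(
    open_licence: bool,      # 1 star: on the web under an open licence
    structured: bool,        # 2 stars: machine-readable structured format
    non_proprietary: bool,   # 3 stars: open format such as CSV, JSON, XML
    uses_uris: bool,         # 4 stars: W3C standards, things identified by URIs
    linked: bool,            # 5 stars: linked to other people's data
) -> int:
    """Count consecutive criteria met, stopping at the first failure."""
    stars = 0
    for met in (open_licence, structured, non_proprietary, uses_uris, linked):
        if not met:
            break
        stars += 1
    return stars


scanned_pdf_ogl = star_level(True, False, False, False, False)   # 1
excel_ogl = star_level(True, True, False, False, False)          # 2
csv_ogl = star_level(True, True, True, False, False)             # 3
csv_no_licence = star_level(False, True, True, False, False)     # 0
```

The last case shows why the licence comes first: a CSV with no open licence earns no stars at all, however good its format.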

With the 5-star open data model in place, the discussion can now turn to open data licences, which determine whether data at any star level can legally be reused.

Open data licences

Licence selection determines what downstream users can do with data. Mismatched licences block data combination: a dataset under CC BY-SA 4.0 cannot be legally combined and published with a dataset under CC BY-ND 4.0 because the output cannot simultaneously satisfy both licence conditions.

Open Government Licence v3.0 (OGL v3.0) is the UK public sector standard for Crown copyright data. It permits use, adaptation, and redistribution for any purpose provided attribution is given and the OGL notice is preserved. OGL is compatible with CC BY 4.0 and CC BY-SA 4.0. Ordnance Survey OpenData, Companies House data, and most ONS statistics are released under OGL.

Creative Commons Attribution 4.0 (CC BY 4.0) permits use, adaptation, and redistribution for any purpose including commercial, provided attribution is given. It is the dominant licence for open research data and is compatible with OGL v3.0. It imposes no share-alike requirement, meaning derived works may use different licences.

Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) adds a copyleft condition: derivative works must be published under the same or a compatible licence. This prevents proprietary lock-in of derived datasets but restricts combination with non-share-alike materials.

Creative Commons Zero (CC0) is a public domain dedication. The publisher waives all copyright and database rights to the extent permitted by law. No attribution is legally required, although community norms still expect citation. Wikidata, most US federal government data, and many genomic reference datasets use CC0. It is the least restrictive option and creates the fewest downstream combination problems.
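The licence conditions described in this section can be modelled, very roughly, as three flags per licence: attribution, share-alike, and no-derivatives. The sketch below combines those flags to show why some pairings block data combination. It is a simplified teaching model and an assumption of this example, not a legal compatibility test: real compatibility depends on the exact licence texts, and the identifiers used as dictionary keys are SPDX-style labels chosen for readability.

```python
from typing import Iterable, NamedTuple, Optional


class Terms(NamedTuple):
    attribution: bool      # downstream users must credit the source
    share_alike: bool      # derivatives must carry the same or a compatible licence
    no_derivatives: bool   # adapted versions may not be shared


# Simplified flags for the licences discussed in this module.
LICENCES = {
    "OGL-UK-3.0": Terms(True, False, False),
    "CC-BY-4.0": Terms(True, False, False),
    "CC-BY-SA-4.0": Terms(True, True, False),
    "CC-BY-ND-4.0": Terms(True, False, True),
    "CC0-1.0": Terms(False, False, False),
}


def combined_terms(names: Iterable[str]) -> Optional[Terms]:
    """Conditions on a dataset derived from two or more sources, or None if blocked."""
    terms = [LICENCES[name] for name in names]
    # Combining sources creates an adaptation, which no-derivatives forbids sharing.
    if any(t.no_derivatives for t in terms):
        return None
    return Terms(
        attribution=any(t.attribution for t in terms),
        share_alike=any(t.share_alike for t in terms),
        no_derivatives=False,
    )


blocked = combined_terms(["CC-BY-SA-4.0", "CC-BY-ND-4.0"])
ogl_plus_by = combined_terms(["OGL-UK-3.0", "CC-BY-4.0"])
cc0_plus_sa = combined_terms(["CC0-1.0", "CC-BY-SA-4.0"])
```

In this model `blocked` is `None`, the OGL and CC BY pairing requires only attribution, and adding any share-alike source makes the whole combined output share-alike.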

Common misconception

Putting data on a public website makes it open data.

Publication without an explicit open licence does not create open data. Under UK and EU copyright law, the default position is that all rights are reserved unless explicitly licensed. A publicly accessible dataset with no licence statement cannot legally be copied, redistributed, or incorporated into other products. The three requirements for open data are all non-negotiable: a legal open licence (OGL, CC BY, CC0, or equivalent), a machine-readable non-proprietary format, and sufficient metadata for discovery and interpretation. A CSV with no licence statement is technically accessible but legally closed. Adding an OGL or CC0 statement to an existing dataset is the minimum intervention required to make it legally open.
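A licence statement is most useful when it travels with the data in machine-readable form. The sketch below writes a small JSON descriptor to sit alongside a CSV file. The field names are loosely modelled on the Frictionless Data Package descriptor (`name`, `licenses`, `resources`) and should be treated as assumptions to check against that specification; the dataset name and file path are hypothetical.

```python
import json


def licence_descriptor(name, data_path, licence_id, licence_url, licence_title):
    """Build a minimal descriptor that states the licence for one data file."""
    return {
        "name": name,
        "licenses": [
            {"name": licence_id, "path": licence_url, "title": licence_title}
        ],
        "resources": [{"path": data_path}],
    }


descriptor = licence_descriptor(
    name="example-dataset",   # hypothetical dataset name
    data_path="data.csv",     # hypothetical file published next to the descriptor
    licence_id="OGL-UK-3.0",
    licence_url="https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
    licence_title="Open Government Licence v3.0",
)

# Serialise for publication alongside the data file.
descriptor_json = json.dumps(descriptor, indent=2)
```

Publishing `descriptor_json` next to the CSV addresses the legal dimension only; discovery metadata and a data dictionary are still needed for the data to be practically open.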

Common misconception

Open data means anyone can use it for anything.

Open data licences vary significantly. The UK Open Government Licence requires attribution. CC BY-SA requires derivative works to be published under the same or a compatible licence. Some open datasets contain derived personal data that still falls under GDPR. Open does not mean unrestricted; it means the conditions of use are explicit and permissive.

Key takeaways

  • Open data requires three simultaneous conditions: legally open (explicit open licence permitting any use), technically open (non-proprietary machine-readable format), and practically open (discoverable with sufficient metadata). All three must hold.
  • The FAIR principles (Findable, Accessible, Interoperable, Reusable) from Wilkinson et al. 2016 provide a framework for maximising research data reuse. FAIR addresses metadata quality and persistent identifiers, not just file format or licence.
  • The 5-star open data model provides a progressive quality ladder: 1-star (any format, open licence) through 5-star (linked open data with URIs connecting to related datasets). Most UK government open data operates at 3-star (CSV, OGL).
  • Open data licences commonly used in the UK: OGL v3.0 (Crown copyright, compatible with CC BY 4.0), CC BY 4.0 (attribution required, any purpose), CC BY-SA 4.0 (attribution + share-alike), CC0 (public domain dedication, no conditions). Licence incompatibility blocks data combination.
  • The Ordnance Survey OpenData release demonstrated that the economic value of open geographic data can exceed 10 times the production cost. The constraint on reuse is typically the licence, not the data itself.

You now understand when data is truly open and reusable. The next module shifts from access to presentation: how to visualise data honestly, choose the right chart types, and design for accessibility. Good visualisation is the difference between insight and noise.

Standards and sources cited in this module

  1. Open Knowledge Foundation: Open Definition v2.1

    Formal definition of open knowledge and open data: legal, technical, and practical openness requirements.

  2. Wilkinson et al. (2016): The FAIR Guiding Principles - Scientific Data 3, 160018

    Original FAIR principles paper: Findable, Accessible, Interoperable, Reusable with implementation guidance for research data.

  3. Tim Berners-Lee: 5-star open data deployment scheme (2010)

    Progressive 5-star framework for open data quality from any-format under open licence to fully linked open data.

  4. Open Government Licence v3.0 (UK National Archives)

    UK public sector open data licence: permitted uses, attribution requirements, and compatibility with Creative Commons licences.

  5. Deloitte / BIS: Market Assessment of Public Sector Information (2013)

    Economic analysis estimating the social and economic value generated by UK open government data releases including Ordnance Survey OpenData.