Data lifecycle and flow
By the end of this module you will be able to:
- Identify the stages of the data lifecycle (create, store, use, share, archive, destroy) for a described dataset
- Apply the GDPR storage limitation principle to a retention decision
- Explain why data lineage matters for audit and troubleshooting
- Describe what metadata should be captured at each lifecycle stage
Every piece of data has a lifespan. It is created or collected, stored, processed, shared, archived, and eventually destroyed. Managing data through these stages is not bureaucracy. It is how organisations meet regulatory obligations, control costs, and maintain trust with the people whose data they hold.
9.1 The six stages
The data lifecycle is the sequence of stages through which data passes from initial creation or collection to its eventual destruction:
- Create/Collect: data is generated or ingested from source systems.
- Store: data is persisted in databases, data lakes, or warehouses.
- Use/Process: data is transformed, analysed, or fed into models.
- Share/Publish: data is distributed to consumers via APIs or reports.
- Archive: data is moved to long-term storage to meet retention obligations.
- Destroy: data is securely deleted when retention periods expire.
Each stage carries distinct metadata requirements, quality considerations, and governance obligations.
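The six stages can be modelled as a small state machine. The sketch below is illustrative only: the `Stage` names mirror the list above, but the `ALLOWED` transition table and the `advance` helper are assumptions made for this example, not a prescribed model. Real datasets may loop between use and share, or skip archiving entirely.

```python
from enum import Enum


class Stage(Enum):
    CREATE = "create/collect"
    STORE = "store"
    USE = "use/process"
    SHARE = "share/publish"
    ARCHIVE = "archive"
    DESTROY = "destroy"


# Assumed transitions, for illustration only.
ALLOWED = {
    Stage.CREATE: {Stage.STORE},
    Stage.STORE: {Stage.USE, Stage.ARCHIVE, Stage.DESTROY},
    Stage.USE: {Stage.USE, Stage.SHARE, Stage.ARCHIVE, Stage.DESTROY},
    Stage.SHARE: {Stage.USE, Stage.ARCHIVE, Stage.DESTROY},
    Stage.ARCHIVE: {Stage.DESTROY},
    Stage.DESTROY: set(),  # terminal: nothing may follow destruction
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{current.value} -> {target.value} is not allowed")
    return target
```

Making "destroy" a terminal state is the useful part of the design: any attempt to use or share a dataset after destruction fails loudly instead of passing unnoticed.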
Click through each stage in the interactive diagram below to see what happens, what metadata to capture, and what risks arise when a stage is neglected.
9.2 Retention and deletion obligations
The retention period is the defined length of time that data must be kept before it can or must be deleted. Periods may be set by law, regulation, contract, or internal policy. UK organisations navigate multiple overlapping frameworks.
“Personal data shall be kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed.”
GDPR Regulation (EU) 2016/679 - Article 5(1)(e), Storage limitation
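There is no single prescribed retention period in GDPR. The period depends on purpose. HMRC, for example, generally requires companies to keep business and accounting records, which often contain personal data, for six years from the end of the financial year they relate to. NHS health records have longer prescribed periods under the NHS Records Management Code of Practice. The key principle: keep personal data only as long as the purpose requires, then delete it or render it no longer identifiable.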
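A retention decision of this kind reduces to a date calculation once the purpose and period are documented. The following is a minimal sketch under stated assumptions: the `RETENTION_YEARS` table is hypothetical (the six-year figure follows the financial records example above, the two-year marketing figure is invented for illustration), and `deletion_due` and `is_overdue` are names made up for this example.

```python
from datetime import date

# Hypothetical retention schedule: years to keep a record after its
# trigger date (for example, the end of the relevant financial year).
RETENTION_YEARS = {
    "financial_record": 6,   # follows the six-year example in the text
    "marketing_contact": 2,  # invented value, for illustration only
}


def deletion_due(category: str, trigger: date) -> date:
    """Return the date on which a record of this category becomes due for deletion."""
    years = RETENTION_YEARS[category]
    try:
        return trigger.replace(year=trigger.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 February.
        return trigger.replace(year=trigger.year + years, day=28)


def is_overdue(category: str, trigger: date, today: date) -> bool:
    """True once the retention period has expired."""
    return today >= deletion_due(category, trigger)
```

For example, `is_overdue("financial_record", date(2020, 3, 31), date(2026, 4, 1))` evaluates to `True`, because the six-year period ended on 31 March 2026.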
ISO/IEC 27001:2022 Annex A control 8.10 (Information deletion) requires information stored in information systems, devices, or other storage media to be deleted when it is no longer required; the related control 7.14 covers secure disposal or re-use of equipment. "Secure" means more than pressing delete: it means verifiable, documented destruction of all copies.
Common misconception
“Deleting a row from the production database means the data is gone.”
Deleting from production while retaining copies in backups, audit logs, disaster recovery sites, and data warehouse extracts does not constitute compliant disposal under GDPR. A deletion programme must inventory all locations where the data exists (production, backups, archives, downstream systems) and address every instance before marking records as destroyed.
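One way to enforce the "every instance" rule is to refuse to mark a record as destroyed until each location in the inventory has been confirmed. This is a sketch only: the `REQUIRED_LOCATIONS` tuple simply restates the locations named above, and the `DeletionRecord` class and its methods are hypothetical names for this example.

```python
from dataclasses import dataclass, field

# Locations taken from the paragraph above; a real inventory would be
# specific to the organisation's systems.
REQUIRED_LOCATIONS = ("production", "backups", "archives", "downstream_systems")


@dataclass
class DeletionRecord:
    record_id: str
    # Maps location name -> True once deletion there has been verified.
    verified: dict = field(default_factory=dict)

    def confirm(self, location: str) -> None:
        """Record that deletion has been verified in one location."""
        if location not in REQUIRED_LOCATIONS:
            raise ValueError(f"unknown location: {location}")
        self.verified[location] = True

    def outstanding(self) -> list:
        """Locations where deletion has not yet been verified."""
        return [loc for loc in REQUIRED_LOCATIONS if not self.verified.get(loc)]

    def can_mark_destroyed(self) -> bool:
        """Only true when every inventoried location has been addressed."""
        return not self.outstanding()
```

A record confirmed only in production reports `["backups", "archives", "downstream_systems"]` as outstanding, which is exactly the situation the misconception overlooks.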
9.3 Data lineage
Data lineage is a record of a dataset's origins, transformations, and movements over time. It documents which source systems contributed, which transformations were applied at each stage, and which downstream systems or reports consume the output. Lineage enables three practical capabilities:
- Debugging: when a report shows an unexpected value, lineage allows tracing the value back through transformations to the source.
- Regulatory compliance: under GDPR Article 14(2)(f), controllers must tell data subjects the source of personal data that was not obtained from them directly. Under FCA regulations, firms must demonstrate the provenance of data used in regulatory reporting.
- Impact analysis: before changing a source system's schema, lineage shows which downstream datasets and reports will be affected.
Modern data catalogue tools (Apache Atlas, Collibra, Alation) can capture much of this lineage automatically, for example by analysing SQL queries and ETL (Extract, Transform, Load) operations.
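Debugging and impact analysis are both graph traversals over lineage records: one walks upstream from a report, the other downstream from a source. The sketch below assumes a hypothetical lineage map (`FEEDS`) with invented dataset names; it does not reflect the data model of any catalogue tool.

```python
from collections import deque

# Hypothetical lineage: each key feeds the datasets listed as its value.
FEEDS = {
    "crm_a": ["staging_customers"],
    "crm_b": ["staging_customers"],
    "crm_c": ["staging_customers"],
    "staging_customers": ["warehouse_revenue"],
    "warehouse_revenue": ["board_report"],
}


def downstream(node, feeds=FEEDS):
    """Impact analysis: every dataset reachable from node."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for child in feeds.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(node, feeds=FEEDS):
    """Debugging: every dataset that contributes to node."""
    # Invert the edges, then reuse the same breadth-first walk.
    parents = {}
    for source, targets in feeds.items():
        for target in targets:
            parents.setdefault(target, []).append(source)
    return downstream(node, parents)
```

Calling `upstream("board_report")` lists every dataset to check when a figure in the report looks wrong, and `downstream("crm_a")` lists everything affected by a schema change in that source.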
“[T]he controller shall provide the data subject with the following information [...]: from which source the personal data originate, and if applicable, whether it came from publicly accessible sources.”
GDPR Regulation (EU) 2016/679 - Article 14(2)(f), Information to be provided where personal data have not been obtained from the data subject
This obligation means organisations must know where their personal data came from. Without lineage records, stating the source in a privacy notice or in response to a data subject access request (DSAR) becomes guesswork. Lineage is not just a technical nice-to-have; for personal data it is how an organisation meets a regulatory requirement.
9.4 Metadata at every stage
Metadata is data about data. It is essential at every lifecycle stage for discoverability, quality assessment, and compliance. The interactive diagram above lists specific metadata requirements per stage. The overriding principle: capture metadata at the point of creation, not retrospectively.
Common misconception
“We can add metadata later when we build the data catalogue.”
Metadata that is not captured at creation time is extremely difficult to reconstruct. Legacy datasets without recorded origin, purpose, or consent basis cannot be compliantly used, shared, or deleted. Retrospective metadata reconstruction projects often cost more than the data is worth. The time to capture metadata is when the data first enters the organisation.
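Capture at creation can be enforced in code by refusing to register a dataset without the required fields. This is a minimal sketch: the fields `source`, `purpose`, and `consent_basis` follow the paragraph above, while `CreationMetadata` and `register_dataset` are hypothetical names, and a real catalogue would record considerably more.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CreationMetadata:
    source: str
    purpose: str
    consent_basis: str
    collected_at: datetime


def register_dataset(name: str, source: str, purpose: str, consent_basis: str) -> CreationMetadata:
    """Reject ingestion unless origin, purpose, and consent basis are all supplied."""
    supplied = {"source": source, "purpose": purpose, "consent_basis": consent_basis}
    missing = [key for key, value in supplied.items() if not value or not value.strip()]
    if missing:
        raise ValueError(f"{name}: missing metadata: {', '.join(missing)}")
    # Timestamp is taken at registration, i.e. at the point of collection.
    return CreationMetadata(source, purpose, consent_basis, datetime.now(timezone.utc))
```

Freezing the dataclass means the recorded origin cannot be silently edited later; any correction has to be a new, auditable record.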
A GP surgery collects patient consultation notes containing personal and special category health data. A practice manager is unsure whether to retain the notes for 5 years, 10 years, or indefinitely. Which GDPR principle is most directly relevant?
An analyst discovers that a revenue figure in the monthly board report is incorrect. The report was generated from a data warehouse loaded by a pipeline sourcing three CRM systems. What is the most efficient way to find where the error was introduced?
Your organisation deletes customer records from the production database when they close their account. An internal audit discovers the same records still exist in nightly backups, the data warehouse, and two downstream reporting extracts. Is the deletion GDPR-compliant?
Key takeaways
- The data lifecycle runs from create/collect through store, use/process, share/publish, archive, and destroy. Each stage has distinct metadata, quality, and governance requirements.
- GDPR Article 5(1)(e) (storage limitation) requires that personal data be kept in identifiable form no longer than is necessary for the purposes for which it is processed. Retention periods should be documented and enforced, ideally through automation.
- Data lineage records the origin, transformations, and movements of a dataset. It enables debugging, regulatory compliance (GDPR Article 14), and impact analysis before schema changes.
- Metadata captured at creation time is essential and nearly impossible to reconstruct later. The time to record source, purpose, and consent basis is at the point of collection.
- Deletion means addressing all copies: production, backups, archives, data warehouses, and derived datasets. Deleting a row from production is not compliant disposal under GDPR if copies persist elsewhere.
Standards and sources cited in this module
GDPR Regulation (EU) 2016/679
Article 5(1)(e) (Storage limitation), Article 14(2)(f) (Source disclosure)
Storage limitation principle governing retention and the right to know data sources. Both drive lifecycle management practices.
ISO/IEC 27001:2022
Annex A.8.10 (Information deletion)
Requires information to be deleted when it is no longer required. Cited in this module for the expectation that deletion is verifiable and documented.
NHS Records Management Code of Practice (2021)
Section 4 (Retention schedules)
Clinical record retention periods used in the quiz scenario. For GP patient records, the retention schedule specifies 10 years after the patient's death.
Storage limitation chapter
ICO interpretation of Article 5(1)(e) with practical examples for UK organisations.
NIST SP 800-188 (2023), De-Identifying Government Datasets
Full document
Guidance on de-identifying government datasets. Relevant to lifecycle management where data is de-identified for continued use or archiving rather than destroyed.
Module 9 of 26 · Data Foundations

