This is the second of 8 Foundations modules. In Module 1, you learned what AI is and how AI, ML, and deep learning relate. This module examines the single most important factor in AI system quality: data.
The Amazon story is not an isolated failure. It is a predictable consequence of a principle that applies to every AI system: the quality, representativeness, and integrity of the training data determine the quality of the system's outputs. No amount of algorithmic sophistication can compensate for fundamentally flawed data.
If you already work with data regularly, use the knowledge checks to validate your understanding and skip to Module 3: How machines learn.
With the learning outcomes established, the module begins with an in-depth look at its first topic: garbage in, garbage out, and why data quality is everything.
The phrase "garbage in, garbage out" dates back to the 1950s, long before machine learning existed. But it has never been more relevant. In traditional software, a bug in the code produces a predictable wrong output. In machine learning, a problem in the data can produce an unpredictable wrong output that looks correct.
Andrew Ng, co-founder of Google Brain and Stanford adjunct professor, has argued that the AI community has been "model-centric" for too long, focusing on building better algorithms while neglecting data quality. His "data-centric AI" movement argues that for most practical applications, improving data quality yields larger gains than improving model architecture.
“Instead of focusing on the code, companies should focus on engineering the data used to develop AI systems.”
Andrew Ng - Data-Centric AI keynote, NeurIPS 2021 workshop
Ng's argument is empirically supported. In a 2021 landing.ai study, improving label consistency on a steel defect detection dataset increased model accuracy from 76.2% to 93.1% without any model changes. The model was already good enough; the data was the bottleneck.
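A label-consistency audit of this kind can be sketched in a few lines. The annotator names and labels below are invented for illustration and are not from the Landing AI study:

```python
# Minimal label-consistency check: compare two annotators' labels
# and surface the items they disagree on for re-labelling.
# All identifiers and labels here are illustrative assumptions.

annotator_a = {"img_001": "defect", "img_002": "ok", "img_003": "defect", "img_004": "ok"}
annotator_b = {"img_001": "defect", "img_002": "ok", "img_003": "ok", "img_004": "ok"}

shared = sorted(set(annotator_a) & set(annotator_b))
disagreements = [k for k in shared if annotator_a[k] != annotator_b[k]]
agreement_rate = 1 - len(disagreements) / len(shared)

print(f"Agreement: {agreement_rate:.0%}")   # Agreement: 75%
print("Re-label:", disagreements)           # Re-label: ['img_003']
```

Items the annotators disagree on are exactly the ones worth sending back for adjudication before any model is trained.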
With the garbage-in, garbage-out principle established, the discussion turns to the common data problems that produce such failures.
Data problems fall into several categories, including missing values, label errors, class imbalance, and encoded bias, each with different causes and different fixes.
With these common data problems covered, the discussion turns to the data pipeline: from raw data to model-ready features.
Common misconception
“More data always makes AI systems better”
More data helps only when the additional data is relevant, representative, and correctly labelled. Adding one million low-quality images does not help a medical imaging model. In many cases, a smaller, carefully curated dataset outperforms a larger noisy one. Teams should invest in data curation and quality assurance processes, not just data collection volume. A data quality budget is as important as a compute budget.
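The curated-versus-noisy claim can be illustrated with a toy experiment. The synthetic data, the 40% noise rate, and the deliberately simple 1-nearest-neighbour classifier are all assumptions made for this sketch, not a real benchmark:

```python
import numpy as np

# Sketch: a small, correctly labelled dataset versus a ten-times-larger
# dataset whose labels are 40% corrupted. Everything here is synthetic.
rng = np.random.default_rng(0)

def sample(n):
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, 2)) + 3.0 * y[:, None]  # class means (0,0) and (3,3)
    return X, y

def nn_predict(X_train, y_train, X):
    # predict the label of the nearest training point (1-nearest-neighbour)
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

X_test, y_test = sample(1_000)
X_small, y_small = sample(200)        # small but curated: labels correct

X_big, y_big = sample(2_000)          # large but noisy: 40% of labels flipped
flip = rng.random(2_000) < 0.4
y_big = np.where(flip, 1 - y_big, y_big)

acc_small = (nn_predict(X_small, y_small, X_test) == y_test).mean()
acc_big = (nn_predict(X_big, y_big, X_test) == y_test).mean()
print(f"200 clean samples: {acc_small:.2f}, 2000 noisy samples: {acc_big:.2f}")
```

With these settings the small clean set reliably wins: the flipped labels corrupt the neighbourhood structure the classifier depends on, and no amount of extra volume repairs that.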
Common misconception
“AI can overcome biased data through better algorithms”
Algorithms learn patterns in the data. If the data contains systematic bias, the algorithm will learn that bias. Debiasing techniques exist (re-sampling, re-weighting, adversarial debiasing) but they mitigate rather than eliminate the problem. Organisations must audit their data pipelines for representativeness before training models. Fairness constraints applied after training are a remediation, not a solution.
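Of the mitigations named above, re-weighting is the simplest to sketch: give each sample a weight inversely proportional to its group's frequency so that every group contributes equally to the training loss. The group labels below are illustrative assumptions:

```python
from collections import Counter

# Re-weighting sketch: an 80/15/5 group split, with weights chosen so
# each group carries equal total weight in the loss. Groups are invented.
groups = ["A"] * 80 + ["B"] * 15 + ["C"] * 5

counts = Counter(groups)
n, k = len(groups), len(counts)
weights = [n / (k * counts[g]) for g in groups]

# Each group now carries the same total weight (n / k), so the majority
# group can no longer dominate the training signal.
total_by_group = {g: sum(w for w, gg in zip(weights, groups) if gg == g)
                  for g in counts}
print(total_by_group)
```

Note that this balances representation in the loss; as the paragraph above stresses, it mitigates rather than eliminates bias that is baked into the labels themselves.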
Data does not arrive ready for model training. It passes through a pipeline of stages, typically collection, cleaning, labelling, and feature transformation, each of which can introduce or remove problems.
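A minimal sketch of such a pipeline, with each stage as an explicit, testable step. The field names and cleaning rules are assumptions for illustration, not a real schema:

```python
# Toy pipeline: deduplicate -> clean/normalise -> impute missing values.
# Records and rules are invented to show the stage-by-stage structure.

raw = [
    {"id": 1, "age": "34", "country": "UK", "income": "52000"},
    {"id": 1, "age": "34", "country": "UK", "income": "52000"},   # duplicate
    {"id": 2, "age": "",   "country": "uk", "income": "41000"},   # missing age
    {"id": 3, "age": "29", "country": "US", "income": "61000"},
]

def deduplicate(rows):
    seen, out = set(), []
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def clean(rows):
    for r in rows:
        r["country"] = r["country"].upper()
        r["age"] = int(r["age"]) if r["age"] else None
        r["income"] = float(r["income"])
    return rows

def impute_age(rows):
    known = sorted(r["age"] for r in rows if r["age"] is not None)
    median = known[len(known) // 2]
    for r in rows:
        if r["age"] is None:
            r["age"] = median
    return rows

records = impute_age(clean(deduplicate(raw)))
print(records[1])  # id 2: country normalised, age imputed
```

Keeping each stage separate makes it possible to audit where a problem entered the data, which is exactly the kind of pipeline inspection the bias discussion above calls for.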
With the data pipeline mapped, the discussion turns to feature engineering: helping models see what matters.
Raw data often contains the signal a model needs, but in a form it cannot easily learn from. Feature engineering transforms raw inputs into representations that make patterns more visible to the model.
Classic examples of effective feature engineering include extracting day-of-week and hour from raw timestamps, combining raw fields into ratios such as debt-to-income, and log-transforming heavily skewed values.
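A minimal sketch of transforms like these; the column names and values are illustrative assumptions:

```python
from datetime import datetime

# Feature engineering sketch: turn raw fields into representations the
# model can learn from. Record fields here are invented for illustration.
def engineer(record):
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "day_of_week": ts.weekday(),                      # 0 = Monday
        "hour": ts.hour,                                  # exposes daily cycles
        "debt_to_income": record["debt"] / record["income"],
        **record,
    }

row = engineer({"timestamp": "2024-03-15T14:30:00", "debt": 12000, "income": 48000})
print(row["day_of_week"], row["hour"], row["debt_to_income"])  # 4 14 0.25
```

A raw timestamp string is nearly useless to a model, but day-of-week and hour make weekly and daily patterns directly visible; the ratio likewise encodes a relationship the two raw fields only imply.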
For deep learning, manual feature engineering is less critical because the network learns its own features. But even in deep learning, thoughtful data preparation (augmentation, normalisation, tokenisation) remains essential.
“Applied machine learning is basically feature engineering.”
Andrew Ng - Stanford CS229 lecture notes
While deep learning has reduced the need for manual feature engineering in some domains (vision, NLP), Ng's point remains true for the majority of production ML systems, which still run classical algorithms on tabular data. Feature engineering is where domain expertise translates into model performance.
A bank builds a loan approval model trained on historical decisions. The model denies loans to applicants from certain postcodes at a higher rate, even when their financial profile is similar to approved applicants from other areas. What is the most likely root cause?
A data scientist has 50,000 labelled images for training a medical imaging model. 49,500 are normal and 500 show the disease. She trains the model and reports 99% accuracy. What is the problem?
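The arithmetic behind this question can be made concrete. The always-"normal" baseline below is a hypothetical model, not the data scientist's, but it exposes why 99% accuracy is meaningless here:

```python
import numpy as np

# The imbalance above in numbers: a "model" that always predicts "normal"
# scores 99% accuracy while detecting zero disease cases.
y_true = np.array([0] * 49_500 + [1] * 500)   # 0 = normal, 1 = disease
y_pred = np.zeros_like(y_true)                # always predict "normal"

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()           # fraction of disease cases found

print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
# accuracy = 99.00%, recall = 0.00%
```

On imbalanced data, metrics such as recall, precision, or the F1 score on the minority class tell you what accuracy hides.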
During data cleaning, a team discovers that 8% of records in their customer dataset have missing values for the 'annual_income' field. Which approach is most appropriate?
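One commonly recommended option, sketched below with invented values, is to impute the median and keep an explicit missingness flag, since the fact that a value is missing can itself be predictive:

```python
import numpy as np

# Median imputation with a "was missing" indicator column.
# The income values are illustrative assumptions.
income = np.array([52_000.0, np.nan, 41_000.0, 61_000.0, np.nan, 48_000.0])

missing = np.isnan(income)                 # keep this as an extra feature
median = np.nanmedian(income)              # robust to skewed incomes
imputed = np.where(missing, median, income)

print(median, imputed.tolist(), missing.astype(int).tolist())
```

The median is preferred over the mean for fields like income because it is robust to the long right tail; dropping 8% of records outright would discard information and can bias the sample if the values are not missing at random.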
Reuters, 'Amazon scraps secret AI recruiting tool that showed bias against women' (October 2018)
Full article
Primary source for the opening case study. Documents how Amazon's experimental hiring tool learned gender bias from historical resume data. The system was never used as the sole determinant in hiring but demonstrated the principle of data-encoded bias.
Curtis Northcutt, Anish Athalye and Jonas Mueller, 'Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks' (NeurIPS 2021)
Section 3 (Label Error Prevalence), Table 1
Demonstrated that major ML benchmarks (ImageNet, CIFAR-10, MNIST) contain 3-10% label errors. These errors systematically affect model training and evaluation, undermining confidence in published benchmark results.
Andrew Ng, 'Data-Centric AI' (NeurIPS 2021 workshop keynote)
Opening keynote
Ng's argument that the AI community should shift from model-centric to data-centric approaches. Cited the landing.ai steel defect example where improving label consistency increased accuracy from 76.2% to 93.1% without model changes.
UK Information Commissioner's Office, 'Guidance on AI and Data Protection' (2023)
Section on Fairness in AI
UK regulatory guidance on ensuring AI systems comply with data protection principles. Relevant to the data quality and bias discussion because GDPR Article 22 gives individuals the right not to be subject to solely automated decisions with legal effects.
Ninareh Mehrabi et al., 'A Survey on Bias and Fairness in Machine Learning' (ACM Computing Surveys, 2021)
Section 3 (Types of Bias), Section 5 (Mitigation Approaches)
Thorough taxonomy of 23 types of bias in ML systems. Used as background for the bias discussion in Section 2.2. Covers historical bias, representation bias, measurement bias, and aggregation bias, among others.
You now understand why data is the foundation of every AI system and how data problems create system failures. The next question is: once you have good data, how does a machine actually learn from it? Module 3 introduces the three paradigms of machine learning: supervised, unsupervised, and reinforcement learning.
Module 2 of 24 · AI Foundations