This is the second of 8 Foundations modules. In Module 1, you learned what AI is and how AI, ML, and deep learning relate. This module examines the single most important factor in AI system quality: data.
The Amazon story is not an isolated failure. It is a predictable consequence of a principle that applies to every AI system: the quality, representativeness, and integrity of the training data determine the quality of the system's outputs. No amount of algorithmic sophistication can compensate for fundamentally flawed data.
If you already work with data regularly, use the knowledge checks to validate your understanding and skip to Module 3: How machines learn.
With the learning outcomes established, the module begins with an in-depth look at its first topic: garbage in, garbage out, and why data quality is everything.
The phrase "garbage in, garbage out" dates back to the 1950s, long before machine learning existed. But it has never been more relevant. In traditional software, a bug in the code produces a predictable wrong output. In machine learning, a problem in the data can produce an unpredictable wrong output that looks correct.
Andrew Ng, co-founder of Google Brain and Stanford adjunct professor, has argued that the AI community has been "model-centric" for too long, focusing on building better algorithms while neglecting data quality. His "data-centric AI" movement argues that for most practical applications, improving data quality yields larger gains than improving model architecture.
“Instead of focusing on the code, companies should focus on engineering the data used to develop AI systems.”
Andrew Ng - Data-Centric AI keynote, NeurIPS 2021 workshop
Ng's argument is empirically supported. In a 2021 landing.ai study, improving label consistency on a steel defect detection dataset increased model accuracy from 76.2% to 93.1% without any model changes. The model was already good enough; the data was the bottleneck.
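A label-consistency audit of this kind can be sketched in a few lines. The annotator names and labels below are invented for illustration and are not from the Landing AI study:

```python
# Minimal label-consistency check: compare two annotators' labels
# and surface the items they disagree on for re-labelling.
# All identifiers and labels here are illustrative assumptions.

annotator_a = {"img_001": "defect", "img_002": "ok", "img_003": "defect", "img_004": "ok"}
annotator_b = {"img_001": "defect", "img_002": "ok", "img_003": "ok", "img_004": "ok"}

shared = sorted(set(annotator_a) & set(annotator_b))
disagreements = [k for k in shared if annotator_a[k] != annotator_b[k]]
agreement_rate = 1 - len(disagreements) / len(shared)

print(f"Agreement: {agreement_rate:.0%}")   # Agreement: 75%
print("Re-label:", disagreements)           # Re-label: ['img_003']
```

Items the annotators disagree on are exactly the ones worth sending back for adjudication before any model is trained.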
With the garbage-in, garbage-out principle established, the discussion turns to the common data problems that produce such failures.
Data problems fall into several categories, including missing values, label errors, class imbalance, and encoded bias, each with different causes and different fixes.
With these common data problems covered, the discussion turns to the data pipeline: from raw data to model-ready features.
Common misconception
“More data always makes AI systems better”
More data helps only when the additional data is relevant, representative, and correctly labelled. Adding one million low-quality images does not help a medical imaging model. In many cases, a smaller, carefully curated dataset outperforms a larger noisy one. Teams should invest in data curation and quality assurance processes, not just data collection volume. A data quality budget is as important as a compute budget.
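The curated-versus-noisy claim can be illustrated with a toy experiment. The synthetic data, the 40% noise rate, and the deliberately simple 1-nearest-neighbour classifier are all assumptions made for this sketch, not a real benchmark:

```python
import numpy as np

# Sketch: a small, correctly labelled dataset versus a ten-times-larger
# dataset whose labels are 40% corrupted. Everything here is synthetic.
rng = np.random.default_rng(0)

def sample(n):
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, 2)) + 3.0 * y[:, None]  # class means (0,0) and (3,3)
    return X, y

def nn_predict(X_train, y_train, X):
    # predict the label of the nearest training point (1-nearest-neighbour)
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

X_test, y_test = sample(1_000)
X_small, y_small = sample(200)        # small but curated: labels correct

X_big, y_big = sample(2_000)          # large but noisy: 40% of labels flipped
flip = rng.random(2_000) < 0.4
y_big = np.where(flip, 1 - y_big, y_big)

acc_small = (nn_predict(X_small, y_small, X_test) == y_test).mean()
acc_big = (nn_predict(X_big, y_big, X_test) == y_test).mean()
print(f"200 clean samples: {acc_small:.2f}, 2000 noisy samples: {acc_big:.2f}")
```

With these settings the small clean set reliably wins: the flipped labels corrupt the neighbourhood structure the classifier depends on, and no amount of extra volume repairs that.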
Common misconception
“AI can overcome biased data through better algorithms”
Algorithms learn patterns in the data. If the data contains systematic bias, the algorithm will learn that bias. Debiasing techniques exist (re-sampling, re-weighting, adversarial debiasing) but they mitigate rather than eliminate the problem. Organisations must audit their data pipelines for representativeness before training models. Fairness constraints applied after training are a remediation, not a solution.
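Of the mitigations named above, re-weighting is the simplest to sketch: give each sample a weight inversely proportional to its group's frequency so that every group contributes equally to the training loss. The group labels below are illustrative assumptions:

```python
from collections import Counter

# Re-weighting sketch: an 80/15/5 group split, with weights chosen so
# each group carries equal total weight in the loss. Groups are invented.
groups = ["A"] * 80 + ["B"] * 15 + ["C"] * 5

counts = Counter(groups)
n, k = len(groups), len(counts)
weights = [n / (k * counts[g]) for g in groups]

# Each group now carries the same total weight (n / k), so the majority
# group can no longer dominate the training signal.
total_by_group = {g: sum(w for w, gg in zip(weights, groups) if gg == g)
                  for g in counts}
print(total_by_group)
```

Note that this balances representation in the loss; as the paragraph above stresses, it mitigates rather than eliminates bias that is baked into the labels themselves.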
Data does not arrive ready for model training. It passes through a pipeline of stages, typically collection, cleaning, labelling, and feature transformation, each of which can introduce or remove problems.
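A minimal sketch of such a pipeline, with each stage as an explicit, testable step. The field names and cleaning rules are assumptions for illustration, not a real schema:

```python
# Toy pipeline: deduplicate -> clean/normalise -> impute missing values.
# Records and rules are invented to show the stage-by-stage structure.

raw = [
    {"id": 1, "age": "34", "country": "UK", "income": "52000"},
    {"id": 1, "age": "34", "country": "UK", "income": "52000"},   # duplicate
    {"id": 2, "age": "",   "country": "uk", "income": "41000"},   # missing age
    {"id": 3, "age": "29", "country": "US", "income": "61000"},
]

def deduplicate(rows):
    seen, out = set(), []
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def clean(rows):
    for r in rows:
        r["country"] = r["country"].upper()
        r["age"] = int(r["age"]) if r["age"] else None
        r["income"] = float(r["income"])
    return rows

def impute_age(rows):
    known = sorted(r["age"] for r in rows if r["age"] is not None)
    median = known[len(known) // 2]
    for r in rows:
        if r["age"] is None:
            r["age"] = median
    return rows

records = impute_age(clean(deduplicate(raw)))
print(records[1])  # id 2: country normalised, age imputed
```

Keeping each stage separate makes it possible to audit where a problem entered the data, which is exactly the kind of pipeline inspection the bias discussion above calls for.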
With the data pipeline mapped, the discussion turns to feature engineering: helping models see what matters.
Raw data often contains the signal a model needs, but in a form it cannot easily learn from. Feature engineering transforms raw inputs into representations that make patterns more visible to the model.
Classic examples of effective feature engineering include extracting day-of-week and hour from raw timestamps, combining raw fields into ratios such as debt-to-income, and log-transforming heavily skewed values.
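A minimal sketch of transforms like these; the column names and values are illustrative assumptions:

```python
from datetime import datetime

# Feature engineering sketch: turn raw fields into representations the
# model can learn from. Record fields here are invented for illustration.
def engineer(record):
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "day_of_week": ts.weekday(),                      # 0 = Monday
        "hour": ts.hour,                                  # exposes daily cycles
        "debt_to_income": record["debt"] / record["income"],
        **record,
    }

row = engineer({"timestamp": "2024-03-15T14:30:00", "debt": 12000, "income": 48000})
print(row["day_of_week"], row["hour"], row["debt_to_income"])  # 4 14 0.25
```

A raw timestamp string is nearly useless to a model, but day-of-week and hour make weekly and daily patterns directly visible; the ratio likewise encodes a relationship the two raw fields only imply.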
For deep learning, manual feature engineering is less critical because the network learns its own features. But even in deep learning, thoughtful data preparation (augmentation, normalisation, tokenisation) remains essential.
“Applied machine learning is basically feature engineering.”
Andrew Ng - Stanford CS229 lecture notes
While deep learning has reduced the need for manual feature engineering in some domains (vision, NLP), Ng's point remains true for the majority of production ML systems, which still run classical algorithms on tabular data. Feature engineering is where domain expertise translates into model performance.
A bank builds a loan approval model trained on historical decisions. The model denies loans to applicants from certain postcodes at a higher rate, even when their financial profile is similar to approved applicants from other areas. What is the most likely root cause?
A data scientist has 50,000 labelled images for training a medical imaging model. 49,500 are normal and 500 show the disease. She trains the model and reports 99% accuracy. What is the problem?
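The arithmetic behind this question can be made concrete. The always-"normal" baseline below is a hypothetical model, not the data scientist's, but it exposes why 99% accuracy is meaningless here:

```python
import numpy as np

# The imbalance above in numbers: a "model" that always predicts "normal"
# scores 99% accuracy while detecting zero disease cases.
y_true = np.array([0] * 49_500 + [1] * 500)   # 0 = normal, 1 = disease
y_pred = np.zeros_like(y_true)                # always predict "normal"

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()           # fraction of disease cases found

print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
# accuracy = 99.00%, recall = 0.00%
```

On imbalanced data, metrics such as recall, precision, or the F1 score on the minority class tell you what accuracy hides.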
During data cleaning, a team discovers that 8% of records in their customer dataset have missing values for the 'annual_income' field. Which approach is most appropriate?
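One commonly recommended option, sketched below with invented values, is to impute the median and keep an explicit missingness flag, since the fact that a value is missing can itself be predictive:

```python
import numpy as np

# Median imputation with a "was missing" indicator column.
# The income values are illustrative assumptions.
income = np.array([52_000.0, np.nan, 41_000.0, 61_000.0, np.nan, 48_000.0])

missing = np.isnan(income)                 # keep this as an extra feature
median = np.nanmedian(income)              # robust to skewed incomes
imputed = np.where(missing, median, income)

print(median, imputed.tolist(), missing.astype(int).tolist())
```

The median is preferred over the mean for fields like income because it is robust to the long right tail; dropping 8% of records outright would discard information and can bias the sample if the values are not missing at random.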
Reuters, 'Amazon scraps secret AI recruiting tool that showed bias against women' (October 2018)
Full article
Primary source for the opening case study. Documents how Amazon's experimental hiring tool learned gender bias from historical resume data. The system was never used as the sole determinant in hiring but demonstrated the principle of data-encoded bias.
Curtis Northcutt, Anish Athalye and Jonas Mueller, 'Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks' (NeurIPS 2021)
Section 3 (Label Error Prevalence), Table 1
Demonstrated that major ML benchmarks (ImageNet, CIFAR-10, MNIST) contain 3-10% label errors. These errors systematically affect model training and evaluation, undermining confidence in published benchmark results.
Andrew Ng, 'Data-Centric AI' (NeurIPS 2021 workshop keynote)
Opening keynote
Ng's argument that the AI community should shift from model-centric to data-centric approaches. Cited the landing.ai steel defect example where improving label consistency increased accuracy from 76.2% to 93.1% without model changes.
UK Information Commissioner's Office, 'Guidance on AI and Data Protection' (2023)
Section on Fairness in AI
UK regulatory guidance on ensuring AI systems comply with data protection principles. Relevant to the data quality and bias discussion because GDPR Article 22 gives individuals the right not to be subject to solely automated decisions with legal effects.
Ninareh Mehrabi et al., 'A Survey on Bias and Fairness in Machine Learning' (ACM Computing Surveys, 2021)
Section 3 (Types of Bias), Section 5 (Mitigation Approaches)
Thorough taxonomy of 23 types of bias in ML systems. Used as background for the bias discussion in Section 2.2. Covers historical bias, representation bias, measurement bias, and aggregation bias, among others.
You now understand why data is the foundation of every AI system and how data problems create system failures. The next question is: once you have good data, how does a machine actually learn from it? Module 3 introduces the three paradigms of machine learning: supervised, unsupervised, and reinforcement learning.
Module 2 of 24 · AI Foundations