Module 22 of 26 · Practice & Strategy

Models, abstraction, and databases

30 min read · 3 outcomes · Interactive + terminal · 5 references

By the end of this module you will be able to:

  • Explain Codd's relational model and the guarantees ACID properties provide
  • Apply the CAP theorem to choose a database for a given consistency-availability trade-off
  • Identify the correct NoSQL model (document, key-value, column-family, graph) for a data pattern
Social network graph visualisation showing interconnected nodes representing follower relationships

Real-world architecture decision · 2012

Twitter migrates from relational to graph model for social connections

In 2012, Twitter's engineering team faced a specific scaling problem: the follower graph. Every time a user loaded their timeline, the system needed to find all accounts they followed, retrieve recent tweets from each, and rank them. With 500 million accounts and average follow counts in the hundreds, the SQL join required to compute a timeline involved potentially millions of rows per query.

The relational model represents relationships as foreign keys: a followers table with user_id and followed_id columns. Traversing this to find "all accounts followed by accounts followed by user X" (two hops) requires a self-join on a table with billions of rows, and each additional hop adds another join. The number of intermediate rows grows exponentially with traversal depth: roughly the average follow count raised to the number of hops.
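The two-hop traversal described above can be sketched as a SQL self-join. This is a minimal illustration using SQLite with an assumed `followers(user_id, followed_id)` schema and made-up account names, not Twitter's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE followers (user_id TEXT, followed_id TEXT)")
conn.executemany(
    "INSERT INTO followers VALUES (?, ?)",
    [("alice", "bob"), ("alice", "carol"),   # alice follows bob and carol
     ("bob", "dave"), ("carol", "dave"), ("carol", "erin")],
)

# "All accounts followed by accounts followed by alice" = a self-join.
# Each extra hop adds another join, so intermediate row counts grow
# roughly as (average follow count) ** hops — the scaling problem above.
two_hop = conn.execute(
    """
    SELECT DISTINCT f2.followed_id
    FROM followers AS f1
    JOIN followers AS f2 ON f2.user_id = f1.followed_id
    WHERE f1.user_id = ?
    """,
    ("alice",),
).fetchall()

print(sorted(row[0] for row in two_hop))  # ['dave', 'erin']
```

At five rows this is instant; the point is that the same query shape against billions of rows forces the planner through enormous intermediate result sets.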

Twitter instead moved social graph storage to a system (FlockDB, later replaced with more specialised graph infrastructure) where nodes are users and edges are follow relationships stored adjacently. Graph traversal becomes a pointer-following operation rather than a join. The abstraction matches the data: social networks are graphs, and graph databases model them directly.

Twitter had 500 million users, each with hundreds of followers and followees. The follower graph did not fit the relational model efficiently. What makes graph databases the right abstraction for social network relationships?

With the learning outcomes established, this module begins by examining the relational model and ACID in depth.

22.1 The relational model and ACID

Edgar Codd published "A Relational Model of Data for Large Shared Data Banks" in 1970. His insight was that data should be stored in tables (relations) with rows (tuples) and columns (attributes), and that the entire database should be accessible via a declarative query language that specified what data was wanted rather than how to retrieve it. SQL, developed at IBM in the mid-1970s and standardised by ANSI in 1986, became that language.

ACID properties define the guarantees a relational database provides for transactions. Atomicity means a transaction either completes fully or not at all: there is no partial update state. Consistency means the database moves from one valid state to another valid state, enforcing all defined constraints. Isolation means concurrent transactions do not see each other's in-progress changes (at the configured isolation level). Durability means committed transactions survive system crashes: the data is written to persistent storage before the commit is acknowledged.

These guarantees matter because real business operations involve multiple related changes. A bank transfer must debit one account and credit another atomically. If the system crashes after the debit but before the credit, the atomicity guarantee means the debit is rolled back. Without ACID, financial systems, inventory management, and booking systems would produce incorrect results during failures or concurrent access.
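The bank-transfer example can be made concrete with a short sketch. This uses SQLite (whose transactions are ACID for a single node) with illustrative table and account names; the `with conn:` block commits on success and rolls back on any exception, so the debit never survives a failed credit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst in one transaction; roll back on any error."""
    try:
        with conn:  # commit on success, rollback on exception (atomicity)
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            cur = conn.execute("SELECT balance FROM accounts WHERE name = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # forces rollback of the debit
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 60)   # succeeds: alice 40, bob 60
transfer(conn, "alice", "bob", 60)   # would overdraw: entire transaction rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 40, 'bob': 60} — the failed debit left no trace
```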

SQL query optimisation uses an execution plan: a tree of operations (scans, joins, sorts, aggregations) that the query planner selects to minimise cost. Indexes on frequently queried columns allow range scans instead of full table scans. EXPLAIN ANALYSE in PostgreSQL reveals the actual execution plan and row counts, which is the first tool to reach for when a query is slow.
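The effect of an index on the execution plan can be observed directly. The sketch below uses SQLite's EXPLAIN QUERY PLAN (the rough analogue of PostgreSQL's EXPLAIN ANALYSE) with an invented `orders` table: before the index the planner must scan the whole table; after it, the plan switches to an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders (customer, total) VALUES (?, ?)",
                 [(f"cust{i % 100}", i * 1.5) for i in range(1000)])

query = "SELECT total FROM orders WHERE customer = 'cust7'"

# Without an index, the planner has no choice but a full table scan.
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before)  # plan detail contains "SCAN"

# With an index on the queried column, the plan becomes an index search.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after)   # plan detail contains "USING INDEX idx_orders_customer"
```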

With an understanding of the relational model and ACID in place, the discussion can now turn to the CAP theorem and distributed databases, which builds directly on these foundations.

A relational database is self-describing. The data dictionary is itself stored in tables. This creates a uniform abstraction: everything the system knows about itself is queryable in the same language as the data.

E.F. Codd, 'A Relational Model of Data for Large Shared Data Banks', Communications of the ACM (1970) - Section 1

Codd's self-describing property is why relational databases have remained dominant for 50 years. The information schema (table names, column types, constraints, indexes) is itself a set of tables, queryable with SQL. This uniformity makes database introspection, migration tooling, and documentation generation tractable.
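The self-describing property can be demonstrated in any relational engine. The sketch below uses SQLite, where the catalogue table `sqlite_master` plays the role that the information schema plays in PostgreSQL: schema introspection is just another SELECT in the same language as the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# The data dictionary is itself a table, queryable with ordinary SQL.
catalogue = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY name"
).fetchall()
print(catalogue)  # includes ('table', 'users') and ('index', 'idx_users_email')
```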

The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise.

Edsger W. Dijkstra, The Humble Programmer (1972) - ACM Turing Award Lecture

Dijkstra's insight applies directly to data modelling. A well-designed abstract model does not hide detail; it organises detail into layers where each layer's decisions are precise and independent.


22.2 The CAP theorem and distributed databases

Eric Brewer's CAP theorem (2000, formally proved by Gilbert and Lynch in 2002) states that a distributed data system can simultaneously provide at most two of three guarantees: Consistency (every read receives the most recent write or an error), Availability (every request receives a response, without guarantee it is the most recent write), and Partition tolerance (the system continues operating when network partitions occur and messages between nodes are dropped).

In practice, network partitions in distributed systems are not optional: any distributed system must tolerate them. This means the real trade-off is between consistency and availability during a partition. A CP system (consistency + partition tolerance) will refuse to serve reads or writes it cannot guarantee are consistent during a partition. An AP system (availability + partition tolerance) will serve potentially stale reads rather than becoming unavailable.

The right choice depends on the application. A banking ledger requires CP: serving a stale balance could allow overdrafts. A product catalogue or social media feed tolerates AP: a slightly stale product count or a delayed post count is acceptable. Most modern distributed databases (Cassandra, DynamoDB, CockroachDB, MongoDB) implement tunable consistency, allowing the application to choose the consistency level per query.
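Tunable consistency usually rests on quorum arithmetic: with N replicas, a read quorum R and write quorum W are guaranteed to overlap (so a read always sees the latest write) exactly when R + W > N. A toy sketch of that rule, with every name illustrative and no resemblance to any real database's API:

```python
from dataclasses import dataclass

N = 3  # number of replicas

@dataclass
class Replica:
    value: object = None
    version: int = 0

replicas = [Replica() for _ in range(N)]

def write(value, version, w):
    """Write to a quorum of W replicas (here simply the first W)."""
    for rep in replicas[:w]:
        rep.value, rep.version = value, version

def read(indices):
    """Read the given replicas and return the highest-versioned value seen."""
    best = max((replicas[i] for i in indices), key=lambda rep: rep.version)
    return best.value

write("v1", version=1, w=2)   # W = 2: the write lands on replicas 0 and 1
print(read([1, 2]))           # R = 2, R + W = 4 > N: quorums overlap, reads "v1"
print(read([2]))              # R = 1, R + W = 3 = N: may read a stale None
```

Dialling W and R per query is exactly the per-query consistency choice the paragraph above describes.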

With an understanding of the CAP theorem and distributed databases in place, the discussion can now turn to NoSQL data models and when to use them, which builds directly on these foundations.

Common misconception

NoSQL databases do not support ACID transactions.

This was true of early NoSQL systems but is no longer accurate. MongoDB has supported multi-document ACID transactions since version 4.0 (2018). Amazon DynamoDB supports transactions across multiple items. CockroachDB and Google Spanner are distributed SQL databases offering full ACID guarantees at global scale. The modern choice is not between SQL (ACID) and NoSQL (no ACID) but between different data models and consistency trade-offs, many of which can now be implemented with transactional guarantees.

22.3 NoSQL data models and when to use them

Four primary NoSQL data models address different access patterns that the relational model handles poorly or inefficiently.

  • Document stores (MongoDB, Firestore) store JSON or BSON documents. Ideal for data with variable schema (product catalogues with different attributes per category), hierarchical data that would require many joins (a blog post with embedded comments), or rapid development where schema evolves frequently.
  • Key-value stores (Redis, DynamoDB) retrieve values by a primary key with microsecond latency. Ideal for caching, session storage, leaderboards, and rate limiting where the access pattern is always get/set by a known key.
  • Column-family stores (Apache Cassandra, HBase) store data in columns grouped into column families, optimised for wide-row writes and time-series data. Ideal for IoT sensor ingestion, event logs, and metrics where data is appended chronologically and queried by time range.
  • Graph databases (Neo4j, Amazon Neptune) store nodes and edges with arbitrary properties, optimised for multi-hop traversal. Ideal for social networks, recommendation engines, fraud detection (unusual transaction graph patterns), and knowledge graphs where relationship traversal is the primary operation.
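The graph model from the last bullet can be sketched with a plain adjacency dictionary: multi-hop traversal becomes pointer-following, with cost proportional to the edges actually touched rather than the size of a global join. The names below are illustrative:

```python
# Follower graph as adjacency sets: node -> accounts it follows.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave"},
    "carol": {"dave", "erin"},
    "dave": set(),
    "erin": set(),
}

def hops(graph, start, depth):
    """Return all nodes reachable from start in exactly `depth` hops."""
    frontier = {start}
    for _ in range(depth):
        # Follow each frontier node's outgoing edges — no join, just lookups.
        frontier = set().union(*(graph[n] for n in frontier)) if frontier else set()
    return frontier

print(sorted(hops(follows, "alice", 2)))  # ['dave', 'erin']
```

This is the abstraction-match argument from the Twitter case study: the two-hop query that needed a self-join in SQL is a two-step walk here.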
Modern production architectures use polyglot persistence: different databases for different workloads. PostgreSQL for transactional data, Redis for caching, Cassandra for time-series, and Neo4j for graph traversal.
22.4 Check your understanding

A payment processing system must ensure that when a customer pays for an order, the payment record and the order status update either both succeed or neither does. The system handles 10,000 transactions per second across three database nodes. Which ACID property is most critical here, and why?

An IoT platform ingests 50,000 temperature sensor readings per second, each with a sensor ID, timestamp, and temperature value. Readings are always appended (never updated) and queried by sensor ID and time range. Which database model is best suited to this workload?

A distributed database cluster experiences a network partition: nodes A and B can communicate with each other but not with node C. Under the CAP theorem, the system must choose between consistency and availability. A financial ledger application is using this database. Which choice is correct and why?


Key takeaways

  • Codd's relational model (1970) stores data in tables with rows and columns, queryable via declarative SQL. ACID properties (Atomicity, Consistency, Isolation, Durability) provide the correctness guarantees for concurrent, failure-prone environments.
  • The CAP theorem shows that distributed systems can guarantee at most two of: Consistency, Availability, Partition tolerance. Since partition tolerance is mandatory, the real choice is CP (refuse stale reads) vs AP (allow stale reads) during a network partition.
  • NoSQL models solve specific access patterns: document stores for variable-schema data, key-value for fast single-key lookup, column-family for high-throughput time-series, and graph for multi-hop relationship traversal.
  • Polyglot persistence uses different databases for different workloads in the same system. The right database choice depends on the access pattern, consistency requirement, and scale, not on any single technology preference.

Standards and sources cited in this module

  1. E.F. Codd, 'A Relational Model of Data for Large Shared Data Banks', ACM (1970)

    The founding paper of relational database theory. Introduced the concepts of relations, tuples, and the precursor to SQL.

  2. Seth Gilbert and Nancy Lynch, 'Brewer's Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services', ACM SIGACT News (2002)

    Formal proof of the CAP theorem. The authoritative reference for the consistency/availability trade-off in distributed systems.

  3. Amazon DynamoDB team, 'Dynamo: Amazon's Highly Available Key-value Store', SOSP (2007)

    Describes the eventual consistency model, consistent hashing, and vector clocks used by DynamoDB. Foundational to understanding AP-oriented distributed databases.

  4. Apache Cassandra documentation: Data Modelling

    Explains partition keys, clustering columns, and why column-family data models must be query-driven rather than entity-driven.

  5. Martin Fowler and Pramod Sadalage, 'NoSQL Distilled' (2012)

    A thorough comparison of NoSQL data models with guidance on choosing between them.
