APIs, Integration, and Event-Driven Patterns
By the end of this module you will be able to:
- Compare REST, GraphQL, and gRPC for a given integration scenario and justify the appropriate choice with concrete trade-offs
- Apply a versioning strategy that allows APIs to evolve without breaking existing consumers, and define what constitutes a breaking change
- Distinguish between event notification, event-carried state transfer, and event sourcing, and match each pattern to appropriate use cases

Real-world programme · Stripe · 2010 to present
Seven lines of code. One trillion dollars in payment volume. The API became the product.
Stripe launched in 2010 with an integration that took seven lines of code and replaced weeks of bank negotiation, legal review, and technical implementation. Before Stripe, accepting card payments required a merchant acquiring relationship with a bank, a payment gateway contract, PCI compliance certification, and an in-house integration team. Stripe reduced all of that to a single API call.
By 2023, Stripe processed over $1 trillion in annual payment volume. The ecosystem built on that API is worth multiples more than Stripe itself: Shopify, Deliveroo, Monzo, and thousands of other businesses built their payment infrastructure on Stripe's API. The API was not the interface to the product; the API was the product. Every design decision Stripe made about its API, from version naming (Stripe uses dated versions: 2023-10-16) to webhook payload structure, had consequences for millions of downstream integrations.
The previous module covered how to measure what digital systems produce. This module covers how digital systems communicate: which API pattern to select for a given integration scenario, how to version APIs without breaking existing consumers, and which event-driven pattern fits which problem.
When your API becomes more valuable than your underlying service, what does that tell you about the economics of integration?
With the learning outcomes established, this module begins by examining REST, GraphQL, and gRPC in depth.
9.1 REST, GraphQL, and gRPC
Three API paradigms dominate modern digital integration. Each addresses a different set of integration constraints, and choosing the wrong one for the context creates friction that compounds over years.
REST (Representational State Transfer) is the most widely adopted pattern for public and partner APIs. Roy Fielding defined REST in his 2000 doctoral dissertation at UC Irvine, specifying six architectural constraints: client-server separation, statelessness, cacheability, a uniform interface, a layered system, and optional code on demand. The uniform interface constraint is the most consequential: REST APIs use standard HTTP methods (GET, POST, PUT, PATCH, DELETE) against resource URLs, making them predictable and straightforward to consume without bespoke tooling.
REST's limitation is over-fetching and under-fetching. A REST endpoint returns whatever its designer chose to return. A mobile client needing only a user's name and profile image from an endpoint that returns 40 fields fetches 40 fields over a constrained mobile network. A client needing data from three related resources makes three sequential or parallel requests. Neither is efficient.
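To make the cost concrete, here is a minimal sketch using Python's requests library; the endpoint paths and field names are hypothetical, chosen only to illustrate the two failure modes.

```python
import requests

BASE = "https://api.example.com"  # hypothetical REST API

# Over-fetching: the endpoint returns every field its designer chose,
# but this mobile client uses only two of them.
user = requests.get(f"{BASE}/users/42").json()
name, avatar = user["name"], user["avatar_url"]  # remaining fields discarded

# Under-fetching: data spread across related resources means extra round trips.
orders = requests.get(f"{BASE}/users/42/orders").json()
addresses = requests.get(f"{BASE}/users/42/addresses").json()
```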
GraphQL was developed at Meta (Facebook) in 2012, open-sourced in 2015, and transferred to the GraphQL Foundation in 2018. Clients specify exactly which fields they need in a single query. The server returns only those fields. GitHub's v4 API and Shopify's Storefront API use GraphQL because their consumers span mobile apps, web clients, and third-party integrators with radically different data requirements. The trade-off is server-side complexity: a GraphQL server must implement a typed schema, a resolver for every field, query depth limiting to prevent denial-of-service through deeply nested queries, and a caching strategy that cannot rely on HTTP cache headers alone.
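The same mobile-profile need expressed as one GraphQL request, again as a sketch: the /graphql endpoint and schema fields are assumptions for illustration, not any real provider's schema.

```python
import requests

# The client names exactly the fields it needs; the server returns no more.
query = """
query MobileProfile($id: ID!) {
  user(id: $id) {
    name
    avatarUrl
    orders(first: 5) { total status }
    addresses { postcode }
  }
}
"""

resp = requests.post(
    "https://api.example.com/graphql",                 # hypothetical endpoint
    json={"query": query, "variables": {"id": "42"}},
)
profile = resp.json()["data"]["user"]  # one round trip, no discarded fields
```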
gRPC, a remote procedure call framework open-sourced by Google, uses Protocol Buffers (Protobuf) as its serialisation format and HTTP/2 as transport. Binary serialisation is faster than JSON; HTTP/2 multiplexing eliminates the per-request connection overhead of HTTP/1.1. gRPC generates strongly typed client and server stubs from a .proto schema definition, making integration type-safe by default. It is best suited to internal service-to-service communication in microservices architectures where both ends of the connection are under organisational control and performance matters. Browsers do not natively support gRPC (gRPC-Web exists but requires a proxy layer), which rules it out for most public APIs.
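A minimal client sketch under the assumption of a hypothetical payments.proto: the payments_pb2 modules below would be produced by the protoc code generator, not written by hand.

```python
import grpc

# Generated from a hypothetical payments.proto via grpc_tools.protoc;
# module, service, and message names are illustrative.
import payments_pb2
import payments_pb2_grpc

# One HTTP/2 channel, multiplexed across calls; Protobuf binary on the wire.
channel = grpc.insecure_channel("payments.internal:50051")
stub = payments_pb2_grpc.PaymentsStub(channel)

# Request/response types come from the schema, so the call is typed end to end.
request = payments_pb2.ChargeRequest(amount_pence=1999, currency="GBP")
response = stub.CreateCharge(request)
print(response.charge_id)
```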
REST for public and partner APIs. GraphQL for diverse client data requirements. gRPC for internal service-to-service calls where performance is a constraint. These are default starting positions informed by context, not absolute rules.
Common misconception
“REST is always better than GraphQL because it is simpler.”
REST is simpler for simple resource models with uniform data requirements across consumers. When different consumer types need substantially different subsets of entity data, REST requires either multiple endpoints (increasing API surface area and maintenance burden) or over-fetching (increasing response size and network cost). GraphQL eliminates both problems at the cost of server-side complexity. The choice depends on the diversity of consumer data requirements, not on an absolute simplicity ranking.
REST, GraphQL, and gRPC each describe the shape of an API. Once an API is deployed and consumers depend on it, the challenge shifts to managing change without breaking integrations. Section 9.2 covers API versioning strategies.
9.2 API versioning strategies
API versioning manages changes to an API in a way that does not break existing consumers when the API evolves. A well-designed versioning strategy must define three things: how versions are named, what constitutes a breaking change, and how long deprecated versions are maintained before removal.
URL versioning embeds the version in the URL path: /api/v1/payments, /api/v2/payments. This is the most visible approach and the easiest to route at the API gateway layer. Its weakness is that it encourages hard-coding of version numbers by consumers, creating a migration forcing function when a version is deprecated.
Header versioning specifies the desired version in an HTTP request header, for example Stripe-Version: 2023-10-16. Stripe uses dated versioning: each API version is named by the date it was released. An integration built against 2023-10-16 continues to receive the exact API behaviour from that date, regardless of later changes, until the integration explicitly opts into a newer version. This approach is optimal for platforms with many integrators who cannot all upgrade simultaneously.
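The two naming schemes side by side, as a client sketch. The Stripe-Version header and api.stripe.com endpoint follow Stripe's published documentation; the /api/v2 URL is a generic illustration.

```python
import requests

# URL versioning: the consumer hard-codes the version into the path.
requests.get("https://api.example.com/api/v2/payments")

# Dated header versioning, Stripe-style: the path stays stable and the
# consumer pins the behaviour it was built and tested against.
requests.get(
    "https://api.stripe.com/v1/charges",
    headers={
        "Authorization": "Bearer sk_test_...",  # placeholder credential
        "Stripe-Version": "2023-10-16",
    },
)
```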
A breaking change is any modification that causes existing consumer code to fail or behave incorrectly without modification. Removing a field, changing a field's data type, changing a response status code, or removing an endpoint are breaking changes. Adding a new optional field to a response is not a breaking change for most consumers, though strictly typed consumers using schema validation may reject unexpected fields.
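A small sketch of why this asymmetry holds, using a hypothetical payment payload:

```python
# Response shape a consumer was originally built against (hypothetical).
payment = {"id": "pay_123", "amount": 1999, "currency": "GBP"}

# Breaking changes: removing "currency" raises KeyError below; changing
# "amount" to a decimal string silently corrupts the arithmetic.
total_pence = payment["amount"] + 50
currency = payment["currency"]

# Adding a new optional field is invisible to the lookups above...
payment["capture_method"] = "automatic"

# ...but a strictly validating consumer treats the same addition as breaking.
allowed = {"id", "amount", "currency"}
unexpected = set(payment) - allowed
if unexpected:
    print(f"strict consumer rejects response: {unexpected}")
```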
The UK Government API Standards recommend a minimum of six months' notice before removing or breaking a versioned API. For APIs consumed by automated payroll, benefit eligibility, or tax filing systems, six months represents the minimum time for consuming organisations to plan, develop, test, and deploy updated integrations. Many commercial API agreements specify longer periods by contract.
“API providers should give a minimum of six months' notice before removing or breaking a versioned API.”
UK Government API Technical and Data Standards - api.gov.uk/standards
This standard reflects the operational reality that API consumers embed version dependencies deeply into production systems with long change cycles. A government benefit eligibility API used by 200 local authority case management systems cannot be broken on 30 days' notice: local authorities have procurement processes, change management requirements, and testing environments that cannot be mobilised in under six months. The standard reflects the minimum, not the target.
Request-response APIs (REST, GraphQL, gRPC) work well when a consumer needs data on demand. But some systems need to react to things happening in real time, without polling. Section 9.3 introduces event-driven architecture and its three distinct patterns.
9.3 Event-driven architecture patterns
Event-driven architecture (EDA) is a design paradigm where components communicate by publishing events to a shared infrastructure rather than calling each other directly. Producers do not know who will consume their events; consumers do not know who produced them. This decoupling allows independent scaling and deployment of services, and allows new consumers to be added without changing producer code.
Three distinct event patterns exist within EDA. Event notification publishes a minimal event (typically an identifier and a timestamp) and expects consumers to call back for any detail they need. Event-carried state transfer includes the relevant state in the event itself, eliminating callbacks at the cost of larger payloads. Event sourcing treats the event log itself as the system of record. Choosing the wrong pattern for the context introduces either unnecessary coupling (notification when state transfer is needed) or unnecessary payload size and complexity (sourcing when notification is sufficient). The choice is driven by the data requirements of consumers and the audit obligations of the system.
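A sketch of the payload difference between the first two patterns, with hypothetical event shapes:

```python
# Event notification: minimal payload; every consumer that needs detail
# must call back to the producer (or another API) to fetch it.
notification = {
    "type": "inventory.updated",
    "sku": "SKU-8812",
    "occurred_at": "2024-03-01T09:15:00Z",
}

# Event-carried state transfer: the event carries the state consumers need,
# eliminating callbacks at the cost of a larger payload.
state_transfer = {
    "type": "inventory.updated",
    "sku": "SKU-8812",
    "occurred_at": "2024-03-01T09:15:00Z",
    "quantity_on_hand": 412,
    "unit_cost_pence": 745,
    "warehouse": "LTN-2",
}
```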
“Event Sourcing is the idea that we can ensure every change to the state of an application is captured in an event object, and that these event objects are stored in the sequence they were applied for the same lifetime as the application state itself.”
Fowler, M., Event Sourcing - martinfowler.com, 2005
Fowler's definition highlights what distinguishes event sourcing from the other two patterns: the event log is the system of record, not a notification mechanism or a data transfer channel. Current state is derived from the log, not stored separately. This provides a complete audit trail by design, but requires rebuilding current state by replaying the log, which adds complexity that is only justified when the audit requirement is a core system property.
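A minimal sketch of deriving current state by replay, with a hypothetical order lifecycle:

```python
# The append-only log is the system of record; no current-state table exists.
events = [
    {"type": "OrderPlaced", "order_id": "o-1", "total_pence": 4200},
    {"type": "OrderPaid", "order_id": "o-1"},
    {"type": "OrderShipped", "order_id": "o-1"},
]

def current_state(log):
    """Rebuild current state by replaying every event in order."""
    state = {}
    for event in log:
        order = state.setdefault(event["order_id"], {})
        if event["type"] == "OrderPlaced":
            order["status"] = "placed"
            order["total_pence"] = event["total_pence"]
        elif event["type"] == "OrderPaid":
            order["status"] = "paid"
        elif event["type"] == "OrderShipped":
            order["status"] = "shipped"
    return state

print(current_state(events))
# {'o-1': {'status': 'shipped', 'total_pence': 4200}}
```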

Event-driven patterns describe the logical model for how systems communicate. Section 9.4 covers the infrastructure that implements EDA in practice: message brokers, and when to choose Kafka versus RabbitMQ versus a managed cloud alternative.
9.4 Message brokers
A message broker receives events from producers and routes them to consumers, decoupling the two ends of the communication. Producers publish to topics or queues without knowing the identity or location of consumers. Consumers subscribe without knowing who produces the events they receive. New consumers can be added without modifying producer code.
Apache Kafka was developed at LinkedIn, open-sourced in 2011, and became the dominant open-source event streaming platform. Kafka's core abstraction is a durable, ordered, partitioned, append-only log. Events written to a Kafka topic are retained for a configurable period (days to indefinitely) and can be replayed from any offset. Consumer groups maintain their own offsets, allowing independent consumers to read from different positions in the same log. Kafka is appropriate for high-volume, high-throughput scenarios: financial transaction streams, change data capture (CDC) events, clickstream data, and telemetry.
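A consumer sketch using the kafka-python client; the broker address, topic, and group names are assumptions. The group_id and auto_offset_reset settings are what make the log replayable per consumer group.

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Each consumer group keeps its own offsets; a second group on the same
# topic reads independently without affecting this one.
consumer = KafkaConsumer(
    "payments.transactions",           # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="fraud-detection",
    auto_offset_reset="earliest",      # new group: replay from start of retention
)

for message in consumer:
    # Offsets advance per group; other groups' read positions are untouched.
    print(message.offset, message.value)
```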
RabbitMQ is a queue-based message broker: messages are acknowledged and removed from the queue once consumed, so RabbitMQ does not retain them for replay. It is appropriate for task queues, work distribution, and scenarios where each message should be processed once and then discarded rather than replayed.
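The contrasting queue semantics with the pika client (broker address and queue name assumed): once a message is acknowledged, it is gone, with no offset to rewind to.

```python
import pika  # pip install pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="resize-images", durable=True)

def handle(ch, method, properties, body):
    print(f"processing task: {body!r}")
    # Acknowledgement deletes the message from the queue permanently.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="resize-images", on_message_callback=handle)
channel.start_consuming()
```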
AWS SQS and AWS SNS are managed queue and notification services. SQS provides at-least-once delivery with configurable message retention up to 14 days; SNS provides fan-out to multiple subscribers. Both are appropriate for cloud-native architectures on AWS where operational overhead matters more than the advanced features Kafka provides.
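The same consume-acknowledge cycle on SQS with boto3, as a sketch; the queue URL is a placeholder. Deleting the message is SQS's form of acknowledgement.

```python
import boto3

sqs = boto3.client("sqs", region_name="eu-west-2")
queue_url = "https://sqs.eu-west-2.amazonaws.com/123456789012/orders"  # placeholder

# At-least-once delivery: a message can arrive more than once, so the
# handling below should be idempotent.
resp = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
)
for msg in resp.get("Messages", []):
    print(msg["Body"])  # stand-in for real processing
    # Undeleted messages reappear after the visibility timeout.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```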
Choosing between Kafka and managed queue services requires answering three questions: is event replay needed (if so, Kafka); what is the throughput requirement (Kafka handles millions of events per second; managed queues handle thousands); and how much operational expertise is available (Kafka cluster management is operationally demanding; managed services such as SQS and SNS require almost none).
Common misconception
“Kafka is just a message queue.”
Kafka is a distributed, partitioned, replicated log. Unlike a queue, where messages are removed after consumption, Kafka retains all events for a configured retention period. Multiple independent consumer groups can read from the same topic at different offsets. Messages can be replayed from any historical position. This persistent log model is what enables the event sourcing pattern and CDC-to-bronze pipeline ingestion. A traditional message queue cannot support these use cases because it discards messages after delivery.
Message brokers decouple producers and consumers. CQRS takes separation of concerns one step further by splitting the data model itself. Section 9.5 explains CQRS and when this extra complexity is justified.
9.5 CQRS overview
CQRS (Command Query Responsibility Segregation) separates the write model of a system (commands that change state) from the read model (queries that return data). The two models can use different data structures, different storage technologies, and different consistency guarantees, each optimised for its specific purpose.
The motivation for CQRS is the tension between write and read optimisation. A data model normalised for efficient writes (minimal duplication, foreign key relationships) is often poorly suited to complex analytical reads that require joining many tables. Conversely, a denormalised read model optimised for fast dashboards is cumbersome to keep consistent when writes occur. CQRS allows each model to be optimised independently.
HMRC applied CQRS to the Making Tax Digital (MTD) tax account read service. The write model records each VAT submission event in normalised form; the read model materialises a pre-computed view of the taxpayer's current compliance position, updated asynchronously after each write event. The read service can return a taxpayer's full compliance view in under 100 milliseconds because the view is pre-built, not computed on demand.
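An illustrative sketch of the pattern's moving parts, not HMRC's implementation: a normalised write model, an asynchronous projector, and a pre-computed read view.

```python
import queue
import threading
import time

submissions = []        # write model: normalised, append-only (system of record)
events = queue.Queue()  # async channel between write and read sides
compliance_view = {}    # read model: pre-computed view per taxpayer

def submit_vat_return(taxpayer_id, period, amount_pence):
    record = {"taxpayer": taxpayer_id, "period": period, "amount": amount_pence}
    submissions.append(record)  # the write commits first...
    events.put(record)          # ...then an event updates the read side

def projector():
    while True:
        record = events.get()
        view = compliance_view.setdefault(record["taxpayer"], {"periods_filed": 0})
        view["periods_filed"] += 1
        view["last_period"] = record["period"]

threading.Thread(target=projector, daemon=True).start()

submit_vat_return("tp-001", "2024-Q1", 125_000)
time.sleep(0.1)  # reads are eventually consistent: the view briefly lags
print(compliance_view["tp-001"])  # {'periods_filed': 1, 'last_period': '2024-Q1'}
```

The brief lag between the write committing and the projector updating the view is exactly the eventual-consistency window explored in the final exercise below.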
CQRS adds significant complexity: two data models, asynchronous synchronisation between them, and eventual consistency for reads. It is appropriate only when a demonstrated performance or consistency problem cannot be solved by simpler means. Most digital services do not need CQRS and should not adopt it as a default architectural pattern.

Scenario exercises
A UK government department is building an open API for a land registry service. External consumers include mortgage lenders (needing title and charge data), conveyancers (needing planning and environmental history), and property search platforms (needing address, price, and description only). The technical architect is choosing between REST with multiple consumer-specific endpoints and GraphQL. Which recommendation is best justified?
A retail platform publishes an inventory.updated event containing only the SKU identifier and a timestamp. During a Black Friday peak, three consuming systems (finance, warehouse, and fraud detection) simultaneously receive the event and each makes a separate callback to the pricing API for current cost data. The pricing API becomes a bottleneck under the combined load. Which pattern change most directly resolves the root cause?
A digital team is evaluating event sourcing for its order management system. The head of engineering argues the pattern is appropriate because they need a full audit trail of every order state change for regulatory compliance. A senior architect agrees on the requirement but questions the implementation approach. Which response best justifies the architect's caution?
A team implements CQRS for an e-commerce platform. After a customer places an order, the order confirmation page shows 'No orders found' for 2 seconds before displaying the order. What causes this and how should it be addressed?
Key takeaways
- REST suits public and partner APIs with uniform data requirements; GraphQL suits diverse client data requirements by allowing field-level selection; gRPC suits internal service-to-service communication where binary serialisation performance matters.
- API versioning must be backed by a breaking change policy and a minimum deprecation notice period. UK Government API Standards recommend at least six months' notice before removing or breaking a versioned API.
- Event notification keeps payloads minimal but requires consumers to call back for data; event-carried state transfer includes all data in the event, eliminating callbacks at the cost of larger payloads; event sourcing treats the event log as the system of record.
- Kafka is a persistent, partitioned, replayable log, not a message queue. Its retention model enables event replay, consumer group independence, and CDC-to-bronze ingestion. Traditional queues (SQS, RabbitMQ) discard messages after delivery and cannot support these patterns.
- CQRS separates the write model from the read model, allowing each to be optimised independently. It is appropriate when a demonstrated performance or consistency problem exists, not as a default architectural choice.
- Stripe's dated versioning (2023-10-16) demonstrates that versioning strategy is a product decision with ecosystem-scale consequences. Every API design choice propagates to millions of downstream integrations.
Standards and sources cited in this module
Fielding, R.T., Architectural Styles and the Design of Network-based Software Architectures
Doctoral dissertation, UC Irvine, 2000
Original definition of the six REST architectural constraints. Referenced in Section 9.1 as the foundational source for REST design properties.
Fowler, M., Event Sourcing
martinfowler.com, 2005
Canonical description of event sourcing as treating the event log as the system of record. Quoted in Section 9.3.
UK Government API Technical and Data Standards
api.gov.uk/standards
Six-month minimum deprecation notice requirement for versioned APIs. Quoted in Section 9.2 and used as the governance standard for API lifecycle management.
Stripe API Documentation and Versioning Policy
stripe.com/docs/api/versioning
Primary case study source for the opening story and for dated API versioning strategy in Section 9.2. Stripe's 2023-10-16 versioning format is cited as a reference implementation.
Apache Kafka Documentation
kafka.apache.org/documentation
Distributed log architecture, consumer group offset model, and partition-based throughput scaling. Referenced in Section 9.4 for broker selection trade-offs.
GraphQL Specification
spec.graphql.org
Field selection, schema, and resolver model. Referenced in Section 9.1 for GraphQL trade-offs compared to REST.
Integration patterns connect systems. The next module takes a wider view: capability maps, value streams, and enterprise architecture frameworks that help organisations decide where to invest in digital capability and where to accept constraints.
Module 9 of 15 in Applied