Designing Data‑Intensive Applications – Detailed Summary and Notes

Martin Kleppmann’s Designing Data‑Intensive Applications (DDIA) is one of the best books on how to design reliable, scalable, and maintainable data systems. These notes summarise the book chapter by chapter in practical terms for backend / system‑design work.

Use this as:

A map of the territory before (or while) reading the book
A system‑design refresher before interviews or architecture work

The book is organised in three parts:

Part I – Foundations of Data Systems (chapters 1–4)
Part II – Distributed Data (chapters 5–9)
Part III – Derived Data (chapters 10–12)

Part I – Foundations of Data Systems

Chapter 1 – Reliable, Scalable, and Maintainable Applications

The book starts by defining the three main goals of modern backend systems:

Reliability: The system continues to work correctly (correct service for users, no data loss or corruption) even when things go wrong.
Scalability: The system copes with increased load (more users, more data, more requests) by adding hardware or optimising design.
Maintainability: The system is easy to work on over time: it is observable, simple enough to understand, and can evolve.

Key ideas:

Failures are normal
- Hardware fails (disks, RAM, network links), software has bugs, and humans make mistakes.
- Design for fault tolerance instead of assuming everything works.
Scalability is about load and performance
- Define load with metrics relevant to your system (requests/sec, active users, write throughput, GB/day, etc.).
- Measure performance as throughput and response time, and report response time as percentiles (p50, p95, p99) rather than only the “average”, which hides tail latency.
- To cope with load: scale up (bigger machines) or scale out (more machines); each has trade‑offs.
Maintainability pillars
- Operability: metrics, logs, health checks, automation, predictable behaviour.
- Simplicity: keep designs minimal and avoid unnecessary complexity (layers, abstractions, “clever” hacks).
- Evolvability: easy to change and extend over time with tests, modularity, and clear boundaries.
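
To make the point about percentiles concrete, here is a minimal sketch that compares the mean with p50/p95/p99 for a handful of response times. The sample values and the nearest‑rank `percentile` helper are illustrative assumptions, not something taken from the book.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, with one slow outlier.
latencies_ms = [12, 14, 15, 15, 16, 17, 18, 20, 22, 950]

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean:.1f} p50={p50} p95={p95} p99={p99}")
# prints: mean=109.9 p50=16 p95=950 p99=950
```

The mean (109.9 ms) describes no real request: the typical user sees about 16 ms, while the slowest requests take close to a second. That gap is why the median and high percentiles are reported separately.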

Chapter 2 – Data Models and Query Languages

This chapter compares how we model data and how we query it.

Relational model vs document model
- Relational: tables, rows, columns, strict schema; great for joins and long‑term data integrity.
- Document (e.g. JSON): nested structures, flexible schemas; often better when data is naturally hierarchical and mostly accessed by key.
- Trade‑off: normalisation and joins vs denormalisation and duplication.
NoSQL and the object‑relational mismatch
- ORMs struggle to map rich object graphs to flat tables (joins, lookup tables).
- Document stores avoid some of this by storing an object graph in one document, but then cross‑document relationships and joins are harder.
Graph data models
- For highly connected data (social graphs, knowledge graphs), relational or document models become clumsy.
- Property graphs (nodes + edges with labels/properties) and triple stores (subject–predicate–object) are more natural.
Query languages
- Declarative (SQL, Cypher, SPARQL): say what you want, not how to compute it; the engine figures out the plan.
- Imperative / MapReduce‑style: you write the control flow (map, group, reduce) explicitly.
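
The declarative/imperative contrast can be sketched with the same question ("how many users per country?") answered both ways. The table name, column names, and sample rows are made up for illustration; SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical sample data: (name, country).
rows = [("alice", "NL"), ("bob", "US"), ("carol", "NL"), ("dave", "DE")]

# Imperative: we spell out the control flow and the intermediate state.
counts = {}
for _name, country in rows:
    counts[country] = counts.get(country, 0) + 1
imperative = sorted(counts.items())

# Declarative: we describe the result; the engine chooses how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, country TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
declarative = conn.execute(
    "SELECT country, COUNT(*) FROM users GROUP BY country ORDER BY country"
).fetchall()
conn.close()

print(imperative)
print(declarative)
```

Both versions produce the same list of (country, count) pairs. The difference is that the SQL version leaves the engine free to use an index, reorder work, or parallelise without any change to the query.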

The takeaway: choose a data model that matches the shape of your data and your access patterns, not just whatever your favourite database supports.

Chapter 3 – Storage and Retrieval

Internally, databases rely on a few core index structures:

Hash indexes: map keys to positions; great for equality lookups (WHERE id = ?) but not ordered scans.
B‑trees: balanced tree of pages on disk; support range scans, prefix queries, and ordered indexes.
LSM‑trees / SSTables: log‑structured merge trees (e.g. LevelDB, RocksDB, Cassandra).
- Writes append to a log and in‑memory structure; background compaction merges sorted segments.
- Typically faster for writes than B‑trees, but a read may have to check the in‑memory structure and several segments; Bloom filters let the engine skip segments that cannot contain the key.
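
As a rough illustration of the hash‑index idea above, the sketch below appends every write to a log and keeps an in‑memory dictionary from key to the byte offset of the latest record. The class name and record format are assumptions for this example; an in‑memory buffer stands in for the data file, and there is no compaction or crash recovery.

```python
import io


class LogWithHashIndex:
    """Append-only log plus an in-memory hash index (key -> byte offset).

    Assumes keys contain no commas or newlines and values contain no newlines.
    """

    def __init__(self):
        self._log = io.BytesIO()  # stands in for an append-only file on disk
        self._index = {}          # key -> offset of the most recent record

    def put(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        offset = self._log.seek(0, io.SEEK_END)  # always write at the end
        self._log.write(record)
        self._index[key] = offset                # older records become garbage

    def get(self, key):
        offset = self._index.get(key)
        if offset is None:
            return None
        self._log.seek(offset)                   # one seek, one sequential read
        line = self._log.readline().decode("utf-8").rstrip("\n")
        _key, _sep, value = line.partition(",")
        return value


db = LogWithHashIndex()
db.put("user:1", "alice")
db.put("user:2", "bob")
db.put("user:1", "alicia")  # overwrite = append a newer record

print(db.get("user:1"))  # prints: alicia
print(db.get("user:3"))  # prints: None
```

Equality lookups cost one dictionary access and one read, but the dictionary has no ordering, so a range scan would have to visit every key. That is the limitation B‑trees and SSTables address by keeping keys sorted.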

The chapter also explains:

OLTP vs OLAP
- OLTP (transactional): many small reads/writes (CRUD for web/mobile apps).
- OLAP (analytics): big scans, aggregations, reporting, data warehousing.
Column‑oriented storage (for analytics)
- Store data by column instead of by row; compress well and scan fewer bytes when querying a subset of columns.
- Need different write strategies (often write‑optimised then compact/merge later).
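
A small sketch of why column layout helps analytics: the same rows are pivoted into one list per column, an aggregate touches only the column it needs, and a sorted, repetitive column compresses well with run‑length encoding. The sample sales rows and the helper name are assumptions for illustration.

```python
# Hypothetical fact-table rows: (date, product, quantity), sorted by product.
rows = [
    ("2024-01-01", "apple", 3),
    ("2024-01-01", "apple", 1),
    ("2024-01-02", "apple", 2),
    ("2024-01-02", "pear", 5),
    ("2024-01-03", "pear", 4),
]
COLUMN_NAMES = ("date", "product", "quantity")

# Column-oriented layout: one contiguous list per column.
columns = {
    name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)
}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


# SUM(quantity) reads a single column and ignores date and product entirely.
total_quantity = sum(columns["quantity"])

# Sorted, low-cardinality columns shrink to a few runs.
encoded_products = run_length_encode(columns["product"])

print(total_quantity)    # prints: 15
print(encoded_products)  # prints: [('apple', 3), ('pear', 2)]
```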

Main idea: understanding index and storage structures helps you reason about performance and choose the right database engine for your workload.

Chapter 4 – Encoding and Evolution

Applications need to encode data (to disk, over the network) and evolve schemas over time without breaking compatibility.

Encoding formats
- Language‑specific (Java serialization, .NET, etc.) — not great for evolution or cross‑language use.
- Text formats (JSON, XML) — human readable, flexible, but loose typing and larger.
- Binary formats with schemas (Thrift, Protocol Buffers, Avro) — compact, strongly typed, good for evolution.
Schema evolution
- For backward compatibility, new code must be able to read old data.
- For forward compatibility, old code must be able to read data written by new code, typically by ignoring fields it does not recognise.
- Schema definition languages and versioning rules (e.g. only add optional fields, avoid renames) make evolution safer.
Dataflow patterns
- Data through databases (one app writes, another reads).
- Data through services (REST/RPC).
- Data through asynchronous messages / logs.

Big idea: treat schemas as contracts and design encoding so you can deploy changes gradually (blue/green, rolling updates) without breaking running systems.
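
The two compatibility directions from the schema‑evolution list can be sketched with JSON and two versions of a reader. The `UserV1`/`UserV2` types and the `email` field are hypothetical; schema‑based binary formats such as Protocol Buffers or Avro enforce similar rules through field tags or schema resolution rather than hand‑written decoders.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserV1:
    id: int
    name: str


@dataclass
class UserV2:
    id: int
    name: str
    email: Optional[str] = None  # added later; optional with a default


def decode_v1(payload: str) -> UserV1:
    """Old code: picks the fields it knows and ignores anything else."""
    data = json.loads(payload)
    return UserV1(id=data["id"], name=data["name"])


def decode_v2(payload: str) -> UserV2:
    """New code: tolerates records written before 'email' existed."""
    data = json.loads(payload)
    return UserV2(id=data["id"], name=data["name"], email=data.get("email"))


old_record = json.dumps({"id": 1, "name": "Ada"})
new_record = json.dumps({"id": 2, "name": "Grace", "email": "grace@example.com"})

# Backward compatibility: new code reads old data.
print(decode_v2(old_record))
# Forward compatibility: old code reads new data.
print(decode_v1(new_record))
```

During a rolling update both versions run side by side, so both directions have to hold at once. Adding only optional fields with defaults is what keeps each decoder working.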

Part II – Distributed Data

Chapter 5 – Replication

Replication means keeping copies of the same data on multiple nodes for high availability, lower latency (data closer to users), and higher read throughput.

Replication styles:

Single‑leader (primary–replica)
- One leader handles writes; followers replicate from the leader’s log.
- Reads can go to followers (eventual consistency) or leader (stronger consistency).
- Need strategies for adding followers, handling failover, and managing replication lag.
Multi‑leader
- Several leaders accept writes (e.g. geo‑distributed).
- Need conflict resolution (e.g. last‑write‑wins, merge functions, CRDTs).
- Useful when offline / multi‑datacenter writes are required, but more complex.
Leaderless (e.g. Dynamo‑style)
- Clients write to multiple replicas; reads query multiple replicas and reconcile.
- Use quorums: R + W > N for N replicas.
- Handle failures with sloppy quorums, hinted handoff, and mechanisms to detect concurrent writes (vector clocks, version vectors).

Replication lag makes replicas stale, so the book discusses guarantees that limit its visible effects: read‑your‑writes, monotonic reads, and consistent prefix reads. The right level of consistency is application‑specific.
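
The quorum condition for leaderless replication can be checked by brute force, and a toy read/write shows why the overlap matters. This is a single‑process sketch under simplifying assumptions: the `Replica` class is hypothetical, and versions are plain integers supplied by the caller, whereas real Dynamo‑style stores track concurrency with version vectors.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """True if every possible write set of size w shares at least one
    node with every possible read set of size r (out of n replicas)."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


class Replica:
    def __init__(self):
        self.version = 0
        self.value = None


def quorum_write(replicas, targets, version, value):
    """Apply the write on the chosen replicas only (the others are 'down')."""
    for i in targets:
        if version > replicas[i].version:
            replicas[i].version = version
            replicas[i].value = value


def quorum_read(replicas, targets):
    """Ask the chosen replicas and keep the answer with the highest version."""
    newest = max((replicas[i] for i in targets), key=lambda rep: rep.version)
    return newest.value


replicas = [Replica() for _ in range(3)]
quorum_write(replicas, targets=[0, 1], version=1, value="a")  # W = 2

fresh = quorum_read(replicas, targets=[1, 2])  # R = 2, overlaps on node 1
stale = quorum_read(replicas, targets=[2])     # R = 1, may miss the write

print(quorums_overlap(3, 2, 2), quorums_overlap(3, 2, 1))  # prints: True False
print(fresh, stale)                                        # prints: a None
```

With N = 3, choosing W = 2 and R = 2 satisfies R + W > N, so every read set contains at least one replica that saw the latest write. With R = 1 the sum is only 3, and a read can land entirely on the replica that missed it.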

Chapter 6 – Partitioning

Partitioning (sharding) splits data across multiple nodes so you can scale writes and storage.

Partitioning schemes
- By key range: ranges of keys per partition (good for range scans, but risk of hot spots).
- By hash of key: distributes load more evenly but loses key ordering, so a range query has to be sent to all partitions.
Hot spots and skew: some keys/partitions can get more traffic; you may need salted keys, random suffixes, or more dynamic rebalancing.
Secondary indexes in a partitioned world
- Partition by document or by term; each has different routing and fan‑out trade‑offs.
Rebalancing and routing
- Strategies: fixed number of partitions vs dynamic; consistent hashing, lookup tables, or coordination services for routing.
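
The two partitioning schemes can be sketched as two routing functions. The partition count, the split points, and the function names are assumptions for illustration. MD5 is used only as a stable hash, because Python's built‑in `hash()` for strings changes between processes.

```python
import bisect
import hashlib

NUM_PARTITIONS = 4

# Key-range partitioning: sorted split points chosen for this example.
# Partition 0 holds keys < "g", 1 holds "g".."m", 2 holds "n".."s", 3 the rest.
RANGE_BOUNDARIES = ["g", "n", "t"]


def range_partition(key: str) -> int:
    """Route by position in the sorted key space; neighbours stay together."""
    return bisect.bisect_right(RANGE_BOUNDARIES, key)


def hash_partition(key: str) -> int:
    """Route by a stable hash of the key; neighbours are scattered."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


keys = ["apple", "grape", "melon", "zebra"]
print([range_partition(k) for k in keys])  # prints: [0, 1, 1, 3]

# Timestamp-like keys that sort together all hit one range partition (a hot
# spot), while the hash scheme decides each one independently.
hot_keys = [f"ts:2024-01-01T00:00:0{i}" for i in range(5)]
print({range_partition(k) for k in hot_keys})
print([hash_partition(k) for k in hot_keys])
```

A range scan such as "all keys from g to m" needs only partition 1 under the range scheme, but every partition under the hash scheme. That is the price of the more even spread.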

Takeaway: partitioning is as much an operational problem (rebalancing, routing, monitoring) as a data‑modelling one.

Chapter 7 – Transactions

Transactions provide isolation and atomicity on top of storage and replication.

ACID
- Atomicity, Consistency, Isolation, Durability.
- The book emphasises that ACID guarantees differ between implementations, and that many systems offer weaker forms (e.g. weak isolation levels) rather than the full set.
Isolation levels
- Read committed, snapshot isolation, repeatable read, serializable.
- Phenomena: dirty reads, non‑repeatable reads, lost updates, write skew, phantoms.
Implementations
- Locks (two‑phase locking).
- MVCC (multi‑version concurrency control) – snapshot isolation with versioned rows.
- Serializable snapshot isolation (SSI) – optimistic: transactions run on a snapshot, and the database detects dangerous read/write patterns and aborts the offending transaction, giving full serializability without blocking locks.
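
The MVCC idea can be sketched as a store that keeps every committed version of a key and lets each transaction read as of its start. Everything here is a simplified, single‑threaded assumption: the class names are hypothetical, timestamps are a simple counter, and the commit‑time check is a "first committer wins" rule that catches lost updates only (it is not SSI and does not prevent write skew).

```python
class ConflictError(Exception):
    """Raised when a written key changed after the transaction's snapshot."""


class MVCCStore:
    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), ascending
        self._clock = 0      # timestamp of the latest commit

    def begin(self):
        return Transaction(self, snapshot_ts=self._clock)


class Transaction:
    def __init__(self, store, snapshot_ts):
        self._store = store
        self._snapshot_ts = snapshot_ts
        self._writes = {}    # buffered until commit

    def read(self, key):
        if key in self._writes:  # read your own uncommitted write
            return self._writes[key]
        # Newest version that was committed at or before our snapshot.
        for commit_ts, value in reversed(self._store._versions.get(key, [])):
            if commit_ts <= self._snapshot_ts:
                return value
        return None

    def write(self, key, value):
        self._writes[key] = value

    def commit(self):
        store = self._store
        for key in self._writes:
            versions = store._versions.get(key, [])
            if versions and versions[-1][0] > self._snapshot_ts:
                raise ConflictError(f"{key!r} changed since snapshot")
        store._clock += 1
        for key, value in self._writes.items():
            store._versions.setdefault(key, []).append((store._clock, value))


store = MVCCStore()
setup = store.begin()
setup.write("counter", 10)
setup.commit()

t1 = store.begin()
t2 = store.begin()
t1.write("counter", t1.read("counter") + 1)
t1.commit()

seen_by_t2 = t2.read("counter")  # still 10: t2 reads from its snapshot
t2.write("counter", seen_by_t2 + 1)
try:
    t2.commit()
    t2_outcome = "committed"
except ConflictError:
    t2_outcome = "aborted"       # prevents the lost update

print(seen_by_t2, t2_outcome, store.begin().read("counter"))
# prints: 10 aborted 11
```

Readers never block writers here because old versions stay around. The cost is that a transaction may have to be retried when it loses the race at commit time.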

Lesson: many production databases default to weaker isolation for performance. You must understand what guarantees you really get.

Chapter 8 – The Trouble with Distributed Systems

Distributed systems are hard because partial failures and uncertainty are normal:

Networks drop, delay, or reorder messages.
Processes pause (GC, OS scheduling).
Clocks drift and are unreliable for global ordering.

Concepts discussed:

Network faults and timeouts; no perfect failure detector.
Synchronous vs asynchronous models; real networks and cloud environments give no guaranteed upper bound on delay, so the book treats a partially synchronous model as the realistic one.
Clocks (time‑of‑day vs monotonic), NTP accuracy limits, dangers of relying on tightly synchronized clocks.
Byzantine faults vs crash faults (most systems only plan for crashes).

Bottom line: assume any component can fail, and design protocols (e.g. retries, idempotency, timeouts, leader election) that tolerate that uncertainty.
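
One way to picture retries plus idempotency: the client cannot tell "request lost" from "reply lost", so it retries with the same key and the server deduplicates. The service, its method names, and the deterministic "drop the first N replies" failure mode are all hypothetical, chosen so the sketch behaves the same on every run.

```python
class FlakyPaymentService:
    """Hypothetical server that applies each request but loses some replies."""

    def __init__(self, replies_to_drop):
        self._replies_to_drop = replies_to_drop
        self._results = {}  # idempotency key -> result of the first attempt
        self.balance = 0
        self.calls = 0

    def charge(self, idempotency_key, amount):
        self.calls += 1
        if idempotency_key not in self._results:
            # First time we see this key: apply the side effect exactly once.
            self.balance += amount
            self._results[idempotency_key] = self.balance
        if self._replies_to_drop > 0:
            # The work is done, but the client never hears about it.
            self._replies_to_drop -= 1
            raise TimeoutError("response lost")
        return self._results[idempotency_key]


def charge_with_retries(service, idempotency_key, amount, max_attempts=5):
    """Retry on timeout, always reusing the same idempotency key."""
    for _attempt in range(max_attempts):
        try:
            return service.charge(idempotency_key, amount)
        except TimeoutError:
            continue  # a real client would back off before retrying
    raise TimeoutError(f"gave up after {max_attempts} attempts")


service = FlakyPaymentService(replies_to_drop=2)
result = charge_with_retries(service, "order-42", 100)

print(result, service.balance, service.calls)  # prints: 100 100 3
```

The request reached the server three times but the balance changed once. Without the idempotency key, the same retries would have charged 300.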

Chapter 9 – Consistency and Consensus

This chapter ties together consistency models and consensus algorithms:

Consistency guarantees
- Linearizability: operations appear to take effect atomically in a single global order. Strong but expensive.
- Weaker models allow stale reads or reordering but can be more available or performant.
Ordering and total order broadcast
- Total order broadcast lets multiple nodes observe messages in the same order; underpins many systems (replication logs, state machine replication).
Distributed transactions and atomic commit
- Two‑phase commit (2PC) and its blocking behaviour on failures.
Fault‑tolerant consensus
- Algorithms like Paxos / Raft provide agreement on a value (e.g. a leader, a log entry) in the presence of failures.
- Coordination services (e.g. ZooKeeper, etcd) expose building blocks: leader election, locks, configuration storage.
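
The shape of two‑phase commit can be sketched in a few lines. This is an in‑process toy with hypothetical class and function names; it has no network, timeouts, or durable logging, which are exactly the parts that make real 2PC hard.

```python
class Participant:
    """Hypothetical resource manager taking part in an atomic commit."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Voting yes is a promise: the participant must be able to commit later.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        if self.state != "prepared":
            raise RuntimeError(f"{self.name} cannot commit from {self.state}")
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # A real coordinator would write the decision to durable storage here.
    # If it crashed at this point, participants in "prepared" would be stuck
    # "in doubt" until it recovered; that is the blocking behaviour.
    # Phase 2 (commit or abort): broadcast the decision to everyone.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


happy = [Participant("orders"), Participant("payments")]
unhappy = [Participant("orders"), Participant("payments", can_commit=False)]

print(two_phase_commit(happy), [p.state for p in happy])
# prints: True ['committed', 'committed']
print(two_phase_commit(unhappy), [p.state for p in unhappy])
# prints: False ['aborted', 'aborted']
```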

Takeaway: strong consistency (linearizable, globally ordered logs) is possible but costly; use it where you really need it (metadata, coordination), and use weaker models elsewhere.

Part III – Derived Data

Part III explains that storage systems are not isolated: we often copy and transform data into other forms (indexes, caches, search, analytics) to make it more useful. These are derived views of the primary data.

Chapter 10 – Batch Processing

Batch systems process large volumes of data offline:

Unix philosophy: many small tools connected with pipes; MapReduce is the same idea at cluster scale.
MapReduce and Hadoop
- map to transform and emit key–value pairs; shuffle to group by key; reduce to aggregate.
- Good for log analysis, ETL, large joins.
Data warehousing vs Hadoop
- Warehouses: structured schema, columnar storage, SQL.
- Hadoop: more flexible storage and processing, but often higher latency.
Beyond MapReduce
- Systems like Spark, Flink, etc. add better execution models, in‑memory data, DAGs of stages, and higher‑level APIs.
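
The map, shuffle, and reduce steps can be sketched on a single machine with a word count. The function names and input lines are illustrative assumptions; a real framework runs the same three phases across many machines and writes intermediate data to disk or over the network.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each input record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Call the reducer once per key with all of that key's values."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(_key, values):
    return sum(values)


lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)

print(counts)
# prints: {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

The mapper and reducer are pure functions of their input, so a failed task can simply be rerun on the same data.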

Key idea: batch jobs are like long‑running functions over an immutable dataset, great for heavy computation and backfilling.

Chapter 11 – Stream Processing

Streams are unbounded sequences of events (logs, user actions, sensor data). Instead of waiting for a batch, you process events continuously:

Messaging systems and logs (Kafka‑like): append‑only, partitioned logs; consumers read at their own pace.
Databases and streams
- Change data capture (CDC) turns DB changes into streams.
- Event sourcing stores state as events; the current state is a projection.
Processing streams
- Stateless operations: filters, maps.
- Stateful operations: windows, aggregations, joins with tables or other streams.
- Time is tricky: event time vs processing time; out‑of‑order events; watermarks.
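
Event time, out‑of‑order arrival, and watermarks can be sketched with a tumbling‑window counter. The window size, the event list, and the watermark rule (highest event time seen minus an allowed lateness) are assumptions chosen for illustration; real stream processors offer several watermark strategies.

```python
from collections import defaultdict

WINDOW_SECONDS = 60


def window_start(event_time):
    """Start of the tumbling window that contains this event time."""
    return event_time - (event_time % WINDOW_SECONDS)


def tumbling_counts(events, allowed_lateness):
    """Count events per event-time window.

    events: (event_time_seconds, label) pairs in ARRIVAL order.
    Returns (emitted, dropped): emitted is a list of (window_start, count),
    dropped lists events that arrived after their window had been closed.
    """
    open_windows = defaultdict(int)
    watermark = float("-inf")
    emitted, dropped = [], []

    for event_time, label in events:
        start = window_start(event_time)
        if start + WINDOW_SECONDS <= watermark:
            dropped.append((event_time, label))  # too late: window is closed
            continue
        open_windows[start] += 1
        # Assumed heuristic: we expect nothing older than this from now on.
        watermark = max(watermark, event_time - allowed_lateness)
        closable = sorted(
            s for s in open_windows if s + WINDOW_SECONDS <= watermark
        )
        for s in closable:
            emitted.append((s, open_windows.pop(s)))

    for s in sorted(open_windows):  # end of input: flush what is still open
        emitted.append((s, open_windows.pop(s)))
    return emitted, dropped


# Arrival order differs from event-time order: "d" is late but tolerated,
# "f" arrives after its window has already been emitted.
events = [(5, "a"), (20, "b"), (65, "c"), (50, "d"), (130, "e"), (10, "f")]
emitted, dropped = tumbling_counts(events, allowed_lateness=30)

print(emitted)  # prints: [(0, 3), (60, 1), (120, 1)]
print(dropped)  # prints: [(10, 'f')]
```

A larger allowed lateness drops fewer stragglers but delays every result, which is the latency versus completeness trade‑off behind watermarks.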

Lesson: batch and stream processing are two sides of the same idea—functions over immutable data—with different latency trade‑offs.

Chapter 12 – The Future of Data Systems

The final chapter looks at composing many specialised tools into coherent systems:

Data integration using derived data: each tool (DB, search index, cache, analytics system) has its own schema, but they are kept in sync via batch/stream pipelines.
Designing around dataflow rather than around individual databases: think in terms of flows of events and materialized views.
Correctness, constraints, and end‑to‑end arguments: some checks are best done in the application, some in the DB; often both.
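Timeliness vs integrity: timeliness means users see up‑to‑date state; integrity means no data is lost or corrupted. The book argues integrity matters more: stale data eventually catches up, whereas corrupted data stays wrong until it is explicitly repaired.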
Ethics and privacy: data systems can cause harm; we need to design for privacy, transparency, and responsible use.

Core takeaways for system design

Start from requirements around reliability, scalability, and maintainability; then choose tools and patterns.
Understand data models and storage engines so you can predict performance and choose the right DB (relational, document, key‑value, graph, columnar, etc.).
Once you distribute data, you must confront replication, partitioning, transactions, and consistency head‑on; there is no free lunch.
Treat schemas as contracts and design for evolution: rolling upgrades, backward/forward compatibility, and safe data migrations.
Think in terms of dataflow: how data moves between OLTP stores, caches, search, analytics, batch jobs, and stream processors.

If you build or design backend systems, reading the full book is highly recommended. These notes should make it easier to orient yourself and connect the chapters to your day‑to‑day work and system‑design discussions.

quyennv.com