Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, And Maintainable Systems By Martin Kleppmann

By Ben Nadel

Published 2017-10-02 in Books, MongoDB, Redis, SQL, Work

A couple of years ago, I wrote CFRedLock - a ColdFusion implementation of Redlock, which is a distributed-locking algorithm designed by the team behind Redis. Soon after that, my teammate, David Bainbridge, pointed me to an article that completely invalidated the value proposition of Redlock. This was my first introduction to Martin Kleppmann who, at the time, was doing research for this book, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, And Maintainable Systems. Now, fast forward two years and I just finished reading Kleppmann's book over the weekend. And let me say, the experience of consuming this magnum opus was just as intense as the applications that it aims to elucidate.

Designing Data-Intensive Applications opens with a quote from Alan Kay on the culture of programming:

Computing is pop culture.... Pop culture holds a disdain for history. Pop culture is all about identity and feeling like you're participating. It has nothing to do with cooperation, the past or the future - it's living in the present. I think the same is true of most people who write code for money. They have no idea where [their culture came from]. (Alan Kay, in interview with Dr Dobb's Journal, 2012)

This quote did a particularly wonderful job of framing the book because it set me up to fully see the superficiality of my own beliefs. I very much identify as a "database person." I find writing SQL exhilarating. I randomly Tweet about how much joy Redis brings into my life. I refer to myself as an "honorary member of the data team" at work because, in the early days, I was the data team.

Suffice it to say, I love data. But, this book was essentially an 800-page (iBooks) "You know nothing, Jon Snow," moment for me.

I think it took me about four months to get through this book. I read it in bursts, sometimes going weeks without reading it at all. It reminded me very much of High Performance MySQL: Optimization, Backups, and Replication, in that it was a tome of historical knowledge, technical detail, theory, and practical examples. And, also in that it is not for everyone. Designing Data-Intensive Applications is not a casual read. You really have to be heavily involved in data management for this to be a practical book.

As someone who is [now apparently] only lightly involved in data management, the most glaring take-away from this book is that data management is very hard to get right. So much can go wrong! So much! And, as an application scales, partitions, and shards, things only get more complicated and more brittle. Kleppmann does a wonderful job of illustrating all of the intricacies, both figuratively and literally, across a dizzying number of technologies. But, he left me with the feeling that anything beyond a single machine talking to a single master is bound to be fraught with peril.

It's an intimidating book. But, it's also an empowering book. While it has certainly clarified my place in the universe, it has also made me feel much more comfortable with the idea of participating in data-architecture conversations at work. It has also keyed me into a wide array of problem solving techniques. Now, when I have to think about and discuss data architecture with my team, I can [hopefully] come at it from a more worldly point of view as opposed to just leaning on what I know already (where's my SQL at?).

Since so much of the information in this book was new to me, I stopped highlighting passages and just tried to absorb everything I could. But, there are a few passages that I wanted to share. In the beginning, Kleppmann reminds us that there is a time and a place for everything:

An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare - the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive. In an early-stage startup or an unproven product it's usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load. (Page 44, iBooks)

With all the insights that this book offers, it can be very alluring to try and build the best possible solution right out of the gate. I mean, why have a single read/write database when you could have a master database, event streams, message queues, and materialized read-only views? Well, because you don't need them - at least not at first. Kleppmann provides us with a way to think deeply about data; but, he also tempers that information with reminders to think deeply about our users first; and then, to evolve our data management solution as the application's needs evolve.

Kleppmann also points out that, from a business point-of-view, constraints are generally not as constrained as we make them out to be:

In many business contexts, it is actually acceptable to temporarily violate a constraint and fix it up later by apologizing. The cost of apology (in terms of money or reputation) varies, but it is often quite low; you can't unsend an email, but you can send a follow-up email with a correction. If you accidentally charge a credit card twice, you can refund one of the charges and the cost to you is just the processing fee and perhaps a customer complaint. Once money has been paid out of an ATM, you can't directly get it back, although in principle you can send debt collectors to recover the money if the account was overdrawn and the customer won't pay it back.

Whether the cost of apology is acceptable is a business decision. If it is acceptable, the traditional model of checking all constraints before even writing data is unnecessarily restrictive, and a linearizable constraint is not needed. It may well be a reasonable choice to go ahead and write optimistically, and to check the constraint after the fact. You can still ensure that the validation occurs before doing things that would be expensive to recover from, but that doesn't imply you must do the validation before you even write data. (Page 778, iBooks)
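To make the "write optimistically, check the constraint after the fact" idea a bit more concrete, here is a minimal Python sketch. It is not from the book: the `Ledger` class, its `charge` and `reconcile` methods, and the "one charge per order" constraint are all hypothetical names chosen for illustration, and the in-memory list stands in for whatever datastore a real system would use.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical in-memory ledger; every name here is illustrative only."""

    entries: list = field(default_factory=list)

    def charge(self, order_id: str, amount: int) -> None:
        # Optimistic write: record the charge without checking any constraint.
        self.entries.append((order_id, amount))

    def reconcile(self) -> list:
        """Check "one charge per order" after the fact and apologize.

        Appends a compensating refund (a negative entry) for each duplicate
        charge that has not been refunded yet, and returns those refunds.
        """
        charges = defaultdict(list)
        refunded = defaultdict(int)
        for order_id, amount in self.entries:
            if amount > 0:
                charges[order_id].append(amount)
            else:
                refunded[order_id] += 1

        refunds = []
        for order_id, amounts in charges.items():
            duplicates = amounts[1:]  # every charge after the first violates the constraint
            # Skip duplicates that an earlier reconcile run already refunded.
            for amount in duplicates[refunded[order_id]:]:
                refund = (order_id, -amount)
                self.entries.append(refund)
                refunds.append(refund)
        return refunds


ledger = Ledger()
ledger.charge("order-1", 500)
ledger.charge("order-1", 500)  # accidental double charge, accepted optimistically
ledger.charge("order-2", 300)
refunds = ledger.reconcile()
```

In this sketch, the duplicate charge is accepted at write time and corrected later with a refund, which is the "apology." Running `reconcile` a second time issues no further refunds, since the earlier compensation is already recorded in the ledger.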

Ultimately, I think both of these passages dove-tail with a quote that resonated so deeply with me that I felt compelled to stop reading and share it on Twitter. This isn't actually from Kleppmann directly, but rather, a quote by John Gall that introduced one of Kleppmann's chapters:

With all of this information on data processing, it's natural to feel a sense of unbridled enthusiasm. It's natural to want to build the ultimate solution. But, it's important to build a proven system first. Then, once you know you have a proven, working, meaningful, value-add system, you can start to evolve it such that it can cope with emerging user and business requirements.

Designing Data-Intensive Applications by Martin Kleppmann is an intense book. It's not for everyone. But, if you love data and data management, this book will be fascinating. For me, a lot of the information in this book represented first exposure. Which means that, while I can't necessarily apply everything I just read, I do feel like I have a much better understanding of the data-technology landscape. And, that will hopefully have a positive effect on the way I think about data going forward.

Short link: https://bennadel.com/3344
