Last week, after posting a review of Designing Data-Intensive Applications by Martin Kleppmann, fellow InVision engineer - Ben Darfler - recommended that I take a look at I Heart Logs by Jay Kreps. After making my way through Kleppmann's magnum opus, I Heart Logs was an easy and welcome read that I got through in just three sittings. While the book doesn't go into too much technical detail, it effectively paints the picture of an infrastructure that allows all facets of your organization to tap into the vast and diverse amounts of data that your organization collects. And, at the center of this scalable, transformable, real-time processing infrastructure is the re-playable, append-only log.
Historically, when I think of the term "logs", I think of the place that I record application errors. Or, perhaps the place where Apache / nginx records all of the incoming requests and HTTP response codes. And, while these are certainly logs, the focus of "I Heart Logs" is on the type of log known as a "commit log" or a "journal". These are append-only sequences of records ordered by time.
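To make the idea of an append-only, time-ordered log a bit more concrete, here is a minimal in-memory sketch. It is not from the book and it is not how Kafka is implemented; the class name `AppendOnlyLog`, its methods, and the sample records are all hypothetical, chosen only to illustrate that records are appended at the end, addressed by offset, and replayed in order.

```python
class AppendOnlyLog:
    """A toy commit log: records are appended, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset=0):
        """Replay (offset, record) pairs in append order, starting at offset."""
        for position in range(offset, len(self._records)):
            yield position, self._records[position]


# Hypothetical change records: each one describes a change, in order.
log = AppendOnlyLog()
log.append({"user": 1, "name": "Ann"})
log.append({"user": 2, "name": "Bob"})
log.append({"user": 1, "name": "Anna"})

# Replaying the log from the beginning rebuilds the current state.
state = {}
for _, record in log.read_from(0):
    state[record["user"]] = record["name"]
```

Because nothing is ever overwritten, the same replay can be run again later, or from a later offset, and it will always see the records in the same order.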
More importantly, these logs are the source of truth for the state of data across the entire organization. And, because this state is centralized, it can then become the source of truth for a horizontally scaled, ever-evolving, multi-faceted application architecture. Since this is a wildly new concept for me as a web developer - who happens to love databases - one of the best descriptions that Kreps offers up is the view of a log-centric architecture through the lens of database indices:
Maybe if you squint a bit, you can see the whole of your organization's systems and data flows as a single very complicated distributed database. You can view all the individual query-oriented systems (Redis, SOLR, Hive tables, and so on) as just particular indexes on your data. You can view a stream processing system like Storm or Samza as just a very well-developed trigger-and-view materialization mechanism. Classical database people, I have noticed, like this view very much because it finally explains to them what on earth people are doing with all these different data systems - they are just different index types! (Kindle, Location 722)
This is fascinating! I have no practical experience with this stuff; I still have two feet very much planted in the world of monoliths. But, I have certainly considered using some sort of centralized messaging system as a means to do cache-busting across services. As it turns out, Kreps even describes this centralized log as a more comprehensive messaging system:
You can think of a log as acting as a kind of messaging system with durability guarantees and strong ordering semantics. In distributed systems, this model of communication sometimes goes by the (somewhat terrible) name of atomic broadcast. (Kindle, Location 295)
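As a rough illustration of both quotes, the sketch below treats a plain Python list as the shared log and lets two independent consumers read it at their own pace, each building a different "index" over the same ordered records. Everything here is an assumption made for the example: the `Consumer` class, the event fields, and the two index structures are hypothetical and do not come from the book or from Kafka's API.

```python
from collections import defaultdict

# Hypothetical shared log: a record's position in the list is its offset.
log = [
    {"id": 1, "title": "hello world", "author": "ann"},
    {"id": 2, "title": "hello logs", "author": "bob"},
    {"id": 1, "title": "goodbye world", "author": "ann"},
]


class Consumer:
    """Reads the log in order and remembers how far it has gotten."""

    def __init__(self, apply):
        self.offset = 0
        self.apply = apply

    def poll(self, log):
        # Process every record that has not been seen yet, in log order.
        while self.offset < len(log):
            self.apply(log[self.offset])
            self.offset += 1


# Index 1: latest version of each record by id (a key-value style view).
by_id = {}

def update_by_id(event):
    by_id[event["id"]] = event

# Index 2: ids grouped by author (a secondary-index style view).
by_author = defaultdict(set)

def update_by_author(event):
    by_author[event["author"]].add(event["id"])


key_value_view = Consumer(update_by_id)
author_view = Consumer(update_by_author)

key_value_view.poll(log)  # this consumer is caught up through offset 2
log.append({"id": 3, "title": "late arrival", "author": "bob"})
author_view.poll(log)     # starts at offset 0 and reads all four records
key_value_view.poll(log)  # picks up only the one record it has not seen
```

Neither consumer knows about the other; each only tracks its own offset. Since both read the same records in the same order, a new view can be added later and built by replaying the log from the start.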
The book discusses how logs relate to producers and consumers, stream processing, intermediary logs / streams, and application integration; but, it doesn't go into much technical detail as to how this can all be accomplished. That said, Jay Kreps - as it turns out - is the original author of the tremendously successful open-source application, Kafka; and, he has used this approach (and Kafka) to create a logging system at LinkedIn that handles hundreds of billions of messages each day. I think it's safe to say that it's a proven approach.
At InVision, we're going to start using Kafka as an event-bus for cross-system communication. After having read I Heart Logs, I am especially excited to dig into Kafka so that I can get a better sense of the future that Kreps describes. I know that an event-bus is not exactly what Kreps is discussing in the book; but, the more comfortable I get with a technology like Kafka - and the more hands-on experience that I get with distributed systems and microservices - the more I'll be able to wrap my head around the architecture he's recommending.
If this all sounds new to you - as it does to me - you should check out the book. I Heart Logs by Jay Kreps isn't a "how to" manual; but, it will help you see a different future. A future in which data is a real-time processing pipeline that any area of your organization can tap into and consume.