Emitting Application Events In The Context Of Bulk Operations?

By Ben Nadel

Published 2022-03-11 in Work — Comments (3)

During the lifetime of an application, it's common to emit events when the system is mutated in some way. Propagation of these events might be done through a mechanism like Kafka streams, Redis Pub/Sub, Pusher, or even just an in-memory queue. These events help keep different parts of the system in sync, allowing materialized views to be updated and non-materialized data to be re-fetched. When it comes to mutating a single data-point within the system, I can wrap my head around emitting a single, corresponding event. However, I'm never sure what to do in the context of bulk operations, where a single request may end up changing a multitude of data-points within the application.

Consider an Amazon AWS S3 Bucket as a thought experiment. From what I have read, you can configure an S3 Bucket to emit an event when an Object in the Bucket is deleted. Over time, an S3 Bucket may accumulate millions, maybe even billions of Objects.

Now consider deleting this S3 Bucket. Let that represent our "bulk operation" in this thought experiment. What events should be emitted from the system? Clearly, we should emit some sort of BucketDeleted event because that's "the thing" that happened. But, what about the collateral operations: the fact that inside that Bucket were Objects; and, those Objects are now gone?

Should an ObjectDeleted event be emitted for every Object in the Bucket?

NOTE: I have no idea what AWS actually does in this case - this is just a thought experiment.

It seems unreasonable that deleting a Bucket should suddenly spawn millions, maybe even billions of events. Such a deluge of events could easily overwhelm and cripple a system.

There's also a semantic question to consider: is a "Delete Object" operation really the same thing in the context of a bulk operation? Meaning, should the system differentiate between these two types of events:

ObjectDeleted
ObjectDeletedInBulkOperation

At the very least, having two flavors of "delete" event would allow remote systems to change the way they react to those events.
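To make the idea of two "flavors" concrete, here is a minimal sketch in Python. Only the two event names come from the discussion above; the `operation_id` field, the `IndexSubscriber` class, and its behavior are hypothetical, made up purely to illustrate a remote system reacting differently to each event type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectDeleted:
    bucket: str
    key: str


@dataclass(frozen=True)
class ObjectDeletedInBulkOperation:
    bucket: str
    key: str
    operation_id: str  # hypothetical: ties the event back to the bulk request


class IndexSubscriber:
    """Hypothetical remote system that keeps its own index of objects."""

    def __init__(self):
        self.removed_now = []  # keys handled immediately
        self.deferred = {}     # operation_id -> keys saved for one batched pass

    def handle(self, event):
        if isinstance(event, ObjectDeleted):
            # A one-off delete: react right away.
            self.removed_now.append(event.key)
        elif isinstance(event, ObjectDeletedInBulkOperation):
            # Part of a bulk operation: collect now, process later in one pass.
            self.deferred.setdefault(event.operation_id, []).append(event.key)


subscriber = IndexSubscriber()
subscriber.handle(ObjectDeleted("photos", "a.png"))
subscriber.handle(ObjectDeletedInBulkOperation("photos", "b.png", "op-1"))
subscriber.handle(ObjectDeletedInBulkOperation("photos", "c.png", "op-1"))
```

Note that this only changes how a consumer reacts; it does nothing to reduce the number of events being emitted.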

Again, knowing nothing about how AWS S3 actually manages events, I like that S3 represents absurd scale. Sometimes, it's easier to get at the truth when you can't wrap your head around the size of something. And, at least for me, when I think about the relationship between AWS Buckets and Objects, it seems crazy to even consider emitting Object events when a Bucket is deleted.

And, to create a generalization from this thought experiment, it seems crazy to emit "child" deleted events when a "parent container" is deleted. It just doesn't feel like it scales; and, it just doesn't feel like it is semantically correct.
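As a rough sketch of that generalization, the Python below emits exactly one parent-level event when a container is deleted and leaves the cleanup of "child" data to the consumer. Everything here is an assumption for illustration: the in-memory `EventBus` stands in for whatever transport is actually used, and `BucketService` and `ThumbnailCache` are hypothetical names. The sketch also assumes the consumer stores the parent's name alongside each child record, which is what makes a single event sufficient.

```python
class EventBus:
    """In-memory stand-in for Kafka, Redis Pub/Sub, and so on."""

    def __init__(self):
        self.subscribers = []
        self.emitted = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def emit(self, event):
        self.emitted.append(event)
        for handler in self.subscribers:
            handler(event)


class BucketService:
    def __init__(self, bus):
        self.bus = bus
        self.buckets = {}  # bucket name -> set of object keys

    def delete_bucket(self, name):
        # The children disappear with the parent; only the parent event is emitted.
        self.buckets.pop(name, None)
        self.bus.emit({"type": "BucketDeleted", "bucket": name})


class ThumbnailCache:
    """Hypothetical consumer that keys its entries by (bucket, key)."""

    def __init__(self):
        self.entries = {}

    def on_event(self, event):
        if event["type"] == "BucketDeleted":
            # One event is enough because each entry records its bucket.
            self.entries = {
                k: v for k, v in self.entries.items() if k[0] != event["bucket"]
            }


bus = EventBus()
cache = ThumbnailCache()
bus.subscribe(cache.on_event)

service = BucketService(bus)
service.buckets["photos"] = {"a.png", "b.png"}
cache.entries = {
    ("photos", "a.png"): b"thumb-a",
    ("photos", "b.png"): b"thumb-b",
    ("docs", "x.pdf"): b"thumb-x",
}

service.delete_bucket("photos")
```

After the call, the bus has seen a single event regardless of how many objects the bucket held, and the cache has dropped every entry that belonged to it.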

Considering Database Replication

As I was writing this, it occurred to me that database replication might be fertile ground for further consideration. If I have a read-replica database that is being synchronized with a primary database, what happens when I run DROP TABLE on the primary? Does that get replicated as a DROP TABLE operation in the read-replica? Or, does the replication process have to run a DELETE FROM operation for every row in the dropped table?

I don't really know much of anything about database replication; but, it seems absurd to manage the replication through anything other than the single DROP TABLE "event".
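The intuition can be illustrated with a toy model in Python. This is not how any real database implements replication; the `Database` and `Primary` classes and the log entry format are invented for the sketch. It only shows that a replica replaying a log can reach the same end state from one `DROP_TABLE` entry, without a delete entry per row.

```python
class Database:
    def __init__(self):
        self.tables = {}  # table name -> {row id -> row}

    def apply(self, entry):
        op = entry["op"]
        if op == "INSERT":
            self.tables.setdefault(entry["table"], {})[entry["id"]] = entry["row"]
        elif op == "DELETE":
            self.tables[entry["table"]].pop(entry["id"], None)
        elif op == "DROP_TABLE":
            # One log entry removes the table and every row inside it.
            self.tables.pop(entry["table"], None)


class Primary(Database):
    def __init__(self):
        super().__init__()
        self.log = []  # ordered list of operations for replicas to replay

    def execute(self, entry):
        self.apply(entry)
        self.log.append(entry)


primary = Primary()
for i in range(1000):
    primary.execute({"op": "INSERT", "table": "users", "id": i, "row": {"n": i}})
primary.execute({"op": "DROP_TABLE", "table": "users"})

# The replica catches up by replaying the primary's log in order.
replica = Database()
for entry in primary.log:
    replica.apply(entry)

drop_entries = [e for e in primary.log if e["op"] == "DROP_TABLE"]
delete_entries = [e for e in primary.log if e["op"] == "DELETE"]
```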

Short link: https://bennadel.com/4226

Reader Comments

JesterXL Mar 11, 2022 at 3:26 PM

4 Comments

This is nuanced, sure, but typically a team will create a bucket for their app. This means it's not shared with others. In network drive days at college, hundreds of students would, intentionally, share the same drive to transfer files amongst the many computers. That was desirable back then. For projects, though, you typically don't want to share for a variety of reasons:

we're wiring up events from that bucket like you mentioned
we may need to delete and re-create the bucket
we may have different permissions, rules, and folder structures for our project than others.
we don't want to get events for things we don't care about.

This significantly reduces the events you could get from thousands to maybe 2, depending upon who/what is uploading/deleting files from the bucket.

Additionally, yes, the bucket could emit thousands of events, but the key is "where"? If it's to AWS Lambda, you should configure the Lambda that's listening to that deleted event to have a concurrency of about 1,200 - 1,500. That way, if 1,000 are deleted at once, you may only need about 800 Lambdas running at the same time, as some of the instances could be re-used, but there's no guarantee. If it's an ECS/EKS cluster, as long as you have enough containers/pods to handle 1,000 requests at the same time, cool.

However, if this happens over time, meaning, 1,000 objects are deleted per day over time, then you need wayyyy less resources. If you have no idea, the better thing to do is an SQS queue. That way, the events can come in as fast as they need to, but those who are supposed to deal with the events can take their time. 1 at a time? Cool. 10 at a time? Cool. Wire that to SNS to fan out or even Kinesis/Kafka so you can deal with as many at a time? Cool.
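The buffering idea can be sketched without any AWS dependency. The Python below uses the standard library's `queue.Queue` as an in-memory stand-in for SQS (an assumption made for the example; it is not the SQS API), and `drain_batch` is a hypothetical helper. The producer dumps 1,000 events at once, while the consumer pulls them off 10 at a time at its own pace.

```python
import queue

events = queue.Queue()

# Producer: a bulk delete pushes 1,000 events as fast as it likes.
for i in range(1000):
    events.put({"type": "ObjectDeleted", "key": f"file-{i}.txt"})


def drain_batch(q, max_batch):
    """Pull up to max_batch events off the queue without blocking."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


# Consumer: works through the backlog in small batches.
processed = 0
batches = 0
while True:
    batch = drain_batch(events, max_batch=10)
    if not batch:
        break
    processed += len(batch)
    batches += 1
```

Changing `max_batch` changes how hard the consumer works at any moment, but not how quickly the producer is allowed to emit.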

There are other options, too. You can do basic filtering to only care about certain file prefixes, suffixes, filenames, or folder paths so you don't get all events and then have to immediately ignore it.
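As an illustration of that kind of filtering, here is a small Python sketch. It is not the S3 notification configuration itself; `make_key_filter` is a hypothetical helper that applies the same prefix/suffix idea to object keys, and the sample keys are made up.

```python
def make_key_filter(prefix="", suffix=""):
    """Build a predicate that accepts keys with the given prefix and suffix."""

    def matches(key):
        return key.startswith(prefix) and key.endswith(suffix)

    return matches


# Only care about JPEGs under the uploads/ "folder".
wants = make_key_filter(prefix="uploads/", suffix=".jpg")

keys = [
    "uploads/cat.jpg",
    "uploads/notes.txt",
    "logs/cat.jpg",
    "uploads/2022/dog.jpg",
]
delivered = [key for key in keys if wants(key)]
```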

Lastly, for Cloud Operations, this is the opposite. They sometimes have to deal with what you mentioned where they'll get thousands of events because they're the ones watching it. There are options there, though, such as using CloudTrail instead which has all the events... like ALL the events. Since they're streamed and stored, you can take your time downstream processing those events, checking for bots or fraud, and other security things, regardless of whether it's 10 or 10,000 events.

Ben Nadel Mar 11, 2022 at 5:26 PM

16,058 Comments

@JesterXL,

So, the S3 bucket was just intended to be an example of a parent-child relationship in which a bulk operation could be executed. But, to your point about not sharing a bucket with other projects, I 100% agree with that. A bucket, like a database, should not be shared across services as this creates insane coupling that only gets worse over time.

And, also to your point, I agree that internally to a project, you can have whatever events make your life easier; and, make your application more effective. If you want a "bulk event" internally, do it. If you want a "granular event" internally, do it. Whatever gets the job done.

Where I start to pump the brakes on this is when the bulk vs. granular event is used to communicate across services. Sometimes, I think we start to emit events from one service simply because another service needs them as a "gap fix" for missing data on their end.

What I mean is, using the S3 bucket as an example, let's say there was an external project that was storing data about the S3 Objects. But, let's imagine that this external service didn't store the name of the bucket that those objects were in. So, one project goes to delete the bucket and emits a "Bucket Deleted" event with the bucket name, and the external service is like, "Whoa, I don't know what's in the bucket, please give me all the granular events so that I can update my local data".... and that's where I'm like whoa! Don't make your problem, my problem.

So, I guess part of my thinking here is that I don't want to use events to make up for missing data. That becomes a maintenance nightmare.

JesterXL Mar 12, 2022 at 5:29 PM

4 Comments

Yeah, I agree about the bulk. We were using Kafka at one of my jobs, and it'd stream all day about customer banking transactions being created, modified, and deleted. Similar to a Git log, it'd process the transactions in order. They'd come in bursts of 1,000 because the upstream was a Mainframe batch job, so it couldn't trickle all day sadly.

However, Kafka has a really cool backpressure strategy like SQS where you can put as many events as you want, as fast as you want, and the downstream can take their time processing the messages. If something crashes, you can just resume where you left off. In that case, we needed to know about all data, including the hundreds of deletes if some other parent transaction was deleted.
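The "resume where you left off" behavior can be sketched with an append-only log and a committed offset. The Python below is an in-memory toy, not the Kafka client API; the `Log` and `Consumer` classes and the `crash_at` parameter are hypothetical. It also keeps the offset in memory, whereas a real system would have to store it somewhere that survives the crash.

```python
class Log:
    """Append-only in-memory log, standing in for a single topic partition."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        # Return (offset, record) pairs starting at the given offset.
        return list(enumerate(self.records))[offset:]


class Consumer:
    def __init__(self, log):
        self.log = log
        self.committed = 0  # offset of the next record to process
        self.seen = []

    def poll(self, crash_at=None):
        for offset, record in self.log.read_from(self.committed):
            if offset == crash_at:
                raise RuntimeError("simulated crash")
            self.seen.append(record)
            # Commit only after the record has been processed.
            self.committed = offset + 1


log = Log()
for i in range(5):
    log.append(f"txn-{i}")

consumer = Consumer(log)
try:
    consumer.poll(crash_at=3)  # processes offsets 0-2, then "crashes"
except RuntimeError:
    pass
offset_after_crash = consumer.committed

consumer.poll()  # resumes from the committed offset
```

Because the offset is committed after each record is handled, the second `poll()` picks up at the record that was in flight during the crash rather than starting over.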

But I hear you, setting up a queue to protect your downstream service requires infra setup, monitoring it, some kind of redrive policy (we missed 5 events because we crashed, let's go reprocess those...), it's a tradeoff for sure vs. "Dude, please don't send me 1,000 events all at once because you deleted something".
