Thinking About Tracking Requests And Errors Across Micro-Services

By Ben Nadel

Published 2015-07-23 in ColdFusion, JavaScript / DHTML, Ruby, Work — Comments (6)

Caveat: This is all just me "thinking out loud." Take that as you will.

As we start to break apart our monolithic applications, destructuring them into a collection of independently scalable micro-services, things simultaneously become both simpler and more complex. While each individual service becomes smaller and more cohesive, understanding the flow of requests across decoupled micro-services becomes far more frustrating. To deal with this, people tend to create some sort of unique identifier that can be attached to the flow of requests in order to provide a link between log entries generated across different services and machines.

While I haven't dealt with this personally - yet - having to manage this request-oriented unique identifier has me thinking about where you create boundaries within your application (an "application," in this context, being an isolated service). And, more specifically, how you express errors in your control flow - errors that will lead to log entries.

I can't help but think about several of the "Uncle" Bob Martin presentations that I've watched in which he examines the ways in which the nature and structure of a web application can overshadow and obfuscate the intent of the underlying business logic. And, I think if you have to track a unique identifier across requests, it could be very easy to fall into this trap - letting the concept of "request" bleed into every aspect of your micro-service.

Each application, therefore, has to contain two distinct parts: the part that manages requests and the part that fulfills business use-cases. This latter part needs to be a blackbox insofar as it has no concept of a "request"; its only concern is business logic. Within the constraints of this structure, errors that occur inside the "blackbox" have to propagate up to the request-oriented aspects of the application where they can be logged in association with the request and the unique identifier.

When an error happens in this kind of architecture, it cannot go gently into that good night - it has to explode; and, do so with as much information as makes sense for debugging: inputs, context, validation problems, etc. This way, when the error (or the rejected promise) comes back to the request-oriented aspect of the application, the logging component can be fed a useful amount of data.
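To make the "explode with context" idea a bit more concrete, here is a minimal JavaScript sketch. Everything in it is an assumption made up for illustration: the BusinessError class, the applyDiscount() use-case, the handleRequest() wrapper, and the log callback. It only shows the shape of the idea, in which the blackbox throws an error loaded with context and the request-oriented layer is the one that pairs it with the request identifier.

```javascript
// Hypothetical error type: carries debugging context up to the request layer.
class BusinessError extends Error {
  constructor(type, message, context = {}) {
    super(message);
    this.name = "BusinessError";
    this.type = type;       // Made-up naming, e.g. "Validation.InvalidPercent".
    this.context = context; // Inputs, validation problems, etc.
  }
}

// Blackbox code: knows nothing about requests, only about business rules.
function applyDiscount(order, percent) {
  if (percent < 0 || percent > 100) {
    throw new BusinessError(
      "Validation.InvalidPercent",
      "Discount percent out of range.",
      { orderId: order.id, percent }
    );
  }
  return { ...order, total: order.total * (1 - percent / 100) };
}

// Request-oriented code: the only place that knows the request identifier.
function handleRequest(requestId, order, percent, log) {
  try {
    return applyDiscount(order, percent);
  } catch (error) {
    // Pair the context from the blackbox with the request identifier.
    log({ requestId, type: error.type, message: error.message, context: error.context });
    throw error;
  }
}
```

The blackbox never sees requestId; it only has to be generous with the context it attaches to the error.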

Let's think about this blackbox in concrete terms for a minute. Imagine that you have a micro-service that pulls messages off of a queue, processes them, and then pushes new messages onto another queue (presumably to be consumed by another micro-service):

Step 1: Pull message off of queue.
Step 2: Process message.
Step 3: Push message on to queue.

In this workflow, which parts need to know about the unique request identifier? Certainly steps 1 and 3. Step 1 needs to know to grab the request identifier out of the incoming message and Step 3 needs to know to include the request identifier as part of the subsequent outgoing message.

Step 2 will also need to know about the request identifier in terms of error logging (for errors that bubble-up); but, Step 2 is also where we cross over into the blackbox. And, once we cross over into the blackbox, nothing there should need to know about the request. This means that the blackbox should never be responsible for pushing messages onto a queue as that queue message would inherently need to know about the request. As such, control flow will always have to return back to the request-oriented aspect of the application where errors can be logged and the workflow can move forward.
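As a rough sketch of the three steps, assuming plain arrays stand in for the queues and that each message has a made-up { requestId, payload } shape, the split might look something like this in JavaScript. The names runOnce() and processMessage() and the log callback are hypothetical.

```javascript
// Blackbox: pure business logic, no request identifier in sight.
function processMessage(payload) {
  if (typeof payload.text !== "string") {
    throw new Error("Payload is missing a text field.");
  }
  return { text: payload.text.toUpperCase() };
}

// Request-oriented shell around the blackbox.
// Returns false when the inbound queue is empty, true otherwise.
function runOnce(inbound, outbound, log) {
  // Step 1: pull the message and grab the request identifier.
  const message = inbound.shift();
  if (!message) {
    return false;
  }
  const { requestId, payload } = message;

  try {
    // Step 2: cross into the blackbox with the payload only.
    const result = processMessage(payload);
    // Step 3: re-attach the request identifier to the outgoing message.
    outbound.push({ requestId, payload: result });
  } catch (error) {
    // Errors bubble up to here, where they can be tied to the request.
    log({ requestId, error: error.message });
  }
  return true;
}
```

Because processMessage() only returns or throws, control always comes back to runOnce(), which is the one place that touches both the queues and the request identifier.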

NOTE: You can always log additional information inside the blackbox, such as timing metrics and other error data; but, those log items will not be associated with a specific request.

Right now, this is all just theory in my head as I have not yet had the opportunity to work on the teams that deal with this kind of stuff. But, I like the idea of thinking in terms of constrained responsibilities as a means to tease apart an application architecture. That said, they say that no plan of attack ever survives first contact with the enemy. So, we'll see.

Short link: https://bennadel.com/2873

Reader Comments

Mark Gregory Jul 24, 2015 at 12:22 AM

Sounds like some SOA principles are organically bubbling their way into your architectural thinking. Even your diagram reminds me of a high level Enterprise Service Bus diagram.
It's kind of a heady read at times, but "SOA Principles of Service Design" by Thomas Erl was a real eye opener for me. Made a lot of sense, only without the assumption that everything has to be SOAP based.
The use of Beans (just CFCs with synthesized getters/setters) as a physically decoupled service contract found its way into a lot of my designs after reading that book.
Very interested to see where you go with this line of thinking, fun stuff!

Andrew Siemer Jul 24, 2015 at 9:50 AM

I have read through your post a couple of times - very nice. I enjoyed the graphic quite a bit.

I guess I have an issue with the idea that the request should have any more of a life cycle than it does beyond the request itself. I don't think this is as much a microservices issue as it is a synch vs. asynch issue. When I think about a user placing an order somewhere like Amazon the user clicks submit on the order and is told thank you. At some undisclosed time after submitting the order, the order is charged, perhaps recharged, etc. If something happens an email is sent. If the order is processed an email is sent. Once in this world you have a correlation ID that is passed around to say that this work belongs to this user around this order. This is for more than just managing exceptions that might happen in the backend.

In the world where a user has performed a task, and some downstream process has failed, in the RPC world of distributed transactions all of that *should* be able to be captured and rolled back up the stack for the user to see/consume. However, there are few cases where this makes sense.

In my world (more .NET, less CF) I tend to use APM products like New Relic that allow me to track the notion of a request life cycle from JavaScript back to the actual machines that processed the request back to the database that had some hand in the process.

Fun thoughts! Thank you.

Ben Nadel Jul 27, 2015 at 8:10 AM

@Mark,

At work, we are trying to move into more of a distributed service architecture; but, to a good degree, we are learning as we go. So, this stuff is really interesting to think about. Part of what got me thinking about this was the user's IP Address. Imagine that there is some function that executes some sort of rate-limiting logic. For example, a "password reset" function that blocks the given IP after a certain number of reset requests. At some point, there has to be a method that's like:

.sendPasswordReset() throws "Forbidden.RateLimit"

Now, in order for this to perform the rate-limiting, it has to have access to the user's IP address. But, we don't want it to just magically pull that value out of the air, which is entirely possible in ColdFusion with the CGI object (or getHttpRequestData() for X-Forwarded-For). So, we have to pass the value into the method:

.sendPasswordReset( clientIpAddress, ... )

But where does it come from in the calling context? At some point, you have to cross the barrier of "Web Application", which has information about the incoming request, into "Business Logic", which doesn't necessarily know about the request, but still needs to perform rate limiting.
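One way to picture that barrier, sketched in JavaScript rather than ColdFusion: the PasswordResetService class, its in-memory counter, the handlePasswordResetRequest() function, and the shape of the request object are all assumptions for illustration. A real rate limiter would need expiry and shared storage, and a real application has to decide how far it trusts the X-Forwarded-For header, since the client can set it.

```javascript
// Hypothetical service: business logic that rate-limits by IP address.
class PasswordResetService {
  constructor(maxAttempts) {
    this.maxAttempts = maxAttempts;
    this.attemptsByIp = new Map(); // In-memory only; no expiry in this sketch.
  }

  // The IP address is just an argument; there is no request object in here.
  sendPasswordReset(clientIpAddress, email) {
    const attempts = (this.attemptsByIp.get(clientIpAddress) || 0) + 1;
    this.attemptsByIp.set(clientIpAddress, attempts);

    if (attempts > this.maxAttempts) {
      const error = new Error("Too many password reset requests.");
      error.type = "Forbidden.RateLimit";
      throw error;
    }
    return { sentTo: email };
  }
}

// Web-application boundary: the only place that reads the incoming request.
// Assumes a plain object with headers, remoteAddress, and body properties.
function handlePasswordResetRequest(request, service) {
  const forwarded = request.headers["x-forwarded-for"];
  // The header may hold a comma-separated list; the first entry is the client.
  const clientIpAddress = forwarded
    ? forwarded.split(",")[0].trim()
    : request.remoteAddress;

  return service.sendPasswordReset(clientIpAddress, request.body.email);
}
```

The service can now be exercised with nothing but strings, while the knowledge of headers and sockets stays on the web-application side of the barrier.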

Thinking about this barrier has been helping me think about how parts of the application should be quarantined. And, has really helped to create "code smells" where you suddenly realize, "Whoa, this code should not be able to make those assumptions!"

"SOA Principles of Service Design" sounds interesting, I'll try to give it a look. I'm currently in the middle of "Release It!: Design and Deploy Production-Ready Software", which I had to stop for a while as it was giving me too much anxiety about work :D

Ben Nadel Jul 27, 2015 at 8:20 AM

@Andrew,

You bring up a great point. I was definitely thinking about this in terms of "Requests". But, you are right, there is more to it than that. Stepping back, and trying to get at the more accurate picture, I'd say it's more about "Workflow" than a request. At some point, there has to be coordination of workflow that may or may not extend beyond the lifecycle of the current request.

For me, it's this "workflow orchestration" that straddles the border of the blackbox. So, going back to the message queue for a second, you could theoretically have a few different approaches to orchestration (pseudo code):

orchestrate():
- message = getNextMessage()
- processMessage( message ) // Eventually pushes onto another queue.

Or, you might have something like this:

orchestrate():
- message = getNextMessage()
- result = processMessage( message )
- pushOntoNextQueue( result )

And this is where I've been trying to think pretty hard - what is the responsibility of the "processMessage()" method? Should it be pushing the result onto the next message queue? Or, should it simply be returning a result and letting the orchestration manage the movement to the next phase of the workflow?

This latter option "feels" cleaner to me as it decouples the processMessage() implementation from the workflow implementation. This would make it the responsibility of the orchestration service to manage the "requestID" or correlation ID. And, the processMessage() method would contain nothing but message processing details.

Andrew Siemer Jul 28, 2015 at 2:45 PM

Speaking to the orchestration concept. Generally you will be using a "bus" not just a queue. There are a few ways to think about this. Greg Young would tell you to think about your process in terms of "remove the technology from the conversation - think if I had a piece of paper that tracked the problem".

So in that case, in a distributed system you might have a simple process. When process A pulls a message, it can simply pass it to a queue that we know process B is watching. Then it can go to C. That can be a very easily modeled way of representing a business problem. And in that case that is likely the most efficient way of building the app...initially. The problem with this is that each process knows that something is waiting on the other side. This tight coupling can cause an app to become brittle over time.

Especially when you are modeling a checkout process. Say a customer comes up to you at McDonald's and orders a hamburger. You could create an order that has the hamburger on it. You hand that paper order to the burger guy. They make the burger. When the burger is done, they put it in a bag, staple the order to the bag, and pass it back to you. You then bill the customer and pass the burger to the customer. Simple?

But as soon as we want to do anything more, that system starts to break down. Let's say the order is for a #1...a drink, fries, and a burger. The model still works. Order taker makes order, passes to burger guy, who passes to fry guy, who passes to drink guy, who passes to the order taker, who passes to the customer. This works. But while the system is efficient and simple to model and build, it introduces a clog in the system. Because each process has to pluck from a queue and move to the next queue, steps that could be happening in parallel are forced to run one after another. Also, a break in the chain (out of fries) has to have compensating logic built in. But each process would have to have that compensating logic built in. Complexity has now gone up.

In this case you would more likely want a saga where an additional process is monitoring where the order is in the process. This is also easy to model but requires some form of a framework or ESB to use to make it happen.

Or you might try a routing slip concept where the order is put in. Events are sent to each station. But as each step is completed, a "routing slip" is marked to show that the step has been completed. Each time the slip is updated, the order is checked. This is closer to the McDonald's model. The last person to bring an item up for the order checks to see if the order is complete...then calls the customer to get their food.
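A tiny JavaScript sketch of that routing slip idea, strictly as described above and not taken from any particular framework or ESB. The createRoutingSlip() and markStepComplete() functions and the station names are made up for illustration.

```javascript
// Hypothetical routing slip: tracks which stations still owe an item.
function createRoutingSlip(orderId, stations) {
  return { orderId, pending: new Set(stations), completed: [] };
}

// Called by a station when its item is done.
// Returns true when this was the last outstanding step for the order.
function markStepComplete(slip, station) {
  if (!slip.pending.delete(station)) {
    throw new Error(`Station "${station}" is not pending on order ${slip.orderId}.`);
  }
  slip.completed.push(station);
  return slip.pending.size === 0;
}
```

The stations can finish in any order; whichever one happens to call markStepComplete() last gets back true and knows to call the customer.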

For the password logic issue - this likely isn't a distributed processing problem. This belongs in-process, as it is important for it to be real time. I wouldn't distribute the process unless you had crazy perf reasons to do this. You could do it...but yes, the IP would shuttle along with the initial request. That information would be bottled up. The calling client would hang around to get the result via perhaps a websocket or polling process. And once the result was completed, the action to block or move forward would be made. But you are adding a lot of complexity to something that should just be done as efficiently as possible on the edge of your system.

I too love these types of conversations. Happy to chat about this topic all day!

Ben Nadel Aug 5, 2015 at 7:48 AM

@Andrew,

I really like the sound of a Saga (though I am not exactly sure what the implementation would be). The reason this connects with me is that this is often how I think about code within a monolithic application. When a user initiates an action, I think about an "orchestration" component which works to ensure that all the aspects of a particular workflow take place (i.e., changing the domain, sending out emails, logging data, etc.). As such, this sounds something like the "Saga" you describe; though, maybe I am completely misunderstanding.

That said, could the Saga still push things onto different queues? I ask because part of the power of a message queue (the way I understand it) is that it allows different parts of the system to be independently scalable to handle different kinds of processing workloads. So, even if you have some sort of centralized logic managing the overall workflow, you might still want it to push/pull from queues to make sure that sub-tasks can be scaled as needed?

Or am I just getting it all mixed up in my head?
