“Event processing” is a model that allows you to decouple your microservices by using a message queue. Rather than connecting directly to a service, which may or may not be at a known location, you broadcast and listen to queued events, through Redis, Amazon SQS, RabbitMQ, Apache Kafka, and a whole host of other sources.
The “message queue” is a highly distributed and scalable system. It should be capable of processing millions of messages so that we don’t need to worry about it being unavailable. At the other end of the queue, there will be a “consumer” listening for new messages directed to it. When it receives such a message, it processes the message and then removes it from the queue.
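The receive-process-remove cycle described above can be sketched as a simple consumer loop. This is a minimal illustration using Python's in-memory `queue.Queue` as a stand-in for a real broker such as SQS or RabbitMQ; the `handler` callback and message shapes are invented for the example:

```python
import queue

def consume(q, handler):
    """Receive messages, process them, then remove them from the queue."""
    processed = []
    while True:
        try:
            message = q.get_nowait()   # receive the next message
        except queue.Empty:
            break
        handler(message)               # process it...
        q.task_done()                  # ...then remove (acknowledge) it
        processed.append(message)
    return processed

q = queue.Queue()
q.put({"id": 1, "payload": "order.created"})
q.put({"id": 2, "payload": "order.shipped"})
result = consume(q, handler=lambda m: None)
print([m["id"] for m in result])  # [1, 2]
```

A real consumer would block waiting for new messages rather than draining the queue and exiting, but the ordering of the steps (process first, remove second) is the important part.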
Because the event processing pattern is asynchronous, failures must be handled programmatically rather than assumed away.
Event Processing With At Least Once Delivery
One of the simplest mechanisms is to request confirmation of delivery. We add the message to the queue and then wait for an acknowledgment (ACK) from the queue to let us know that the message has been received. Of course, the ACK doesn’t tell us that the message has been processed by a consumer, but knowing the queue has it should be enough for us to notify the user and proceed.
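The publish-and-wait-for-ACK flow looks something like the following sketch. The `FakeBroker` class is invented for illustration (real clients such as pika for RabbitMQ or boto3 for SQS have their own APIs); the point is that we act on the queue's receipt, not on the consumer's result:

```python
class FakeBroker:
    """Hypothetical broker client standing in for a real queue."""
    def __init__(self):
        self.messages = []

    def publish(self, message):
        self.messages.append(message)
        return {"ack": True}  # the queue acknowledges receipt

def send(broker, message):
    reply = broker.publish(message)
    if not reply.get("ack"):
        raise RuntimeError("queue did not acknowledge the message")
    # The ACK only tells us the queue durably holds the message; we can
    # now respond to the user even though no consumer has processed it.
    return True

broker = FakeBroker()
print(send(broker, {"id": 42, "event": "user.registered"}))  # True
```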
There’s always the possibility that the receiving service cannot process the message. Potential causes include a failure or bug in the receiving service, or a message that was added to the queue in a format the receiving service cannot understand. We need to deal with each of these issues independently.
It’s not uncommon for things to go wrong with distributed systems. That’s an essential part of the logic behind microservice-based software design.
Within the scenario above, if a valid message cannot be processed, one standard approach is to retry processing it, normally with a delay. It’s important to append the error to the message every time processing fails so that we build a history of what went wrong; this history can provide really useful insight during post-incident analysis and reconstruction. It also tells us how many times we have tried to process the message, because once we exceed a retry threshold we don’t want to continue retrying. Instead, we move the message onto a second queue, a dead letter queue, which we will discuss next. It’s important to set up this history generation and to build into your microservices architecture a way of handling errors so that they don’t hold up the queue or create cascading downstream impacts.
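Here is a minimal sketch of that retry-with-history loop. The message shape, the `MAX_ATTEMPTS` threshold, and the `dead_letter` list are all assumptions for the example, and a real consumer would also delay between attempts (for instance with exponential backoff), which is omitted here:

```python
MAX_ATTEMPTS = 3  # assumed retry threshold

def handle_with_retry(message, process, dead_letter):
    """Retry processing, appending each error to the message's history;
    past the threshold, move the message to the dead letter queue."""
    message.setdefault("errors", [])
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(message)
            return "processed"
        except Exception as exc:
            # Record what went wrong and on which attempt.
            message["errors"].append({"attempt": attempt, "error": str(exc)})
    dead_letter.append(message)  # exceeded threshold: stop retrying
    return "dead-lettered"

def always_fails(message):
    raise ValueError("bad payload")

dlq = []
msg = {"id": 7, "payload": "not-json"}
status = handle_with_retry(msg, process=always_fails, dead_letter=dlq)
print(status, len(msg["errors"]))  # dead-lettered 3
```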
Debugging Failures with a Dead Letter Queue
It’s common practice to remove the message from the queue once it is processed. The purpose of the dead letter queue is so that we can examine the failed messages on this queue to assist us with debugging the system. Since we can append the error details to the message body, we know what the error is and we know where the history lies should we need it.
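Because each dead-lettered message carries its accumulated error history, triaging the dead letter queue can be as simple as grouping failures by their last recorded error. This sketch assumes the message and history shapes from the retry example above:

```python
from collections import Counter

# Hypothetical dead letter queue contents, each message carrying the
# error history appended during retries.
dead_letter_queue = [
    {"id": 7, "errors": [{"attempt": 3, "error": "bad payload"}]},
    {"id": 9, "errors": [{"attempt": 3, "error": "bad payload"}]},
    {"id": 12, "errors": [{"attempt": 3, "error": "timeout"}]},
]

# Group failed messages by their most recent error to spot patterns.
causes = Counter(m["errors"][-1]["error"] for m in dead_letter_queue)
print(causes.most_common(1))  # [('bad payload', 2)]
```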
Working with Idempotent Transactions
While many message queues nowadays offer At Most Once Delivery in addition to At Least Once, the latter is still the best option when dealing with a large throughput of messages. To cope with the fact that the receiving service may get a message twice, that service needs to handle the repetition with its own logic. A common method for ensuring that a message isn’t processed twice is to log the message ID in a transactions table, recording the messages that have already been processed, and to discard any message whose ID is already recorded.
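An idempotent consumer built on that idea might look like the following sketch. A plain dictionary stands in for the transactions table (in production this would be a database table so the check survives restarts):

```python
processed = {}  # stand-in transactions table: message ID -> record

def handle_once(message, process):
    """Process a message only if its ID has not been seen before,
    making at-least-once redelivery safe."""
    msg_id = message["id"]
    if msg_id in processed:
        return "discarded"            # duplicate delivery: do nothing
    process(message)
    processed[msg_id] = {"status": "done"}  # record before the next delivery
    return "processed"

results = [
    handle_once({"id": 1}, process=lambda m: None),
    handle_once({"id": 1}, process=lambda m: None),  # redelivered duplicate
]
print(results)  # ['processed', 'discarded']
```

Note that in a real system the "check then record" step should itself be atomic (for example, a unique-key insert), or two concurrent deliveries could both pass the check.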
Working With the Ordering of Messages
One of the common issues when handling failures with retries is receiving a message out of sequence, which can write inconsistent data into your database. One way to avoid this is to leverage that same transactions table and store the message dispatch_date in addition to the ID. When the receiving service gets a message, it can not only check whether the message has already been processed but also check that it is the most recent one, and discard it if not.
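This check can be sketched as follows, again with a dictionary standing in for the transactions table. The `entity_id` and `dispatch_date` field names are assumptions for the example:

```python
from datetime import datetime

last_seen = {}  # stand-in transactions table: entity ID -> latest dispatch_date

def handle_in_order(message, apply_update):
    """Discard any message older than the latest one already applied."""
    key, dispatched = message["entity_id"], message["dispatch_date"]
    if key in last_seen and dispatched <= last_seen[key]:
        return "discarded"   # stale or duplicate: would overwrite newer data
    apply_update(message)
    last_seen[key] = dispatched
    return "processed"

newer = {"entity_id": "user-1", "dispatch_date": datetime(2023, 1, 2)}
older = {"entity_id": "user-1", "dispatch_date": datetime(2023, 1, 1)}
print(handle_in_order(newer, apply_update=lambda m: None))  # processed
print(handle_in_order(older, apply_update=lambda m: None))  # discarded
```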
Utilizing Atomic Transactions
This is the most common issue I encounter when transforming legacy systems into microservices. While storing data, a database can be atomic: all operations occur or none do, and when part of a transaction fails, we can roll back the other parts. Distributed transactions don’t give us the same guarantee. By using this pattern, we only remove the message from the queue once processing has succeeded, so that when something fails, we keep retrying. This gives us a transaction that will ideally become consistent eventually, even if its initial state is incomplete.
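The remove-only-on-success behavior can be sketched like this. The `flaky` handler simulating a transient failure is invented for the example; the key property is that a failed message stays on the queue and is eventually processed on a later pass:

```python
def process_until_done(q, process):
    """Only remove a message from the queue when processing succeeds;
    on failure the message stays queued and will be retried later."""
    remaining, done = [], []
    for message in q:
        try:
            process(message)
            done.append(message)       # success: safe to remove from queue
        except Exception:
            remaining.append(message)  # failure: leave it for the next pass
    return remaining, done

attempts = {}
def flaky(message):
    """Simulated handler that fails transiently on its first attempt."""
    attempts[message["id"]] = attempts.get(message["id"], 0) + 1
    if message["id"] == 2 and attempts[message["id"]] == 1:
        raise RuntimeError("transient failure")

q = [{"id": 1}, {"id": 2}]
q, done1 = process_until_done(q, flaky)   # message 2 fails, stays queued
q, done2 = process_until_done(q, flaky)   # retry pass drains the queue
print(len(done1), len(done2), len(q))     # 1 1 0
```

After the second pass the system has converged: every message has been processed exactly once, which is the eventual consistency the pattern aims for.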
Unfortunately, there’s no one-size-fits-all solution with messaging. We need to tailor the solution that matches the operating conditions of the service.
I’ll continue this discussion in tomorrow’s post, which dives deeply into “service discovery.”