Microservices Patterns Part III: Circuit-Breaking

ʂʍɒρƞįł Ҟưȴķɒʁʉɨ (coolsvap)

- Last Updated: December 2, 2024

This is the third installment in my series on microservices patterns. Part one discussed how event processing allows you to decouple microservices by using a message queue. Part two explored service discovery tools that help you to handle shifting clusters of services. Before we delve deeper into the circuit-breaking microservices pattern, the subject of the final entry in this series (for now), let's first explore a couple of patterns that will help us understand circuit-breaking patterns better.

Microservices Pattern: Timeouts

A timeout is an incredibly useful pattern when communicating with other services or data stores. The idea is that you set a limit on a service's response time. If you don't receive a response within the allowable time, you can fall back on business logic you've written to deal with this failure, such as retrying or sending a failure message back to the upstream service. A timeout could be the only way to detect a fault with a downstream service. However, a missing reply doesn't prove that the service never received or processed the message, or that the service doesn't exist. The key feature of a timeout is to fail fast and to notify the caller of this failure.

There are many reasons why this is a good practice, not only from the perspective of returning early to the client and not keeping them waiting indefinitely. It's also helpful from the point of view of load and capacity. Timeouts are an effective hygiene factor in large distributed systems, in which many small instances of a service are often clustered to achieve high throughput and redundancy. If one of these instances is malfunctioning and you connect to it, this can block an entirely functional service. The correct approach is to wait for a response for a set time; if there's no response in this period, cancel the call and try the next service in the list. There's no simple answer to the question of what duration your timeouts should be set to. We also need to consider the different types of timeouts that can occur in a network request. You may have the following timeouts:

Connection Timeout — The time it takes to open a network connection to the server.
Request Timeout — The time it takes for a server to process a request.

The request timeout is almost always going to be the longer of the two. I recommend defining the timeout in the configuration of the service. While you might initially set it to an arbitrary value of, say, 10 seconds, you can adjust it once the system has been running in production and you have a decent data set of transaction times to look at.
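As a rough illustration of waiting a set time and then moving on to the next instance, here is a minimal Python sketch. The helper names call_with_timeout and call_first_responsive, and the two stand-in instances, are hypothetical and not part of any library. The sketch applies a single overall time limit per call rather than separate connection and request timeouts, and it only stops waiting for a slow call; it doesn't interrupt the work already running in the worker thread.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_s):
    """Run func() in a worker thread and stop waiting after timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Fail fast: notify the caller instead of waiting indefinitely.
        raise TimeoutError(f"no response within {timeout_s} seconds") from None
    finally:
        # Do not block on the slow call; the worker thread is abandoned, not killed.
        executor.shutdown(wait=False)


def call_first_responsive(instances, timeout_s):
    """Try each instance in order and return the first response that arrives in time."""
    for instance in instances:
        try:
            return call_with_timeout(instance, timeout_s)
        except TimeoutError:
            continue  # this instance is too slow; try the next one in the list
    raise TimeoutError("no instance responded in time")


# Hypothetical stand-ins for two instances of the same service.
def slow_instance():
    time.sleep(0.5)  # simulates a malfunctioning instance
    return "slow"


def healthy_instance():
    return "ok"


result = call_first_responsive([slow_instance, healthy_instance], timeout_s=0.1)
```

In a real service the timeout_s value would come from configuration rather than being hard-coded, so it can be tuned against production transaction times.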

Microservices Pattern: Back-off

Typically, once a connection has failed, you don't want to retry immediately, because doing so can flood the network or the server with requests. To avoid this, it's necessary to build a back-off approach into your retry strategy. A back-off algorithm waits for a set period before retrying after the first failure. The wait increases with each subsequent failure, up to a set maximum duration.

Using this strategy within a client-called API might not be desirable as it contravenes the requirement to fail fast. However, if we have a worker process that's only processing a queue of messages, this could be exactly the right strategy to add a little protection to your system.
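The back-off idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names backoff_delays and retry_with_backoff are hypothetical, and doubling the wait after each failure is one common choice rather than something this pattern prescribes. The sleep function is passed in as a parameter so the behavior can be checked without actually waiting.

```python
import time


def backoff_delays(initial_s, maximum_s, retries, factor=2.0):
    """Return the wait before each retry, growing by factor and capped at maximum_s."""
    delays = []
    delay = initial_s
    for _ in range(retries):
        delays.append(min(delay, maximum_s))
        delay *= factor  # assumption: exponential growth between retries
    return delays


def retry_with_backoff(func, initial_s=1.0, maximum_s=30.0, retries=5, sleep=time.sleep):
    """Call func(); after each failure, wait a growing interval before retrying."""
    for delay in backoff_delays(initial_s, maximum_s, retries):
        try:
            return func()
        except Exception:
            sleep(delay)  # wait before the next attempt
    return func()  # final attempt; any exception propagates to the caller
```

With an initial wait of 1 second and a maximum of 30 seconds, backoff_delays(1.0, 30.0, 6) yields waits of 1, 2, 4, 8, 16, and 30 seconds. Production implementations often add random jitter to these waits so that many clients don't retry in lockstep; that is left out here to keep the sketch short.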

Microservices Pattern: Circuit-breaking

We have looked at timeouts and back-offs, which help protect our systems from cascading failure in the event of an outage. Now it's time to introduce another pattern that complements this duo. Circuit-breaking is all about failing fast. It's a way to automatically downgrade functionality when the system is under stress.

Let's consider an example of a front-end web application that depends on a downstream service to provide recommendations for services available to a user. Because this call is synchronous with the main page load, the web server won't return the page until it has received the recommendations. Suppose you have designed for failure and introduced a timeout of five seconds for this call. However, since there is an issue with the recommendations system, a call that would ordinarily take 20 ms is now taking 5,000 ms to fail.

Every user who looks at services is waiting five seconds longer than usual; your application isn't processing requests and releasing resources as quickly as normal, and its capacity is significantly reduced. Additionally, the number of concurrent connections to the main website has increased due to the length of time it's taking to process a single page request. This adds load to the front-end, which is starting to slow down. If the recommendations service doesn't start responding, then the whole site is headed for an outage.

There is a simple solution to this: stop attempting to call the recommendations service, return the website to normal operating speeds, and slightly downgrade the functionality of the services page. This has three effects:

It restores the browsing experience for other users on the site.
It slightly downgrades the experience in one area.
It directly affects the business, which is why you must have a conversation with your stakeholders before you implement a circuit-breaking pattern.

Let’s assume recommendations increase conversion by 1 percent; however, slow page loads reduce it by 90 percent. Isn’t it better to downgrade by 1 percent instead of 90 percent? This example is clear-cut, but what if the downstream service were a stock-checking system? Should you accept an order if there is a chance you do not have the stock to fulfill it?

How Does Circuit Breaking Work?

Under normal operations, just like a circuit breaker in your electricity switch box, the breaker is closed and traffic flows normally. However, once the predetermined error threshold has been exceeded, the breaker enters the open state, and all requests immediately fail without even being attempted. After a set period, the circuit enters a half-open state and allows a further request through. In this state, a single failure immediately returns the circuit to the open state, regardless of the error threshold. Once some requests have been processed without any error, the circuit returns to the closed state, and it opens again only if the number of failures exceeds the error threshold.
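The closed, open, and half-open states described above can be sketched as a small Python class. This is a minimal illustration, not a production implementation: the class and parameter names are hypothetical, the breaker opens when the count of consecutive failures reaches failure_threshold (one possible reading of the error threshold), and it isn't thread-safe. The clock is injectable so the wait period can be simulated.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.failures = 0
        self.successes = 0

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Fail immediately without attempting the request.
                raise CircuitOpenError("circuit is open; call not attempted")
            self.state = "half-open"  # wait period over: allow a trial request
        try:
            result = func()
        except Exception:
            self.failures += 1
            # In half-open, one failure reopens the circuit regardless of the threshold.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self._open()
            raise
        if self.state == "half-open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"  # enough clean requests: resume normal traffic
                self.failures = 0
                self.successes = 0
        else:
            self.failures = 0  # assumption: only consecutive failures are counted
        return result
```

A caller would wrap each downstream request in breaker.call(...) and treat CircuitOpenError as the signal to serve the downgraded experience, for example a page without recommendations.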

Error behavior isn't a question software engineering can answer on its own; all business stakeholders need to be involved in this decision. When planning the design of systems, talk about failure as part of the non-functional requirements. Decide ahead of time what will be done when the downstream service fails.