Resilience with Hystrix

One of the most interesting sessions I attended this year at JavaZone was "Resilience with Hystrix" by Uwe Friedrichsen. If you work with, or are interested in, distributed systems, I highly recommend watching that talk. The recording is available on Vimeo, but don't go and watch it just yet.

What does resilience mean and what is Hystrix?

Uwe defined resilience as:

the ability of a system to handle unexpected situations. In the best case without the user noticing it. In the worst case with a graceful degradation of service.

In other words, a resilient system continues to work as well as possible when one or more subsystems fail or misbehave.

Hystrix is an open source library created by Netflix. In their own words, "It is designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable."

How it works

The only requirement for using Hystrix in our projects is that the operations we want to protect are organized as commands. That means creating one class per operation that extends HystrixCommand and implements a run() method. The command class can, of course, be just a thin wrapper around an existing service.

This is how the classic Hello World looks:
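A minimal sketch of that shape follows. To keep it runnable without the Hystrix dependency, SketchCommand is a hypothetical stand-in for HystrixCommand, not the library class. With the real library, the command would extend com.netflix.hystrix.HystrixCommand<String> and pass a HystrixCommandGroupKey to the superclass constructor. The class name CommandHelloWorld follows the Hello World example in the Hystrix documentation.

```java
// Hypothetical stand-in for com.netflix.hystrix.HystrixCommand<T>, included only
// so that this sketch compiles without the Hystrix dependency.
abstract class SketchCommand<T> {
    // The operation to protect; subclasses put their (possibly remote) call here.
    protected abstract T run() throws Exception;

    // Synchronous entry point. The real library wraps run() with isolation,
    // timeouts and circuit-breaker checks; this stand-in only delegates.
    public T execute() {
        try {
            return run();
        } catch (Exception e) {
            throw new RuntimeException("command failed", e);
        }
    }
}

// One class per operation: the work goes into run().
class CommandHelloWorld extends SketchCommand<String> {
    private final String name;

    CommandHelloWorld(String name) {
        this.name = name;
    }

    @Override
    protected String run() {
        // A real command would call a remote service here.
        return "Hello " + name + "!";
    }
}
```

Calling new CommandHelloWorld("World").execute() runs the command synchronously and returns the greeting.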

There are different reasons why a command may fail. A remote server may not respond at all, or not respond in time. A thread pool may be full. In most of these cases, retrying the request only puts more pressure on the failing resource and makes things worse.

Hystrix is based on the circuit breaker pattern. It monitors the execution of the different commands and calculates the health of each circuit from the execution history. Once a certain failure threshold is reached, the circuit opens and subsequent executions are short-circuited. After a recovery period elapses, a trial request is let through, and the circuit closes again only if that request succeeds.
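The open and close cycle can be sketched in plain Java. This is not Hystrix's implementation: SketchCircuitBreaker and all of its parameters are hypothetical names, and the sketch trips on a simple count of consecutive failures to keep the state machine readable. The clock is injected so that the recovery period can be exercised without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative circuit breaker, NOT the Hystrix implementation.
class SketchCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that trip the circuit
    private final long sleepWindowMillis; // how long to short-circuit before a trial call
    private final LongSupplier clock;     // injectable time source, in milliseconds

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SketchCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized State state() {
        return state;
    }

    // Runs the operation unless the circuit is open; falls back otherwise.
    <T> T call(Supplier<T> operation, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // short-circuited: the operation is not invoked
        }
        try {
            T result = operation.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN; // recovery period over: allow one trial request
            return true;
        }
        // While HALF_OPEN a trial is already in flight, so only CLOSED allows calls.
        return state == State.CLOSED;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // trip, or re-open after a failed trial
            openedAt = clock.getAsLong();
        }
    }
}
```

In real use the clock would typically be System::currentTimeMillis. While the circuit is open, the protected operation is never called, which is what relieves the pressure on the failing resource.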

This may sound simple but under the hood it implements a relatively complex workflow.

[Figure: Hystrix flow chart]

In addition, the decision to open or close the circuit is not based on a simple failure counter. The number of failures matters, but Hystrix also keeps track of other measures and statistics over the recent history, such as the request volume and the error percentage.
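As a rough illustration of deciding on recent statistics rather than a raw counter, the sketch below trips only when enough requests were seen in a rolling time window and the share of failures reaches a threshold. SketchRollingHealth and its fields are hypothetical names; the two thresholds are modelled on Hystrix's circuitBreaker.requestVolumeThreshold and circuitBreaker.errorThresholdPercentage properties, but the bookkeeping here is a simplification and not how the library stores its metrics.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative rolling-window health check, NOT the Hystrix implementation.
class SketchRollingHealth {
    private final long windowMillis;            // how far back "recent history" reaches
    private final int requestVolumeThreshold;   // minimum requests before judging health
    private final int errorThresholdPercentage; // failure share (0-100) that trips the circuit

    // Each entry is {timestampMillis, failed ? 1 : 0}, oldest first.
    private final Deque<long[]> events = new ArrayDeque<>();

    SketchRollingHealth(long windowMillis, int requestVolumeThreshold, int errorThresholdPercentage) {
        this.windowMillis = windowMillis;
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
    }

    synchronized void record(long nowMillis, boolean failed) {
        events.addLast(new long[] { nowMillis, failed ? 1L : 0L });
    }

    synchronized boolean shouldTrip(long nowMillis) {
        // Drop events that have fallen out of the rolling window.
        while (!events.isEmpty() && nowMillis - events.peekFirst()[0] >= windowMillis) {
            events.removeFirst();
        }
        int total = events.size();
        if (total < requestVolumeThreshold) {
            return false; // too little traffic to draw a conclusion
        }
        long failures = 0;
        for (long[] event : events) {
            failures += event[1];
        }
        // Integer form of: failures / total >= errorThresholdPercentage / 100
        return failures * 100 >= (long) errorThresholdPercentage * total;
    }
}
```

With a volume threshold in place, a couple of failures during a quiet period do not open the circuit, while the same error rate under real load does.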

Luckily all this complexity is handled by Hystrix for us.

[Figure: circuit breaker]

The HystrixCommand class has other methods that can optionally be overridden, such as getFallback(), which is called when the command fails or the circuit is open. These methods allow us to implement a number of interesting patterns: for example, returning a response from a cache if we are not able to return fresh data, or accessing a secondary resource as a backup when the main one is down.
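The cache-as-fallback pattern can be sketched in plain Java as follows. SketchCachedFallback is a hypothetical helper, not part of Hystrix; in a real command the primary call would live in run() and the stale-value lookup in getFallback().

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative "serve stale data on failure" helper, NOT part of Hystrix.
class SketchCachedFallback<K, V> {
    private final Function<K, V> primary; // e.g. a remote call that may throw
    private final V defaultValue;         // returned when nothing was ever cached
    private final Map<K, V> lastKnownGood = new ConcurrentHashMap<>();

    SketchCachedFallback(Function<K, V> primary, V defaultValue) {
        this.primary = primary;
        this.defaultValue = defaultValue;
    }

    V get(K key) {
        try {
            V fresh = primary.apply(key);
            if (fresh != null) {
                lastKnownGood.put(key, fresh); // remember the latest good answer
            }
            return fresh;
        } catch (RuntimeException e) {
            // Graceful degradation: stale data if available, a default otherwise.
            return lastKnownGood.getOrDefault(key, defaultValue);
        }
    }
}
```

The caller always gets an answer: fresh data while the primary works, the last good value when it fails, and the default when there is no history for that key.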

This is just a very simple introduction to Hystrix. The library provides support for caches, thread pools, request timeouts, and configuration of all kinds of parameters at multiple levels. There is also a growing number of plugins for customizing the default behaviours. And last but not least, there is the Hystrix Dashboard - a really nice and functional interface that enables realtime monitoring of Hystrix metrics.

[Figure: Hystrix Dashboard]

I hope this short article has sparked your curiosity about both resilience and Hystrix. If so, now is the moment to go and watch that video.
