Anyone who develops distributed systems knows that there are many issues to resolve before reaching stability. Akka does not entirely avoid these issues and, while many can be handled through configuration or a few lines of code, some problems require extra legwork. One major problem is the split brain. This article explains what a split brain is and examines a solution that does not involve paying Lightbend.
Split brains are the cell division of the concurrent programming world. When nodes in a cluster cannot reach one another, each side must decide how to handle the nodes it cannot reach. Without proper configuration in Akka, each side merely assumes the unreachable nodes are down and removes or gates them. What was a single cluster has divided into two separate clusters.
In the world of concurrent programming, the question is not whether a split brain will occur but when. Networks crash. Hardware fails, needs to be upgraded or updated, or requires replacement every so often.
Unfortunately, there is no free way to automatically handle the problem in Akka. Auto downing, the only freely available method for resolving the unreachable state, is actually not a solution to the split brain problem and will result in the separation of nodes into different clusters.
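To make the failure mode concrete, here is a minimal sketch in plain Java that does not use Akka at all. It assumes a hypothetical five-node cluster with nodes named "a" through "e" and a network partition that separates {a, b, c} from {d, e}. The class and method names are invented for illustration. Each side applies the auto-downing rule of removing every node it cannot reach.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: models auto-downing as plain set arithmetic, not as Akka code.
final class AutoDownSketch {

    // Returns the membership one side keeps after it downs every node it cannot reach.
    static Set<String> membersAfterAutoDown(Set<String> allMembers, Set<String> unreachable) {
        Set<String> remaining = new TreeSet<>(allMembers);
        remaining.removeAll(unreachable);
        return remaining;
    }

    public static void main(String[] args) {
        Set<String> cluster = Set.of("a", "b", "c", "d", "e");
        Set<String> sideOne = Set.of("a", "b", "c");
        Set<String> sideTwo = Set.of("d", "e");

        // Side one cannot reach side two, so it downs d and e.
        System.out.println("Side one keeps: " + membersAfterAutoDown(cluster, sideTwo));
        // Side two cannot reach side one, so it downs a, b and c.
        System.out.println("Side two keeps: " + membersAfterAutoDown(cluster, sideOne));
    }
}
```

Both sides end up with a non-empty membership and neither knows about the other, which is exactly the two independent clusters described above.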
The following graphic on cell division illustrates the split brain problem. Notice how the two cells are completely independent of each other and yet perform the same role.
Strategies for Resolving a Split Brain
Lightbend, the company behind Akka, lays out several strategies for resolving a split brain. In a nutshell, they are:
Unfortunately, Lightbend requires a paid subscription to access implementations of these strategies.
Custom Majority Split Brain Resolver
While the folks behind Akka do not provide free solutions to the split brain problem, they do provide the tools to implement one of the aforementioned strategies.
The following code utilizes the majority strategy:
The preStart method requests the receipt of messages regarding reachability in the cluster. Once the Unreachable message is caught, the code stores the relevant actor reference in a sequence of unreachable nodes and schedules the removal of all unreachable nodes after a period of time if the current set of nodes contains the majority of its kind. After the pres
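The decision at the heart of the majority strategy can be sketched separately from the actor that hosts it. The following plain Java example is not the article's original code and does not depend on Akka. The class name, method names and node names are assumptions made for illustration. It only answers one question: given the known members and the currently unreachable ones, which nodes should this side down? In a real resolver, the actor would call this logic from its scheduled task and pass each returned node's address to Akka's `Cluster.down`.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: the majority decision as a pure function, without Akka.
final class MajorityDecision {

    // True when the reachable nodes form a strict majority of all known members.
    static boolean hasMajority(int reachableCount, int totalCount) {
        return 2 * reachableCount > totalCount;
    }

    // Returns the nodes this side should down: every unreachable member if this
    // side holds the majority, otherwise nothing.
    static Set<String> nodesToDown(Set<String> allMembers, Set<String> unreachable) {
        Set<String> reachable = new TreeSet<>(allMembers);
        reachable.removeAll(unreachable);
        if (hasMajority(reachable.size(), allMembers.size())) {
            Set<String> toDown = new TreeSet<>(unreachable);
            toDown.retainAll(allMembers); // ignore nodes that were never members
            return toDown;
        }
        return Collections.emptySet();
    }

    public static void main(String[] args) {
        Set<String> cluster = Set.of("a", "b", "c", "d", "e");
        // Majority side (a, b, c) sees d and e as unreachable.
        System.out.println("Majority side downs: " + nodesToDown(cluster, Set.of("d", "e")));
        // Minority side (d, e) sees a, b and c as unreachable.
        System.out.println("Minority side downs: " + nodesToDown(cluster, Set.of("a", "b", "c")));
    }
}
```

Two design choices in this sketch are worth noting. An even split counts as no majority, so neither half downs the other and a tie-breaking rule would be needed on top. The minority side also takes no action here. Whether it should shut itself down is a separate decision that this sketch leaves open.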
A split brain is a serious problem. We reviewed ways to solve the issue and presented a free solution using the majority strategy.