Split brain is an error condition in clustered systems built on replicated state machines. It must be prevented in order to guarantee correct operation.
The condition can occur when the cluster relies on a strong leader and a network partition disconnects the cluster nodes from each other while clients can still reach all of them. Without a protocol to decide which cluster node is the leader, each node could conclude that, since the others are unavailable, it must become leader itself. Split brain is especially dangerous for a replicated state machine: with more than one node acting as leader, clients may modify the state of several replicas in parallel, so the replicas diverge and the system state is corrupted.
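As a minimal sketch of how this goes wrong, assume a two-node cluster and a naive failover rule ("if I cannot reach the leader, I become leader"). The function name `naive_failover` and the node names are hypothetical and only illustrate the failure mode described above.

```python
def naive_failover(node: str, leader: str, reachable: set[str]) -> bool:
    """Hypothetical rule: stay leader, or promote yourself when the
    current leader cannot be reached. No coordination with other nodes."""
    return node == leader or leader not in reachable


# Two-node cluster where "a" is the current leader.
leader = "a"

# A network partition isolates the nodes: each one can only reach itself.
reachable = {"a": {"a"}, "b": {"b"}}

# Every node evaluates the rule independently on its side of the partition.
leaders = sorted(n for n in reachable if naive_failover(n, leader, reachable[n]))

print(leaders)  # ['a', 'b'] -> two leaders at once: split brain
```

Both nodes now accept writes from clients, which is exactly the situation the solutions below are meant to rule out.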
There are several solutions to prevent split brain conditions:
- Invest very heavily and build a bulletproof network. This approach was more common in decades past and is, in theory, a valid solution. In the cloud it would likely be impossible.
- Use an external leader election service that all cluster nodes can communicate with, and have it elect the leader. Services such as HashiCorp Consul or Apache ZooKeeper offer this functionality. When nodes lose access to this service, they must stand down and reject further requests.
- Run a consensus algorithm in the cluster, for example Raft. Raft requires a quorum (votes from a majority of the cluster's nodes) to elect a leader, which ensures that nodes in a minority network partition cannot become leader.
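The quorum rule from the last option can be sketched in a few lines. This is not Raft itself (there are no terms, logs, or vote requests here), only the majority check that keeps a minority partition from electing a leader. The helper names `has_quorum` and `can_elect_leader` are assumptions made for illustration.

```python
def has_quorum(votes: int, cluster_size: int) -> bool:
    """A quorum is a strict majority of the full cluster size."""
    return votes > cluster_size // 2


def can_elect_leader(partition: set[str], cluster_size: int) -> bool:
    """Best case for a candidate: every node it can reach, including
    itself, votes for it. Even then it needs a majority of the whole
    cluster, not just of the nodes it can currently see."""
    return has_quorum(len(partition), cluster_size)


# Five-node cluster split 3 / 2 by a network partition.
cluster_size = 5
majority_side = {"a", "b", "c"}
minority_side = {"d", "e"}

print(can_elect_leader(majority_side, cluster_size))  # True
print(can_elect_leader(minority_side, cluster_size))  # False

# An even 2 / 2 split of a four-node cluster leaves neither side with a quorum.
print(can_elect_leader({"a", "b"}, 4))  # False
```

Because two disjoint partitions can never both hold a strict majority, at most one side can elect a leader. The even split shows the trade-off: the cluster becomes unavailable rather than inconsistent, which is one reason odd cluster sizes are commonly chosen.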
Added 13 December 2020