Traditionally, two general classes of faults are considered in distributed systems: benign faults and Byzantine faults. In the cloud we also tend to observe a third class: gray failures, or slowdown faults.
Benign faults are those with clear, detectable outcomes. It is often best for the distributed system as a whole if these faults are converted into fail-stop faults. For example, if a producer component is sending key messages to a consumer faster than the consumer can process them and the buffers are full, it can be a better outcome for the producer to fail than to silently drop messages. Another example is setting a memory limit on a Docker container: if a process leaks memory, it is often better for that process to be killed than for the entire host to suffer out-of-memory errors (or worse, for the Linux OOM killer to select the wrong process to kill).
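The producer example above can be sketched in a few lines. This is a hypothetical illustration (the `send` helper and the bounded buffer are stand-ins for a real producer/consumer channel): when the buffer is full, the process exits loudly rather than dropping the message, turning a silent-loss hazard into a fail-stop fault a supervisor can observe and act on.

```python
import queue
import sys

def send(buffer: queue.Queue, message: str) -> None:
    """Enqueue a message, failing fast instead of silently dropping it.

    Hypothetical sketch: `buffer` stands in for the channel between a
    producer and a slower consumer.
    """
    try:
        # put_nowait raises queue.Full immediately when the consumer lags
        buffer.put_nowait(message)
    except queue.Full:
        # Convert the silent-drop hazard into a fail-stop fault: crash
        # loudly so an orchestrator can restart or alert on this process.
        sys.exit("buffer full: failing fast rather than dropping messages")
```

An orchestrator (a container supervisor, for instance) then sees a clean process exit instead of a system that keeps running while quietly losing data.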
Byzantine faults cover a wide array of erroneous behavior and can be very challenging to deal with. The problem is often made worse because most consensus algorithms do not handle these faults - notably Raft, which is gaining popularity in cloud services (Consul, Kafka, and Aeron Cluster, for example). The application developer is left to solve these kinds of problems.
Much like with benign faults, it is often preferable for a distributed system to do what it can to convert a Byzantine fault into a fail-stop fault. For example, if you have a Raft cluster whose nodes have diverged in their internal state due to an application defect, you would want the cluster to fail the moment leadership changes, rather than having a new leader take over with unknown state and produce an invalid or corrupted Raft log.
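One way to approach this conversion is a minimal sketch along the following lines (the function names and the dictionary-shaped state are illustrative assumptions, not part of any Raft implementation): each node computes a deterministic digest of its applied state machine, and on a leadership change a follower compares its digest against the leader's and halts on mismatch rather than serving from divergent state.

```python
import hashlib
import json

def state_digest(state: dict) -> str:
    """Deterministic digest of a node's applied state (hypothetical helper).

    Serializing with sorted keys makes equal states hash identically
    regardless of insertion order.
    """
    canonical = json.dumps(state, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def verify_on_leader_change(leader_state: dict, follower_state: dict) -> None:
    """Fail-stop if a follower's state has diverged from the new leader's."""
    if state_digest(leader_state) != state_digest(follower_state):
        # Divergence means an application defect corrupted replicated state;
        # halting here is preferable to electing a leader with unknown state.
        raise SystemExit("state divergence detected: halting cluster node")
```

This does not make Raft Byzantine fault tolerant - it merely turns one detectable class of divergence into a fail-stop fault before it can corrupt the log further.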
Gray failures have a wide array of causes - the common theme being that a resource external to your application, such as a CPU or network, is operating slower than expected without raising errors. Common examples include:
- a disk or storage array hardware issue that results in slow file system I/O
- a faulty DIMM that results in slow memory access
- a faulty network cable that results in poor network performance
- a network that is running slowly due to oversaturation by other processes
These faults are especially common in the cloud, or in any environment with hundreds (or more) of servers deployed. As distributed systems involve more processes and servers, they become more likely to be subjected to faults caused by something slowing down. These can be very complex to deal with, and can lead to impacts or outages in system components that are not themselves slowed down - so tracking them down can take time.
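Because slowdown faults raise no errors, the application has to detect them by measurement. Here is a minimal sketch of that idea (the class name, baseline, multiplier, and window size are all illustrative assumptions): record the duration of each operation and flag when the recent average drifts far above an expected baseline.

```python
from collections import deque

class SlowdownDetector:
    """Flag operations whose recent latency greatly exceeds a baseline.

    Hypothetical sketch: the baseline, factor, and window size are
    illustrative values, not tuned recommendations.
    """

    def __init__(self, baseline_s: float, factor: float = 5.0,
                 window: int = 100):
        self.baseline_s = baseline_s
        self.factor = factor
        # Rolling window of the most recent operation durations.
        self.samples = deque(maxlen=window)

    def record(self, duration_s: float) -> bool:
        """Record one operation's duration.

        Returns True when the rolling average exceeds the baseline by
        the configured factor - a hint that a gray failure may be
        in progress (slow disk, saturated network, faulty DIMM, ...).
        """
        self.samples.append(duration_s)
        avg = sum(self.samples) / len(self.samples)
        return avg > self.baseline_s * self.factor
```

A detector like this does not tell you *which* resource is slow - only that something is - but surfacing the signal early narrows the search considerably.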
- Added 13 December 2020
- Updated with slow down faults, 3 April 2021