Distributed System Faults

The core types of faults encountered with distributed systems

Traditionally, two general classes of faults are considered in distributed systems: Benign faults and Byzantine faults. On the cloud we also tend to observe another class of fault: gray failures or slow down faults.

Benign faults

Benign faults are those which have clear, detectable outcomes. Often it can be best for the distributed system as a whole if these kinds of faults are converted into fail-stop faults. For example, if a producer component of the system is sending key messages to a consuming system faster than that consuming system can consume, and the buffers are full - it can be a better outcome if that process fails rather than silently dropping messages. Another example is memory limits on a Docker container to allow only so much memory usage by a given container - if we have a process that leaks memory, it is often better for that process to be killed rather than having the entire host have issues due to out of memory errors (or worse, having the Linux OOM Killer select the wrong process to kill).

Byzantine faults

These faults cover a wide array of erroneous behavior, and can be very challenging to deal with. The problem is often made worse since most consensus algorithms do not properly handle these faults - notably Raft, which is gaining popularity in cloud services (Consol, Kafka, Aeron Cluster for example). The application developer is left to solve these kinds of problems.

Much like with benign faults, it is often preferable for a distributed system to do what it can to convert a Byzantine fault into a fail-stop fault. For example, if you have a Raft cluster that has nodes that have diverged in their internal state due to an application defect, you would want to have the Raft cluster fail the moment the current leader has changed, rather than having a new leader take over with an unknown state, producing an invalid or corrupted raft log.

Gray failures or Slow down faults

There are a wide array of causes of these faults - the common theme being that an external (to your application) resource, such as a CPU or network, is operating slower than expected and without raising errors.

  • a disk or storage array hardware issue that results in slow file system I/O
  • a faulty DIMM that results in slow memory access
  • a faulty network cable that results in poor network performance
  • a network that is running slow due to over saturation by other processes.

These are especially common on the cloud or in an environment with hundreds (or more) of servers deployed. As distributed systems involve more and more processes and servers, we are more likely to be subjected to faults that are caused by something slowing down. These can be very complex to deal with, and can lead to impacts or outages in system components not being subjected to the slow down - so tracking them down can take time.


G. Goos et al., “Lecture Notes in Computer Science,” vol 5959, doi: 10.1007/978-3-642-11294-2.
Miguel Castro and Barbara Liskov. 1999. Practical Byzantine fault tolerance. In Proceedings of the third symposium on Operating systems design and implementation (OSDI '99). USENIX Association, USA, 173–186. doi: 10.5555/296806.296824.
Pat Helland. 2021. Fail-fast Is Failing... Fast! Changes in compute environments are placing pressure on tried-and-true distributed-systems solutions. Queue 19, 1, Pages 60 (January-February 2021), 11 pages. doi: 10.1145/3454122.3458812.
Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. 2017. Gray Failure: The Achilles' Heel of Cloud-Scale Systems. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HotOS '17). Association for Computing Machinery, New York, NY, USA, 150–155. doi: 10.1145/3102980.3103005.

Change log

  • Added 13 December 2020
  • Updated with slow down faults, 3 April 2021
reading time
3 min read
last updated
review policy
Distributed Systems
--- Views

© 2009-2021 Shaun Laurens. All Rights Reserved.