Failure Detection in Cloud Hosted Trading Systems

An overview of complexities and recommendations for failure detection in cloud hosted trading systems

Detecting failures in a trading system can be a complex task on internal, tightly controlled hardware. Building the trading system for the cloud brings a whole host of new complexities in detecting failure. This article takes a look at the challenges and some potential solutions.

Abstract Trading System Model

Before going further, it's worthwhile defining an abstract model of a trading system that will be referred to in this article.

This representation includes:

  • The internal trading system. It is assumed that at deployment time there are tens of services operating partially synchronously1 during normal system operation
  • Observers - components that external to the trading system processes and monitor the system with probes and quantitative reports
  • Reactors - components that act on an output of the observer, and which will cause some automated change on the trading system.
  • External egress - this is some form of structured output from the system, and will vary depending on the boundaries of the particular trading system. This could include output destined for internal billing systems, trade settlement systems, risk systems etc.
  • External ingress - this is some form of external input to the system, and will also vary depending on the particular trading system. Ingress sources can include internal systems such as risk systems sending credit limit updates, or external systems sending in market data etc.
  • External clients - these are clients using an API provided by the trading system, such as a FIX gateway. These clients are not controlled by the trading system developers.
  • Internal clients - these are clients using a private API provided by the trading system, such as a UI. These clients are fully controlled by the trading system developers but are typically deployed on hardware not owned or controlled by the trading system developers.

In addition to this, there are several key characteristics we need to keep in mind:

  • Trading systems are not large scale systems - we typically cannot apply Netflix or Amazon like approaches with such relatively small fleets of servers (typically in the tens, rarely over one hundred).
  • Trading systems are typically subject to well-defined SLAs.
  • Correctness is typically critical to trading systems. While trades can be canceled or corrected after the fact, it's far more desirable to have a system that behaves correctly under most scenarios.

Types of failures considered

This article will look at the following types of failures, all within the context of a partially synchronous1 system:

  • Process crashes
  • Network outages
  • Misbehaving processes
  • Omission failures
  • Gray failures

Process crashes

Process crashes can be surprisingly hard to detect reliably, and has been the subject of extensive academic research. This research has focused on a component called a Failure Detector. A Failure Detector is an abstraction which seeks to answer the question 'has a process crashed?', as perfectly as it can - which in a partially asynchronous is not very perfect at all. These typically work in two ways - heartbeats and pings.

  • With a heartbeat, Process A will send a message to Process B informing it is still alive.
  • With a ping, Process A will ask Process B if it is still alive, and if Process B is still alive, Process B will respond.

It sounds simple enough, but there are several failure scenarios that complicate matters, three of which are shown below:

The scenarios are:

  • Process A sends a ping to Process B. Process B is alive and operating correctly, however, it responds after Process A's ping timeout. Process A now suspects Process B is dead, even though it is just running slow. This is not uncommon in cloud environments.
  • Process A sends a ping to Process B. Process B is alive and responds correctly within timeout, but crashes immediately afterwards. Process A now suspects that Process B is alive, when it is in fact dead.
  • Process A sends a ping to Process B. Process A's message never reaches Process B since there is a network partition between the two. Meanwhile, Process C successfully pings Process B. Now we have Process A that suspects Process B is dead, while Process C suspects it is healthy. This is again not uncommon in cloud environments.

These complications show how the answer from a failure detector can never be perfect. So we have to accept working with an imperfect answer, and we must in turn be very careful of using a reactor too quickly. There are many approaches to building failure detectors, and they all have different trade-offs in terms of completeness and accuracy. If your trading system already makes use of a consensus algorithm (for example, RAFT), then at least some of your failure detectors are built in already.

Network outages

Network outages can be very challenging to understand, and in some cases will be hard to distinguish from process crashes. Example network outages include, all of which are not uncommon in cloud environments:

  • an outage that only impacts communications between a subset of components, as per the third example in process crashes above.
  • an intermediate process (for example, a load balancer) outage or issue could reduce the bandwidth available between components or dramatically increase response times, resulting in technically working, but slower than required network traffic.
  • a hardware outage or capacity issue that results in a large volume of missing packets on a network link, requiring that the system must retransmit missing data, resulting in reduced network throughput.

Misbehaving processes

Processes which deviate from expected behavior in some way can cause outages of varying severity, and in some conditions, automated reactors doing their job can make matters worse. Consider the following examples:

  • a process has a memory leak defect. If you're running on Linux, without control group limits on process memory, and you have the Linux OOM Killer turned on, you may have an innocent but vital process killed while the problematic process continues to consume memory.
  • a replicated state machine hosted within a RAFT cluster has a determinism bug. Different nodes now have differing views on the state of the world.
  • a process has a misconfiguration resulting in a crash after receiving a number of requests. Clients of the process must retry their actions, reducing system throughput. A reactor notices the crash and automatically restarts the process. The process then proceeds to crash again after a given interval, dragging down overall system throughput when it would have been better to completely remove the misconfigured process from the system.

Omission or out of range failures

These failures incorporate any issue that results from ingress or egress missing data, or being outside expected ranges. Examples may include:

  • a market data vendor has a fault, and instead of sending data updates every 100ms, sends them every 10 minutes. The data is still valid, and the link is operational, however the data volumes are not at the expected level
  • a misbehaving algorithmic trading client starts sending in trade requests at a level far beyond their agreed volumes. Despite the vast bandwidth available on the cloud, this can impact if private VPN links are involved, as is typical for trading systems.
  • an observer process fails, and your monitoring system stops receiving data from a host. Now your alerting system is either stale and lying to you, or is out of action.

Gray failures

Gray failures happen whenever there is a disconnect between what a client perceives, and an observer observes.

There are several reasons for why this disconnect may arise:

  • A failure is invisible to the observer - for example, the code or infrastructure path taken from the observer is different to that taken by the client.
  • Infrastructure redundancy design leads to an unstable network from the client perspective. This has been an issue with cloud operators historically.
  • A reactor invokes an action on a component that does not actually do anything to help in the particular problem, but instead leads to a cascading failure

What can be especially painful with gray failures is when the failure is due to the interplay of two or more components. In this case, the failure is typically highly complex, and the cause is so unclear that different teams end up blaming each other. This gets more complex in the cloud as you cross one or more organizational boundaries.

Approaches to the problems

Standard approaches - such as operating system level and process log monitoring - are not discussed below, and are assumed to already be in place. This monitoring is a vital tool in the toolkit to detect failures in trading systems, and should be carefully applied. One particular challenge does arise though with application level metrics - sometimes the capturing and transmission of application level metrics can negatively impact the performance of the trading system.

Avoiding gray failures

One of the more common approaches to avoiding gray failures in a trading system is to invoke real business logic, as used by clients. This typically requires partitioning of a trading environment into 'production' and 'continuous verification' logical partitions. These partitions both exist at runtime within the same process and hardware, and are coded such that the two partitions can never interact, and the 'continuous verification' partition never results in any trade settlement.

The trading system's business logic is then exercised using the official trading system APIs by a continuous verification client. This continuously exercises the system exactly as a client would, performing all the key trading activities. If the system notices errors or any slow performance it can raise alerts for the operations team.

In addition to this business logic check, the continuous verification client can cause an gossip driven inspection of all internal state and connectivity within the platform. A command is sent to the platform to perform a self test. This eventually results in every component running a test against its internal state and against all of its dependencies. The dependencies in turn check their internal state and all of their dependencies. A unique identifier is applied to an inspection request to prevent endless checks. Timeouts are used to detect suspect processes. As results come in, they are reported back to the caller, and eventually a picture emerges of the system health. Note that this approach requires that the system has some self awareness as to the processes making up the system at the time a check is started. A side effect is that this can help detect network outages that only exist from process to process.

Finally, internal clients should send in any errors observed to a logging endpoint that is ideally on a different path to the primary communication channel(s). Thus, if there is a problem with connectivity to the primary channels, alerts may still be raised.

While these three approaches cannot prevent all gray failures, they can significantly reduce their occurrence.

Detecting failed processes

A combination of both simple heartbeats and outsourced heartbeats can give a somewhat accurate view on the state of processes for each component in the system.

  • simple heartbeats operate as expected - each process sends heartbeats to all of its dependencies. If a process does not reply, it is marked suspect and outsourced heartbeats are requested from other known processes.
  • outsourced heartbeats are simple heartbeats performed on behalf of another process. If we consider that the direct heartbeat failed between Process A and Process B, Process A could then request that Process C and Process D heartbeat Process B, and return the results.

The outsourced heartbeats approach can give some insulation to network outages. In the example above, if Process C and Process D responded to Process A that Process B was alive, then it can assume that there is a network outage from itself to Process B. It can then reroute requests, and keep on attempting heartbeats in the background for Process B until further information comes in.

Out of range failures

Exceeding capacity

By building all gateway components with strict per connection resource and cost monitoring along with automatic disconnections, issues resulting from misbehaving clients can be reduced. A common technique for this is making use of cost controlled rate limiting. For example, if the trading system exposes a particularly expensive operation to clients - such as retrieving a lot of historical data - apply a high cost. If another action, like submitting a bid to a CLOB is very cheap, then assign a very low cost. Then, limit the rate based using a per client cost budget within a given time window. If any client exceeds its cost budget, disconnect it.

Lower than expected volumes

This can happen be applied to many parts of the system. For example, if there are typically 500 users logged in by 10am, and yet only 20 are logged in on a trading day, then there could be an issue worth investigating. Other cases can require windowing functions - for example, if you're expecting 1,000 updates from a market data vendor every second, however the last minute has averaged only 200 updates a second, then raise an alert.

Misbehaving processes

A very challenging set of problems lie in this area. From an external point of view, the 'continuous verification client' discussed above can be very helpful to keep track of correct operation of the system as a whole.

If you're running a clustered replicated state machine based trading system (e.g. for the matching engine), performing continuous out-of-band verification of the replicated state machines' internal state can reliably detect determinism issues. There are several approaches to this, and will vary according to how the trading system is built. If any determinism issues are found, then the only safe course of action is for a diverged cluster node to shoot itself in the head if a leader election happens, and the diverged node is selected as the leader. This may lead to a complete outage (if for example 2 of 3 nodes detected a determinism issue), but this is a far lesser evil than a trading system in a bad state.

Overzealous reactors

Automated reactors should be used very carefully, with alerting and counters in place for every event. If a particular reactor is restarting what it believes to be a failed process continuously, then it is likely causing more harm than good. Verify the behavior of all reactors once they've performed an action, and apply a limit to the number of times a given reactor can perform the same activity within a given time window.

Change Log

  • First draft 4 April 2021


Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. 2017. Gray Failure: The Achilles' Heel of Cloud-Scale Systems. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HotOS '17). Association for Computing Machinery, New York, NY, USA, 150–155.doi: 10.1145/3102980.3103005.
Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable failure detectors for reliable distributed systems. J. ACM 43, 2 (March 1996), 225–267. doi: 10.1145/226643.226647.
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: amazon's highly available key-value store. InProceedings of twenty-first ACM SIGOPS symposium on Operating systems principles(SOSP '07). Association for Computing Machinery, New York, NY, USA, 205–220.doi: 10.1145/1294261.1294281.
Shihui Song. 2014. Survey on Scalable Failure Detectors.Stanford.
Christian Cachin, Rachid Guerraoui, and Lus Rodrigues. 2011. Introduction to Reliable and Secure Distributed Programming (2nd. ed.). Springer Publishing Company, Incorporated.
Dimos Raptis. 2021. Distributed Systems for Practitioners.


  1. Fully asynchronous systems have no timing assumptions built in - in other words, nothing within the trading system is monitoring that a specific piece of data is processed within a given timeframe. Fully asynchronous interactions are rarely found within a trading system, but can be common with egress. Partially asynchronous systems run asynchronously for a portion of the time, and synchronously for a different portion. With a consensus based design, sequencers or match engines, we can consider the behaviors of independent gateways as asynchronous and any interactions with the central core as synchronous. Thus, typical trading systems run as partially synchronous systems. 2

--- Views
review policy

last updated
Distributed Systems

© 2009-2023 Shaun Laurens. All Rights Reserved.