Resilient Gateways

Adding resilience to Aeron Cluster Gateways


Leadership election and balanced task distribution are vital functions in many distributed systems. Adding logic to Aeron Cluster to perform these functions is straightforward and can simplify the wider system estate.

Valid Scenarios

This approach suits systems without extremely strict performance requirements, or any scenario where the cluster or cluster log needs to be protected from unnecessary inputs. It also suits systems where the cluster needs, or already manages, the underlying data; it should not be used where it would force data through the cluster just to gain a simplified estate.

Aeron Cluster provides a Raft-based service container for running your business logic. By pushing some resilience logic into the cluster, you can simplify the architecture of the system while still meeting strict performance and resilience requirements.

Let's consider an example common in trading systems - outbound trade integration flows. We start with a trading system which contains the core system state within an Aeron cluster. For every trading party involved, we need to open up exactly one API connection to a third party system so that we can send trade confirmation data. Additionally, we must ensure that the system has built-in resilience and can recover itself from typical failures.

Reminder: to preserve determinism, Raft-based systems discourage cluster nodes from performing any I/O other than that replicated via the Raft protocol.

Approach 1: Use an external service to manage API gateway connectivity and resilience

[Diagram: API Gateways 1 and 2 connecting to a remote system, coordinated via a Consul leader and two further Consul nodes, alongside the cluster leader and two cluster followers.]

To achieve exactly-once connectivity with resilience, we can use locks on key-value pairs in Consul (or znodes in ZooKeeper). At start-up, the API Gateways receive a list of connections to open. Each connection is mapped to a key-value pair, and the API Gateways compete to take out the lock on each pair. A gateway that has obtained a lock is free to open the outbound connection; a gateway that does not hold the lock must not open it.

To introduce resilience against a failed API Gateway process, locks are given timeouts and each API Gateway sends regular keep-alive requests. The moment a lock is freed, the other connected API Gateways can again try to claim it and reopen the connection.

Note that distribution can be uneven: if processes start minutes apart, the first node to start will likely claim the majority of the work.
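The lock competition above can be sketched in plain Java. This is a minimal, hedged model: `LockStore` is an in-memory stand-in for Consul's key-value acquire semantics (in production, acquisition is a `PUT /v1/kv/<key>?acquire=<sessionId>` against a session carrying the TTL), and the class and method names are illustrative, not part of any Consul client library.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for Consul's key-value lock semantics. In production
// this maps to session-based KV acquire calls, with the session TTL
// refreshed by keep-alives.
final class LockStore {
    private final Map<String, String> owners = new ConcurrentHashMap<>();

    // Returns true if the caller now owns the lock (acquisition is first-wins).
    boolean acquire(String key, String owner) {
        return owner.equals(owners.computeIfAbsent(key, k -> owner));
    }

    // Frees every lock held by a failed owner, as a TTL expiry would.
    void expire(String owner) {
        owners.values().removeIf(owner::equals);
    }
}

// Each gateway competes for every connection's lock and only opens the
// outbound connections whose locks it won.
final class ApiGateway {
    final String id;

    ApiGateway(String id) { this.id = id; }

    long compete(LockStore store, List<String> connectionKeys) {
        return connectionKeys.stream().filter(k -> store.acquire(k, id)).count();
    }
}
```

The model also shows why distribution is uneven: the first gateway to run `compete` wins every lock, and the others win work only after a lock expires.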

Approach 2: Build the logic into the Aeron Cluster service container

[Diagram: API Gateways 1 and 2 connected as cluster clients to the cluster leader and two cluster followers, with the gateways connecting outbound to the remote system.]

Moving the logic of resilient task distribution into the cluster can provide a better solution. We can safely construct this logic within Aeron Cluster because:

  • Aeron Cluster fires callbacks when a cluster client connects and disconnects
  • We can direct outbound events from the cluster to any particular cluster client we choose
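To keep the example self-contained, the two callbacks can be modelled in plain Java. The method names mirror Aeron's `ClusteredService.onSessionOpen` and `onSessionClose`, but this is a simplified sketch: sessions are reduced to `long` ids rather than Aeron's `ClientSession` objects, and no Aeron dependency is used.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Plain-Java model of the two cluster callbacks we rely on. In Aeron these
// are ClusteredService.onSessionOpen and onSessionClose; session ids here
// stand in for ClientSession objects.
final class SessionTracker {
    private final Set<Long> connected = new LinkedHashSet<>();

    void onSessionOpen(long sessionId)  { connected.add(sessionId); }
    void onSessionClose(long sessionId) { connected.remove(sessionId); }

    boolean isConnected(long sessionId) { return connected.contains(sessionId); }
    int connectedCount()                { return connected.size(); }
}
```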

The logic inside the cluster would need to have the following components:

  • On start, the cluster needs to know the list of connections to make. This can be loaded via the usual commands.
  • When a cluster client connects, it sends a command registering itself with its capabilities, for example RegisterClient(API_GATEWAY).
  • A simple object tracks which Aeron Cluster clients are connected at any time, and of which type they are.
  • When outbound connectivity is needed, the logic holds a map from API Gateway to outbound connections and defines the rules for distributing those connections. When a gateway is to open a connection, the cluster directs a ConnectionReadyToOpenEvent(party) to the appropriate API Gateway.
  • If a connected API Gateway is lost, the cluster logic receives the event and redistributes its tasks across the remaining connected services.
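The components above can be sketched as a single deterministic object. This is a hypothetical illustration, assuming a round-robin distribution rule; the class name `ConnectionDistributor` and its methods are not part of Aeron. Because every input (registration, session loss) arrives via the replicated log, each cluster node computes identical assignments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Deterministic distribution of outbound connections across registered API
// Gateway sessions. All state changes are driven by replicated cluster
// inputs, so every node computes the same assignments.
final class ConnectionDistributor {
    private final List<String> connections;                   // parties needing a connection
    private final Set<Long> gateways = new LinkedHashSet<>(); // registered API_GATEWAY sessions
    private final Map<String, Long> assignment = new LinkedHashMap<>();

    ConnectionDistributor(List<String> connections) {
        this.connections = new ArrayList<>(connections);
    }

    // Handles RegisterClient(API_GATEWAY) from a newly connected session.
    void onGatewayRegistered(long sessionId) {
        gateways.add(sessionId);
        rebalance();
    }

    // Handles the session-close callback for a lost gateway.
    void onGatewayLost(long sessionId) {
        gateways.remove(sessionId);
        rebalance();
    }

    // Round-robins the connections over the surviving gateways. In the real
    // service, each (re)assignment would trigger a ConnectionReadyToOpenEvent
    // offered to the owning client session.
    private void rebalance() {
        assignment.clear();
        if (gateways.isEmpty()) return;
        List<Long> order = new ArrayList<>(gateways);
        for (int i = 0; i < connections.size(); i++) {
            assignment.put(connections.get(i), order.get(i % order.size()));
        }
    }

    Long ownerOf(String party) { return assignment.get(party); }
}
```

Unlike the Consul approach, the distribution rule is entirely under your control: swapping round-robin for weighted or sticky assignment is a local code change rather than a new coordination pattern.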

Options Compared

Aeron Cluster

  Pros:
  • Fewer moving parts
  • Full control over task balancing
  • Fast recovery times (1-2s)

  Cons:
  • No management UI; this would have to be built
  • More code to maintain

Consul

  Pros:
  • Known tool
  • Easy to manage, with a full UI

  Cons:
  • Lock TTLs of no less than 10s, slowing recovery
  • Three or more additional processes to deploy
  • Uneven task distribution

© 2009-2023 Shaun Laurens. All Rights Reserved.