Chain Replication, Craq & Hermes

Highly available cluster coordination with high throughput and strong consistency guarantees

The Chain Replication family offers an interesting approach to building systems with high throughput and high availability while still offering strong consistency guarantees. The protocols are especially useful for read-heavy workloads. Three protocols are discussed below:

  • Chain Replication (2004)
  • Chain Replication with Apportioned Queries (2009)
  • Hermes (2020)

Chain Replication

Chain Replication is structured very differently from most high-availability services. For example:

  • Chain Replication has no capability to elect a leader or handle split brain internally; it must rely on an external configuration manager to provide this service.
  • Chain Replication can offer higher performance than most other replication protocols while still preserving linearizability. The head node is not as heavily taxed as Raft's leader, and since reads are served by the tail, there is no need to replicate read commands as Raft must.
  • Unlike Raft, which can continue to perform well even with a slow follower, Chain Replication's performance suffers if any node in the chain runs slowly.
  • Again unlike Raft, Chain Replication can be sharded very easily.

A typical deployment architecture would look something like this:

  • All nodes connect to a configuration manager
  • The connectivity logically flows from a head node to a tail node via a chain of replicas
  • Clients connect to both the head and tail nodes, depending on the operation they are performing.
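The chain layout handed out by the configuration manager can be pictured as a small, ordered structure. A minimal sketch, with all names illustrative rather than taken from any of the papers:

```python
# Hypothetical chain layout as a configuration manager (e.g. ZooKeeper)
# might distribute it to each node. Field names are illustrative.
chain_config = {
    "epoch": 7,                               # bumped on every reconfiguration
    "chain": ["node-a", "node-b", "node-c"],  # ordered head .. tail
}

def role_of(node: str, config: dict) -> str:
    """Derive a node's role purely from its position in the chain."""
    chain = config["chain"]
    if node == chain[0]:
        return "head"
    if node == chain[-1]:
        return "tail"
    return "replica"
```

Deriving roles from position (rather than storing them separately) keeps the configuration minimal: a reconfiguration is just a new chain list with a higher epoch.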

Protocol

Assumptions:

  • The nodes are connected to a highly available configuration manager (for example, ZooKeeper), and all nodes are correctly configured in a chain.
  • One node is set to be the head, another the tail, and a chain of intermediate nodes (replicas) are between them.
  • Writes are sequenced (i.e. have a monotonically incrementing counter assigned)

The write protocol operates as follows:

  • client connects to the head node, and submits the write request
  • the head updates its internal state and forwards the update to the next replica in the chain
  • the process repeats until the tail node has received the update
  • once the tail node processes the update, the tail replies to the client and the update is committed.

The read protocol operates as follows:

  • client connects to the tail node, and submits the read request
  • the tail node responds with the latest value.
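The write and read paths above can be sketched in a few lines. This is an illustrative single-process model, not the paper's implementation; the class and method names are assumptions:

```python
# Sketch of Chain Replication's write and read paths. Writes flow from
# the head towards the tail; reads are served only by the tail, so they
# always observe committed state.

class ChainNode:
    def __init__(self, successor=None):
        self.store = {}              # key -> (sequence, value)
        self.successor = successor   # next node towards the tail

    def write(self, key, value, seq):
        """Apply a sequenced write locally, then forward it down the chain.
        The tail (no successor) commits and would ACK the client."""
        self.store[key] = (seq, value)
        if self.successor is not None:
            return self.successor.write(key, value, seq)
        return "ack"                 # tail reached: update is committed

    def read(self, key):
        """Only the tail serves reads, so the result is always committed."""
        assert self.successor is None, "only the tail serves reads"
        return self.store[key][1]

# Wire up a three-node chain: head -> replica -> tail
tail = ChainNode()
head = ChainNode(successor=ChainNode(successor=tail))
head.write("x", 42, seq=1)   # client submits to the head
print(tail.read("x"))        # client reads from the tail -> 42
```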

Fail-over scenarios are simple to understand, and do not suffer the complexity found in Raft:

  • If the head node fails, the configuration manager will notice and take it out of service. The next replica in the chain becomes the new head, and the configuration of the chain is updated.
  • If a replica node fails, the configuration manager will again notice and take it out of service, reconfiguring the chain to bypass the failed replica. After reconfiguration, the node on the tail side of the gap must verify with its new predecessor (on the head side) which writes it is missing and apply them; the write sequence numbers can be used for this.
  • If the tail node fails, the configuration manager will once again notice and take it out of service. The chain is reconfigured so that the tail's predecessor becomes the new tail node.
  • During a node failure, all nodes must keep retrying sends to the next node down the chain until they receive an updated configuration from the configuration manager.
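The catch-up step after a replica is removed can be sketched using the write sequence numbers. This is a hypothetical illustration; the function names are not from the paper:

```python
# After reconfiguration, the node on the tail side of the removed replica
# asks its new predecessor which sequenced writes it is missing, then
# applies them in order. Logs are modelled as {sequence: write} dicts.

def missing_writes(predecessor_log, local_log):
    """Return the sequenced writes the successor has not yet applied."""
    last_seen = max(local_log, default=0)
    return {seq: w for seq, w in predecessor_log.items() if seq > last_seen}

def catch_up(predecessor_log, local_log):
    """Apply the missing writes in sequence order."""
    for seq in sorted(missing_writes(predecessor_log, local_log)):
        local_log[seq] = predecessor_log[seq]
    return local_log
```

Because writes carry a monotonically increasing sequence number, "what am I missing?" reduces to comparing the highest sequence each node has seen.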

In terms of the clients:

  • If the client submits an update to the head, and it crashes without submitting it further, then the client will not receive an acknowledgement. The client must then retry the request.
  • If the client submits an update to the head, and it crashes after submitting it further, then the client will receive an acknowledgement from the tail. The client has nothing further to do.
  • If the client submits an update to the head, and the tail crashes before acking, then the client will not receive an acknowledgement from the tail. The chain will be reconfigured and a new tail selected. The client must retry the request; duplicate suppression (based on the client's sequence numbers) will cause the repeated write to be ignored. (Open question: how does the client receive a write acknowledgement in this scenario?)
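Duplicate suppression at the head could be sketched as follows, with the client tagging each request with its own id. This is an assumed mechanism for illustration, not a protocol detail from the paper:

```python
# Hypothetical duplicate suppression: the head remembers which
# (client_id, request_id) pairs it has already applied, so a client
# retry of an already-applied write is ignored rather than re-executed.

class Head:
    def __init__(self):
        self.applied = set()   # (client_id, request_id) pairs seen so far
        self.store = {}

    def write(self, client_id, request_id, key, value):
        if (client_id, request_id) in self.applied:
            return "duplicate-ignored"   # retry of an applied write
        self.applied.add((client_id, request_id))
        self.store[key] = value          # would also forward down the chain
        return "ack"

head = Head()
assert head.write("c1", 1, "x", "v") == "ack"
# Client times out and retries with the same request id:
assert head.write("c1", 1, "x", "v") == "duplicate-ignored"
```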

CRAQ

Chain Replication provides good performance, but the tail can become a hotspot. CRAQ (Chain Replication with Apportioned Queries) seeks to improve throughput by allowing reads from any node but the tail. Much like Chain Replication, CRAQ requires an external configuration manager to monitor node health and layout.

The CRAQ protocol differs from Chain Replication by introducing clean and dirty states on top of versioned objects, and by having the head node respond to a write only after the acknowledgement has made its way back from the tail to the head.

The write protocol operates as follows:

  • each node keeps a list of versions (with the version number monotonically increasing) of each object; each version can be clean or dirty
  • when the head node receives a write request, it adds that version to its local list of versions of the key, marks it dirty, and forwards the update down the chain towards the tail
  • once the new version reaches the tail, the tail accepts the new version and marks it clean. The version is now committed, and the tail sends ACK back up the chain to the head.
  • as the ACK propagates from the tail towards the head, each node marks its local version clean

The read protocol operates as follows (note the tail cannot be read from):

  • if the latest version of the requested object is clean, it is returned immediately
  • if the latest version of the requested object is dirty, the node will reach out to the tail to get the latest version, and then return that.
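The clean/dirty version handling above can be sketched as follows. This is an illustrative model only; the class and method names are assumptions:

```python
# Sketch of a CRAQ node: each key maps to a list of versioned entries.
# A read of a clean latest version returns immediately; a dirty latest
# version forces a query to the tail for the committed value.

class CraqNode:
    def __init__(self, tail=None):
        self.versions = {}   # key -> list of [version, value, clean]
        self.tail = tail     # None if this node *is* the tail

    def apply_write(self, key, version, value):
        clean = self.tail is None        # the tail commits on receipt
        self.versions.setdefault(key, []).append([version, value, clean])
        return clean

    def mark_clean(self, key, version):
        """Called as the tail's ACK propagates back towards the head."""
        for entry in self.versions[key]:
            if entry[0] == version:
                entry[2] = True

    def read(self, key):
        latest = self.versions[key][-1]
        if latest[2]:                    # clean: return immediately
            return latest[1]
        # dirty: ask the tail for the committed version
        return self.tail.versions[key][-1][1]
```

Running the race-condition example from below through this model shows the behaviour: while version 3 is dirty at a replica, a read returns the tail's committed version 2; once the tail commits version 3, reads return it.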

Note that race conditions can happen with CRAQ, for example:

  • Object A gets written to the head, gets versioned as version 3
  • Head submits the write to replica A, which sets the version to 3 and marks it dirty
  • Client requests object A from replica A.
  • Replica A requests the committed version from the tail; the tail reports version 2, which replica A returns to the client
  • Tail commits version 3, which is now clean

This is generally acceptable, however, since CRAQ ensures that reads are monotonic across the whole chain. In other words, a read can never return an earlier version than one already observed, only the same or a newer version, regardless of which nodes served the reads.

The CRAQ paper suggests an improvement that could be added: using multicast to update nodes on write requests. In this case the protocol differs slightly - the changes are multicast, and then a new metadata event is sent down the chain to ensure that all nodes are aware of the update. If a node has no knowledge of the update (i.e. the multicast message was not delivered to it), the node can request the data from its predecessor in the chain and then continue sending the metadata message down the chain. Once the metadata message has reached the tail, a multicast ACK is sent by the tail to commit the update.
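The recovery step of the multicast variant (a node that missed the multicast pulls the value from its predecessor before forwarding the metadata message) might be sketched as follows; the function and store names are illustrative:

```python
# Sketch: values travel via multicast, while a small metadata message
# still walks the chain. Stores are modelled as {key: (version, value)}.

def handle_metadata(node_store, predecessor_store, key, version):
    """Ensure this node holds (key, version) before forwarding metadata."""
    if node_store.get(key, (None, None))[0] != version:
        # the multicast was lost here: fetch the value from the predecessor
        node_store[key] = predecessor_store[key]
    return node_store[key]
```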

Hermes

Hermes evolves the approach of CRAQ and multicast to deliver further improvements to both read and write throughput. One of the biggest differences is that in Hermes the head and tail nodes are collapsed into a single node called the coordinator. To improve throughput, Hermes allows parallel writes to different keys, although writes to the same key are performed sequentially.

In the paper, the authors show Hermes responding to 400 million requests per second at ~45µs at the 99th percentile with a 95% read to 5% write ratio. This compares with CRAQ at 150µs at the 99th percentile for the same 400 million requests per second and read/write ratio. Note that testing for both CRAQ and Hermes was performed using Mellanox RDMA hardware.

Notes:

  • an external membership service is responsible for node health and layout, much like in CRAQ and Chain Replication
  • each versioned object can be in one of the following states: Valid, Write, Invalid, Replay and Trans (as in transient).
  • each object is versioned with a Lamport clock (a logical timestamp plus a node id)
  • Hermes also supports Read-Modify-Write (RMW) operations; this is not discussed here - please refer to the paper for more detail. The RMW support allows Hermes to support Replicated State Machines
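The Lamport timestamps used for versioning can be modelled as a (logical clock, node id) pair; comparing them lexicographically gives a total order over writes, even concurrent ones from different nodes. A small sketch with illustrative values:

```python
# Hermes timestamps as (logical_clock, node_id) tuples. Python tuples
# compare lexicographically, which is exactly the order needed: the
# higher clock wins, and the node id breaks ties between concurrent writes.

def next_timestamp(local_ts, node_id):
    """Bump the logical clock and tag the write with the node's id."""
    clock, _ = local_ts
    return (clock + 1, node_id)

assert next_timestamp((3, 0), 1) == (4, 1)
assert (4, 1) > (3, 2)   # higher clock wins
assert (4, 2) > (4, 1)   # tie on clock: node id breaks it
```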

The Hermes protocol operates as follows for writes:

  • when writing, the client submits the write to any node
  • the receiving node assumes the role of coordinator and confirms that the key to be written is currently in the Valid state. If it is not, the write stalls until it is. If it is, the node increments the Lamport clock for the key's version, appends its node id, and assigns the result as the write's timestamp.
  • the coordinator node then broadcasts an invalidation message to the other replicas. This contains the key, new value and the timestamp. The coordinator node then sets the key state to Write.
  • the follower nodes then receive the invalidation message from the coordinator. The follower checks the timestamp received against its local timestamp of the key. If the timestamp is higher than the local timestamp:
    • the key is moved to the Invalid state, or to the Trans state if it was in the Write or Replay state
    • the follower then updates the key to the new value and updates the local timestamp to the received timestamp
  • the follower then sends back an ACK with the timestamp received in the invalidation message.
  • once the coordinator receives the ACK messages from all live nodes (as defined by the membership service), it transitions the key to the Valid state unless the key was in the Trans state, in which case it is set to the Invalid state.
  • if the coordinator transitioned to the Valid state, it broadcasts a validated message to the followers. This message includes the current timestamp of the key.
  • the followers then receive the validated message and transition the key to Valid if the timestamp matches their local timestamp for the key. If it does not match, the message is ignored.

The Hermes read protocol is as follows:

  • the client submits the read request to any node
  • the node responds immediately if the key state is Valid, otherwise the read is stalled until the key becomes Valid
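Putting the write and read flows together, a heavily simplified, single-threaded sketch of a Hermes node might look like this. Stalling, the Replay and Trans states, and failure handling are all omitted, and every name here is an assumption rather than the paper's code:

```python
# Simplified Hermes write: the coordinator invalidates all replicas,
# collects ACKs, then validates. Keys carry a (clock, node_id) timestamp.

VALID, INVALID, WRITE = "Valid", "Invalid", "Write"

class HermesNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.keys = {}   # key -> {"state", "ts", "value"}

    def _entry(self, key):
        return self.keys.setdefault(
            key, {"state": VALID, "ts": (0, 0), "value": None})

    def coordinate_write(self, key, value, replicas):
        entry = self._entry(key)
        assert entry["state"] == VALID, "write stalls until key is Valid"
        ts = (entry["ts"][0] + 1, self.node_id)   # Lamport clock + node id
        entry.update(state=WRITE, ts=ts, value=value)
        # broadcast invalidations and collect ACKs from all live replicas
        acks = [r.invalidate(key, value, ts) for r in replicas]
        if all(a == ts for a in acks):
            entry["state"] = VALID
            for r in replicas:                    # broadcast "validated"
                r.validate(key, ts)
        return ts

    def invalidate(self, key, value, ts):
        entry = self._entry(key)
        if ts > entry["ts"]:                      # newer timestamp wins
            entry.update(state=INVALID, ts=ts, value=value)
        return ts                                 # ACK echoes the timestamp

    def validate(self, key, ts):
        entry = self.keys[key]
        if entry["ts"] == ts:                     # stale messages ignored
            entry["state"] = VALID

    def read(self, key):
        entry = self.keys[key]
        assert entry["state"] == VALID, "read stalls until key is Valid"
        return entry["value"]

# Three-node example: node 1 coordinates a write, nodes 2 and 3 follow
a, b, c = HermesNode(1), HermesNode(2), HermesNode(3)
a.coordinate_write("k", "v1", [b, c])
```

Note how any node can coordinate: there is no fixed head or tail, which is what lets Hermes serve local reads from every replica once a key is Valid.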

Change log

  • Added 20 December 2020
  • Updated 27 December 2020 - migrated note to article, added details on Chain Replication
  • Updated 28 December 2020 - Added details on CRAQ and Hermes

References

Robbert van Renesse and Fred B. Schneider. 2004. Chain replication for supporting high throughput and availability. In Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI'04). USENIX Association, USA, 7. doi: 10.5555/1251254.1251261
Jeff Terrace and Michael J. Freedman. 2009. Object storage on CRAQ: High-throughput chain replication for read-mostly workloads. In Proceedings of the 2009 USENIX Annual Technical Conference (USENIX ATC '09). USENIX Association, USA.
Antonios Katsarakis, Vasilis Gavrielatos, M.R. Siavash Katebzadeh, Arpit Joshi, Aleksandar Dragojevic, Boris Grot, and Vijay Nagarajan. 2020. Hermes: A Fast, Fault-Tolerant and Linearizable Replication Protocol. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 201–217. doi: 10.1145/3373376.3378496
© 2009-2021 Shaun Laurens. All Rights Reserved.