Eider Retrospective

Lessons learnt building Eider, and some considerations for a version 2

As a part of building the samples from Aeron Cookbook, I needed to demonstrate the approach of using off heap data structures within an Aeron Cluster. There is no open source solution that I'm aware of that I could use, so out of necessity I built Eider.

What is Eider, and what problems does it solve?

When building systems for Aeron Cluster with strict performance requirements, you are going to find yourself in a battle against Little's Law 1. Little's Law gives us an upper bound on the performance of Aeron Cluster - because Replicated State Machines are single threaded, the throughput of an Aeron Cluster is predominantly determined by how fast the logic is within the clustered service. One approach to reduce the processing time down as far as possible is to switch to off heap data structures, and to make use of cluster codecs which do not require serialization or deserialization.

There are four areas in which using efficient codecs and off heap data structures can help increase throughput:

  • cluster codecs for inbound commands and outbound events,
  • the domain model holding state within the cluster,
  • the snapshot model, which holds the snapshot format and is used to store snapshots of the cluster state,
  • the replication model, which allows parts of the cluster state to be replicated to cluster clients. Not all scenarios require replication, so this is optional.
Inbound CommandsOutbound EventsReplication MediumSnapshot MediumFlyweightFlyweightDomain Model atop DirectBuffers or plain POJOSnapshotReplication

With the first version of Eider, the following features were supported:

  • support for cluster codecs for messaging, off heap domain model repositories and an efficient snapshot model
  • Eider operates as an annotation processor, working off annotated plain old Java objects
  • easy integration with Aeron (demuxers to guide the inbound messages through switch statements to code processing the messages) and Aeron Cluster (snapshot/load from image)

Within the domain model, the off heap repositories support:

  • limited allocation - limited to the indexes, which is in turn driven by the data
  • allows indexes on any number of field
  • sequences, using Atomic getAndAdd operations
  • transactions, with support for dirty reads and rollback
  • fixed size buffer and records

Types are limited to those useful within a high performance financial system:

  • boolean
  • short
  • int
  • long
  • fixed length ASCII strings.

A sample object for delivering both a message, and a repository is:

@EiderRepository(name = "RfqsRepository")
@EiderSpec(name = "RfqFlyweight")
public class Rfq
{
    @EiderAttribute(key = true)
    private int id;
    private short state;
    private long creationTime;
    private long expiryTime;
    private long lastUpdate;
    private int lastUpdateUser;
    @EiderAttribute(indexed = true)
    private int requester;
    @EiderAttribute(indexed = true)
    private int responder;
    private int securityId;
    private int requesterCorrelationId;
    private short side;
    private long quantity;
    private long lastPrice;
    @EiderAttribute(indexed = true)
    private long clusterSession;
}

From this input object, Eider's annotation processor generated message codecs and repositories with decent performance characteristics. For the message codecs, round trip time (which is encoding to buffer, and reading from a buffer) is 162 nanoseconds per operation at 99.9th percentile. During the tests, the memory allocations averaged 26kb a second, which is minimal given the 3.16 million executions of the benchmark.

Benchmark                           Mode      Cnt       Score    Error    Units
                                  sample  3163310      55.119 ±  0.820    ns/op
roundtrip·p0.00                   sample                1.000             ns/op
roundtrip·p0.50                   sample               47.000             ns/op
roundtrip·p0.90                   sample               52.000             ns/op
roundtrip·p0.95                   sample               54.000             ns/op
roundtrip·p0.99                   sample               56.000             ns/op
roundtrip·p0.999                  sample              161.689             ns/op
roundtrip·p0.9999                 sample            19562.810             ns/op
roundtrip·p1.00                   sample           280064.000             ns/op
·gc.alloc.rate                    sample       15       0.025 ±  0.001    MB/sec
·gc.alloc.rate.norm               sample       15       0.001 ±  0.001    B/op
·gc.count                         sample       150             counts

Within the repository, performance remains similar with the creation of a new item in the repository taking around 2.9 microseconds at the 99.9th percentile. Allocation rates were a bit higher, at 16MB/sec. This higher allocation is largely driven by the indexing functionality as the repository is indexed on 3 fields.

Benchmark                           Mode      Cnt       Score    Error   Units
                                  sample  1323481       0.441 ±  0.295   us/op
createRfq·p0.00                   sample                0.025            us/op
createRfq·p0.50                   sample                0.059            us/op
createRfq·p0.90                   sample                0.071            us/op
createRfq·p0.95                   sample                0.075            us/op
createRfq·p0.99                   sample                0.085            us/op
createRfq·p0.999                  sample                2.936            us/op
createRfq·p0.9999                 sample               26.890            us/op
createRfq·p1.00                   sample            72744.960            us/op
·gc.alloc.rate                    sample        5      16.466 ±  0.191   MB/sec
·gc.alloc.rate.norm               sample        5       0.562 ±  0.022   B/op
·gc.churn.G1_Eden_Space           sample        5       0.228 ±  0.003   MB/sec
·gc.churn.G1_Eden_Space.norm      sample        5       0.008 ±  0.001   B/op
·gc.churn.G1_Old_Gen              sample        5       6.831 ±  0.083   MB/sec
·gc.churn.G1_Old_Gen.norm         sample        5       0.233 ±  0.009   B/op
·gc.churn.G1_Survivor_Space       sample        5       0.135 ±  0.003   MB/sec
·gc.churn.G1_Survivor_Space.norm  sample        5       0.005 ±  0.001   B/op
·gc.count                         sample        5      35.000            counts
·gc.time                          sample        5      95.000            ms

Downsides of the Eider approach:

  • using an annotation processor can be painful with both Gradle and Maven. External tools, like the sbe-tool, are simpler to use
  • if you want to make use of both SBE and Eider - for example, by mixing SBE for messaging and snapshotting, but keeping Eider for the repository, then the integration between the two is very poor. For example, there is no mechanism to share enums.
  • there is no easy way to replicate changes to cluster clients that is built into Eider. You could of course just submit the updates to the repository in realtime to the any clients, but this can be suboptimal in some cases (for example with late joining clients - where does the initial state come from? This could be via an Aeron Archive, but this then requires a full replay of all changes to the repository).
  • transactional support requires copying the entire buffer - this can be expensive with large underlying buffers
  • Eider has no versioning support, which isn't ideal as it isn't always possible or desirable to deploy components in one go
  • indexes require allocations as new data elements are indexed
  • Strings are used directly, resulting in a large amount of allocations when Eider is working with a fixed length ASCII field.

Eider 2.0 improvements

If I were to do build an Eider 2.0, I'd prioritize the following changes:

  • Remove the annotation processor, and switch to an XML based configuration and command line tool for codec and repository generation. This would function in a similar mechanism to sbe-tool. I've taken a look at a number of other approaches (including JSON, HOCON, YAML, TOML, Hashicorp HCL and a custom DSL) for describing repositories and codecs, and none are as efficient or as clear as XML.
  • I'd switch the focus of Eider onto Repositories plus a library to efficiently replicate them. Messaging and Snapshot codecs would still be supported, but the focus would be on Repositories
  • I'd move to a more efficient approach for indexing which would remove any need for runtime allocation, but still support dynamic data. This would - like the repositories themselves - need to be presized upon creation
  • Strings would be rendered on demand only, with a CharSet of similar interal representation. This would reduce allocation, although it would be unavoidable once rendered to a String object
  • Add support for deleting items in the repository. This could be either with soft deletes + on demand vacuum or immediate deletes
  • Improve indexes, allowing for both ordered and unordered indexes along with the necessary allocation free iterators.
  • Ability to register for callbacks that provide allocation percentage warnings, for example, a callback informing the code that the repository has reached 95% of available capacity
  • Efficient transactions. I still believe dirty reads are the most appropriate mechanism, however, only modified items should be copied to the rollback log. This would significantly reduce the cost of running transactional repositories.
  • Optional debug instrumentation to simplify solving some problems developers face with flyweight driven repositories. A typical example of this is accidental flyweight reuse.
  • Optional use of Unsafe. This would allow for both on heap and off heap data structures. Some code (for example sequences) would need buffer implementation specific logic, but that's easy to do.
  • Introduce enums, similar to Simple Binary Encoding. They would still be based off of a char/short/int/long, but would present the developer with a friendlier interface.
  • Introduce versioning for codecs, with specific rules around forwards and backwards compatibility
  • Within Snapshots, introduce application provided metadata, plus hooks to read it before applying the snapshot. This is useful for things such as ensuring that the exact same code (as measured with a Git SHA-1) is reading as wrote the snapshot.
  • Introduce an intermediate representation, allowing for on the fly decoding, which can be useful for operational tooling

The features I don't see changing are:

  • single threaded access only
  • the type restrictions. There's not much need for more complex types in a typical financial system
  • use of Agrona DirectBuffers

Footnotes

  1. See Aeron Cookbook on Cluster Performance Limits

Metadata
--- Views
status
final
review policy
one off
published
2020-11-29

last updated
2020-11-29
Topics
Distributed Systems

© 2009-2023 Shaun Laurens. All Rights Reserved.