Kafka vs. Aeron
Overview
There is some confusion about the similarities and differences between Kafka and Aeron. This article aims to answer the most common questions.
| Item | Kafka | Aeron |
|---|---|---|
| Simple producer-consumer flow | Kafka built in | Aeron |
| Interprocess flows on same host | Remote via Kafka | Aeron IPC |
| One-to-many flow | Kafka built in | Aeron (multicast or MDC) |
| Many-to-one flow | Kafka built in | Aeron |
| Many-to-many flows | Kafka built in | Aeron (multicast or MDC as needed) |
| Historical replay | Kafka built in | Aeron Archive |
| Resilience to server failures | Kafka cluster + ZooKeeper cluster | Aeron Archive [1] or Aeron Cluster |
| Message ordering | Optional within same producer session on a partition [2] | Guaranteed within same session on a channel+stream [3] |
| Global total ordering | Not possible | Aeron Cluster |
| Stream processing | Custom code + Kafka Streams | Custom code + Aeron Archive/Aeron Cluster |
| Producer style | Async or sync (with partial or full cluster ack) | Sync (to local log buffer or Aeron Cluster majority) |
| Consumer style | Poll (with max blocking duration) | Poll (with max number of items) |
| Optimised for | Small messages (~1 KB) | Small messages (1 KB or less) |
| Throughput | Millions of messages/second | Millions of messages/second |
| Latency at 99.9th percentile | Sub-second, likely hundreds of milliseconds | Aeron IPC: under 1 µs [4]; Aeron [5]/Aeron Archive [6]: 4-5 µs; Aeron Cluster: under 100 µs [7] |
| Batching support | Time- and/or size-based [8] | Natural batching [9] |
| Largest message size | 1 MB+ [10] | 16 MB [11] |
| Production deployment (excl. producers and consumers) | 5-8+ servers | 0-3+ servers (Aeron Cluster is best with a minimum of 3) |
| Zero-copy capable on hot path | No | Yes [12] |
Architecture
In the scenario where we want a simple producer-consumer flow which can deliver the absolute maximum performance in a production ready environment, the logical deployment architectures would be as follows:
Kafka Architecture
This deployment includes:
- The producer and consumer processes
- A 3 node Kafka cluster
- A 5 node Zookeeper cluster to support the Kafka cluster
As of Kafka version 2.8.0, Kafka includes a preview of the self-managed (KRaft) quorum mode. For production use, ZooKeeper is still recommended, so you must deploy both the ZooKeeper and Kafka clusters.
Aeron Architecture
This deployment includes:
- The producer and consumer processes
This comparison is a bit unfair though. Kafka can do more, and can offer a higher degree of reliability - for example, the producer and consumer processes are able to run at different times, with minimal impact on each individual process's flow. Aeron can also achieve this, via Aeron Archive.
If we wanted the Producer to be able to write without the Consumer, or to allow the Consumer to read historical data (e.g. last 7 days or similar), we could swap them over to Aeron Archive:
Again, the deployment is limited to the Producer and Consumer. The performance of an archive replay is limited by disk performance. Once historical data has been replayed, the live data stream performance is close to theoretical network speed limits.
Aeron Archive does not, however, deal with server failure on its own. You could easily replicate the data to a backup server. Alternatively, you could swap to Aeron Cluster. Normally, you would deploy business logic inside Aeron Cluster, but there are cases where it is valid to use it purely as a sequencer, in which case Aeron Cluster provides resilient Global Total Ordering with many producers and consumers. Aeron Cluster includes the equivalent functionality of ZooKeeper, so no additional cluster is needed.
Performance for Aeron Cluster remains excellent, with latencies under heavy load on appropriate hardware well under 100 microseconds at the 99th percentile.
Producers
This section looks at the similarities and differences between Kafka and Aeron when used via a Producer process.
Producers in Kafka
Kafka supports both synchronous and asynchronous scenarios with the producer APIs. Additionally, via configuration of the API, the Kafka client can offer different levels of guarantees:

- If `acks` is set to `0`, there are no guarantees. The producer writes to the buffer and considers the message sent. This is the weakest guarantee Kafka offers, and the mode with the lowest latency.
- If `acks` is set to `1`, you're guaranteed that the broker partition leader has received the message - but not that all in-sync replicas have committed the data. This mode significantly increases latency.
- If `acks` is set to `-1` or `all`, you're guaranteed that the broker partition leader has received the message and that all in-sync replicas have accepted it. You would typically apply this along with `min.insync.replicas` set to a value > 1. This is the strongest guarantee Kafka offers, and it increases latency over `acks=1`.
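As a sketch, the strongest-guarantee mode could be configured as follows. The broker address is a placeholder; `acks` and `enable.idempotence` are real Kafka producer settings, while `min.insync.replicas` is a broker/topic setting and so does not appear here:

```java
import java.util.Properties;

//Producer-side settings for the strongest delivery guarantee.
//Note: min.insync.replicas is configured on the broker/topic, not here.
public class AcksConfigSketch
{
    public static Properties strongestAcks()
    {
        final Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); //placeholder address
        //wait for the leader and all in-sync replicas to accept each record
        props.setProperty("acks", "all");
        //idempotence avoids duplicates when the client retries internally
        props.setProperty("enable.idempotence", "true");
        return props;
    }
}
```

These `Properties` would then be passed to the `KafkaProducer` constructor as usual.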
Sending a message synchronously is a matter of producing a record, calling `send` and awaiting the future:
```java
final ProducerRecord<K, V> record = new ProducerRecord<>(topic, key, value);
final Future<RecordMetadata> future = producer.send(record);
final RecordMetadata metadata = future.get();
```
Asynchronous sends are similar, although the send operation returns as soon as the record is written to the buffer of records pending send. Note that the callback is executed on the producer's I/O thread, so take care not to perform long-running tasks in it.
```java
final ProducerRecord<K, V> record = new ProducerRecord<>(topic, key, value);
producer.send(record, new Callback()
{
    @Override
    public void onCompletion(RecordMetadata metadata, Exception e)
    {
        if (e != null)
        {
            log.debug("error sending {}", record, e);
        }
    }
});
```
In both scenarios, Kafka allows the sender to specify:

- A topic and, optionally, a key, which guides the producer client in selecting the appropriate cluster partition
- The value, which is typically an Avro byte buffer
The Kafka producer API provides back-pressure support via buffer and max block waiting time configuration. At construction time of the Kafka producer, you can override the `buffer.memory` size beyond the default, and set a `max.block.ms` value to specify how long the client may block if the broker is unable to accept messages from your connection.
If the producer is sending records faster than the Kafka broker can consume them, the broker applies back-pressure by delaying acknowledgements. Once the blocking timeout is exceeded, an exception is thrown, and the producer is required to take corrective action.
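The semantics described above can be sketched in pure Java; this is a simplified stand-in for the Kafka client's internal buffering, not the client itself:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

//Sketch of the buffer.memory / max.block.ms semantics: a send blocks for
//up to maxBlockMs waiting for buffer space, then fails so the caller can
//take corrective action. Simplified stand-in, not the Kafka client.
class BoundedSendBuffer
{
    private final Semaphore freeBytes;
    private final long maxBlockMs;

    BoundedSendBuffer(final int bufferMemory, final long maxBlockMs)
    {
        this.freeBytes = new Semaphore(bufferMemory);
        this.maxBlockMs = maxBlockMs;
    }

    /** Reserves buffer space for the record, or throws once maxBlockMs is exceeded. */
    void send(final byte[] record)
    {
        try
        {
            if (!freeBytes.tryAcquire(record.length, maxBlockMs, TimeUnit.MILLISECONDS))
            {
                throw new IllegalStateException("buffer full after max.block.ms");
            }
        }
        catch (final InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for buffer space", e);
        }
        //...hand the record to the I/O thread, which calls recordSent once delivered
    }

    /** Called by the I/O thread after the broker acknowledges the record. */
    void recordSent(final int length)
    {
        freeBytes.release(length);
    }
}
```

The exception is the caller's signal to apply its own corrective action, mirroring the real client's behaviour when `max.block.ms` is exceeded.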
The Kafka producer API is thread safe.
Publishers in Aeron
Aeron has no broker (unless you're executing the Media Driver as an external process on the local server, in which case it acts as a distributed broker) [13], and the guarantees it offers are different.
To send data to a destination in Aeron, you must first create a `Publication` [14]. By defining the publication, you're telling Aeron where to send data (host and port), and which stream to send on:
```java
//assumes there is already a Media Driver and Aeron client set up
//send to remote host 10.10.123.241 on port 12345
final String channel = "aeron:udp?endpoint=10.10.123.241:12345";
//send on stream 10; multiple streams can be carried on the same endpoint
final int streamId = 10;
final Publication pub = aeron.addPublication(channel, streamId);
```
Sending in Aeron is performed via the `Publication`, which accepts only a `DirectBuffer` as the message content:

```java
final DirectBuffer content = ...
final long result = pub.offer(content);
```
All sending in Aeron is synchronous to the local log buffer [15]. This log buffer is a memory-mapped file, typically stored on `/dev/shm`, containing all in-flight messages plus a window of historical messages to support data-loss recovery by the consumer.
Aeron does not offer a way to await an acknowledgement of the message from the remote host (the equivalent of `acks=1` or `acks=-1`/`all`), but the guarantees are stronger than those offered by Kafka's `acks=0`, since temporary network failures are recoverable.
If Aeron Cluster is used and deployed as a sequencer, you can get the approximate equivalent of `acks=-1`/`all`.
If both sides are running correctly, and the consumer is actively consuming messages, then Aeron ensures that messages are delivered to the remote host in the order sent, despite UDP message loss and out-of-order delivery.
The result returned [14] from the publication's `offer` call indicates the current state of the connection and delivery:
- A value of -1 indicates that there is no subscription connected. Subscriptions can come and go naturally, so this does not necessarily indicate an error. If there is nothing to receive the message, why send it?
- A value of -2 indicates that the offer was back-pressured. The producer is required to wait for the consumer, or to take corrective action.
- A value of -3 indicates that an admin action was underway, because the log buffer terms were being rotated at that moment. The application should attempt the offer again.
- A value of -4 indicates that the publication is closed and unable to accept data. This is typically because something else in the application closed the publication.
- A value of -5 indicates that the offer failed because the log buffer reached the maximum position of the stream, given the term buffer length multiplied by three. When this happens, it is suggested the term buffer size be increased and/or the message size decreased.
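A typical producer wraps `offer` in a retry loop keyed off these result codes. A sketch of that pattern, where `offerOnce` is a hypothetical stand-in for `publication.offer(buffer)` so the example is self-contained:

```java
import java.util.function.LongSupplier;

//Sketch of a retry policy over the offer result codes listed above.
//The constants mirror Aeron's semantics; offerOnce stands in for
//publication.offer(buffer).
class OfferRetrySketch
{
    static final long NOT_CONNECTED = -1;
    static final long BACK_PRESSURED = -2;
    static final long ADMIN_ACTION = -3;

    /**
     * Retries transient results (back-pressure, admin action) up to
     * maxAttempts; returns the new stream position on success, or the
     * last result code on failure.
     */
    static long offerWithRetry(final LongSupplier offerOnce, final int maxAttempts)
    {
        long result = NOT_CONNECTED;
        for (int i = 0; i < maxAttempts; i++)
        {
            result = offerOnce.getAsLong();
            if (result >= 0)
            {
                return result; //new stream position: success
            }
            if (result == BACK_PRESSURED || result == ADMIN_ACTION)
            {
                Thread.onSpinWait(); //transient: idle briefly and retry
                continue;
            }
            //not connected, closed, or max position exceeded:
            //this policy gives up and lets the caller decide
            return result;
        }
        return result;
    }
}
```

How long to spin, and whether "not connected" is retried or ignored, is an application-level policy decision.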
Aeron offers users both a thread-safe (`ConcurrentPublication`) and a non-thread-safe (`ExclusivePublication`) API. The `ExclusivePublication` offers slightly improved latency since all locking code is removed.
Consumers
The mechanisms for building consumers are logically similar across Kafka, Aeron and Aeron Archive. Aeron Cluster is, however, fairly different.
Consumers in Kafka
```java
//define the connection
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "test");
props.setProperty("enable.auto.commit", "true");
props.setProperty("auto.commit.interval.ms", "1000");
props.setProperty("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.setProperty("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
//connect
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
//subscribe to topics 'topic-a' and 'topic-b'
consumer.subscribe(Arrays.asList("topic-a", "topic-b"));
//poll the client for new data, waiting (and thus blocking) for a
//maximum of 100ms between each poll
while (true)
{
    ConsumerRecords<String, String> records =
        consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records)
    {
        System.out.printf("offset = %d, key = %s, value = %s%n",
            record.offset(), record.key(), record.value());
    }
}
```
The core is the `consumer.poll` call, which fetches new records for the subscribed topics. Each poll resumes from the current offset, so data is supplied sequentially.
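The offset bookkeeping behind this can be sketched as follows. This is a simplification for illustration; real Kafka consumers commit offsets back to the broker, either via auto-commit (as configured above) or explicitly with `commitSync`/`commitAsync`:

```java
import java.util.HashMap;
import java.util.Map;

//Simplified sketch of at-least-once offset bookkeeping: each poll resumes
//from the last committed offset per partition, and an offset is committed
//only after its record has been fully processed.
class OffsetTracker
{
    private final Map<Integer, Long> committed = new HashMap<>();

    /** The offset the next poll should start from for this partition. */
    long nextOffset(final int partition)
    {
        return committed.getOrDefault(partition, 0L);
    }

    /** Commit after processing: the next poll starts one past this offset. */
    void commit(final int partition, final long processedOffset)
    {
        committed.put(partition, processedOffset + 1);
    }
}
```

Committing only after processing gives at-least-once semantics: a crash between processing and commit causes a replay, never a gap.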
Subscribers in Aeron and Aeron Archive
```java
//listen on localhost port 12345
final String channel = "aeron:udp?endpoint=localhost:12345";
//listen on stream 10
final int streamId = 10;
//set up the subscription; assumes Aeron and Media Driver available
final Subscription subscription = aeron.addSubscription(channel, streamId);

//dutyCycle would typically be in an Agrona Agent and be managed
//by an IdleStrategy; this is a rough idea of what's happening:
public void dutyCycle()
{
    while (true)
    {
        //idle only when no fragments were processed; while the
        //subscription is producing data, stay in a tight loop
        if (doWork() <= 0)
        {
            Thread.onSpinWait(); //or similar
        }
    }
}

public int doWork()
{
    //poll up to 10 fragments at a time
    return subscription.poll(this::handler, 10);
}

private void handler(DirectBuffer buffer, int offset, int length,
    Header header)
{
    //consume the message
}
```
As you can see, the polling concept is common to Kafka, Aeron and Aeron Archive. Aeron does not block the poll, however - the code can optionally idle, if that is required for performance and resource-management reasons. Agrona agents that host one or more `Subscription` objects can be scheduled on dedicated or shared threads as needed. Another subtle difference is that Aeron polls for a maximum number of items, while Kafka polls for a maximum amount of time.
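The duty-cycle pattern can be sketched without Agrona; the `Agent` interface and idle call here are simplified stand-ins for Agrona's `Agent` and `IdleStrategy`:

```java
import java.util.concurrent.atomic.AtomicBoolean;

//Simplified stand-in for Agrona's Agent/IdleStrategy duty cycle:
//stay hot while doWork() reports progress, idle only on empty polls.
class DutyCycleSketch
{
    interface Agent
    {
        //returns the number of work items processed this cycle
        int doWork();
    }

    /** Runs the duty cycle until running is cleared; returns total work done. */
    static long run(final Agent agent, final AtomicBoolean running)
    {
        long workDone = 0;
        while (running.get())
        {
            final int w = agent.doWork();
            if (w > 0)
            {
                workDone += w; //progress: poll again immediately
            }
            else
            {
                Thread.onSpinWait(); //no progress: idle (spin/yield/park in Agrona)
            }
        }
        return workDone;
    }
}
```

Agrona's real `IdleStrategy` implementations (busy-spin, back-off, yielding) generalise the `Thread.onSpinWait()` call, trading CPU usage against wake-up latency.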
Both can offer more advanced capabilities:

- Read from a given offset (limited in Aeron's case to Aeron Archive clients). Note that offsets in Kafka refer to the message number, while in Aeron offsets refer to a position within a stream
- Commit/rollback a read (requires use of `controlledPoll` in Aeron)
- Kafka allows the creation of Consumer Groups, which allow efficient scale-out of processing
- Multiple producers can send to a single Aeron Subscription; each producer's messages can be individually identified via its session (provided via the `Header` parameter)
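Session-based demultiplexing on the subscriber side might be sketched like this; in real Aeron code the session id would come from `header.sessionId()`, and the handler type here is a simplification:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

//Sketch of per-session demultiplexing: each fragment carries the session
//id of the publication that sent it, so one subscription can route
//messages from many producers to per-producer handlers.
class SessionDemuxSketch
{
    private final Map<Integer, Consumer<byte[]>> handlersBySession = new HashMap<>();

    void register(final int sessionId, final Consumer<byte[]> handler)
    {
        handlersBySession.put(sessionId, handler);
    }

    /** Routes a fragment to its session's handler; returns false if unknown. */
    boolean onFragment(final int sessionId, final byte[] message)
    {
        final Consumer<byte[]> handler = handlersBySession.get(sessionId);
        if (handler == null)
        {
            return false;
        }
        handler.accept(message);
        return true;
    }
}
```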
Aeron Cluster
Aeron Cluster operates a Replicated State Machine, which requires that messages are delivered sequentially. As a result, you cannot poll, and cannot request more than one message at a time. Under the covers, Aeron batches the messages on your behalf, so there is little need to worry about throughput in this scenario.
Consuming a message in Aeron Cluster is a matter of wiring up the cluster client and processing the `onSessionMessage` events:
```java
@Override
public void onSessionMessage(ClientSession session, long timestamp,
    DirectBuffer buffer, int offset,
    int length, Header header)
{
    //consume the message
}
```
Change Log
- Updated 29 November 2020 to include additional clarity around getting low latency with Aeron Archive, and fixed footnotes within the comparison table.
- Updated 11 February 2021 with clarification around offset differences between Aeron and Kafka
- Updated 21 June 2021 with note around Kafka self hosted quorum in Kafka 2.8.0
Footnotes
1. To get reliable Aeron Archive on the consumer side, without any double processing, you'd need to store the consumed offset and replay from that offset. To achieve reliability on the producer side, you would need to replicate the archive in real time to backup server(s), or store the data on a remote high-speed storage array. Aeron has built-in support for real-time replication; see Aeron Cookbook Replication Sample
2. This requires that `max.in.flight.requests.per.connection` is changed from its default value to 1
3. See Impact on ordering in Aeron Cookbook - Channels, Streams and Sessions
4. To achieve performance of under 1 microsecond on Aeron IPC, you will need to use a very high performance codec such as Simple Binary Encoding, keep messages small (roughly 32 bytes or fewer), use `tryClaim`, and tune the duty cycles of Aeron and the Agrona agents carefully. See Aeron Cookbook on Agents & Idle Strategies for details, and Aeron Cookbook - Aeron IPC for an example that sends 4-byte messages via IPC at a rate of ±625 ns/message
5. To achieve maximum performance with Aeron, you would need Solarflare cards, the premium ef_vi transport bindings, high-performance network switches and dedicated hardware. You would also need to tune the Linux operating system, carefully select the NUMA layout, and pin core Aeron threads to specific CPU cores. Aeron operates without issue in the cloud, but latency drops off to tens of microseconds or worse, depending on the server and network configuration, host and network utilization, rack distances, etc.
6. For Aeron Archive to reach this latency, it would need to be streaming live. Replay from disk will be slower, and will depend on the hardware used
7. For Aeron Cluster to achieve low latency, you would need to run a 3-node cluster, only commit the log to page memory (i.e. not force disk writes), and use hardware in line with footnote 5 above
8. See `linger.ms` for time-based batching configuration and `batch.size` settings in the Producer, and `max.poll.records` for the Consumer
9. Aeron automatically batches data sent by producers based upon current throughput and process activity. For consumers, the maximum batch size can be controlled via the poll call, but the actual batch size depends on message availability. Under the covers, the Aeron Subscription performs Natural Batching as well
10. Can be changed by overriding configuration settings on the Topic, Broker, Consumer and Producer. See `message.max.bytes`, `replica.fetch.max.bytes`, `max.request.size`, `fetch.message.max.bytes`. You will also likely need to tune compression, linger and buffer memory settings, and may have to alter JVM settings
11. Controlled via configuration for each channel, or globally. See Aeron Cookbook
12. To run with zero copying, you would be restricted to sending messages smaller than the `maxPayloadLength`, which is derived from the UDP MTU. See Aeron Cookbook - Publications - Try Claim
13. The Media Driver and its role with Aeron is described at Aeron Cookbook - Media Driver
14. Publications are described in depth at Aeron Cookbook - Publications
15. Log Buffers, and how they are used by Aeron, are described at Aeron Cookbook - Log Buffers
First published: 15 November 2020
Last updated: 21 June 2021