Apache Kafka is an open-source publish-subscribe messaging system, often described as a distributed event log where all new records are immutable and appended to the end of the log. Kafka differs from other messaging systems by persisting messages on disk with a configurable retention policy, making it a hybrid between a messaging system and a database.

The main concepts behind Kafka include producers who send messages to different topics, and consumers who consume these messages while maintaining their position in the stream of data. Kafka aims to provide a reliable and high-throughput platform for handling real-time data streams and building data pipelines. It also serves as a single place for storing and distributing events that can be fed into multiple downstream systems, simplifying integration complexity.

Deep dive

Before we delve into Kafka, let's start with a quick recap on what publish-subscribe messaging is. Publish-subscribe is a messaging pattern where the sender does not send data directly to a specific receiver. Instead, the publisher classifies the messages, without knowing if there are any subscribers interested in a particular type of message. Similarly, the receiver subscribes to receive a certain class of messages without knowing if there are any senders sending those messages. Pub-sub systems usually have a broker where all messages are published; this decouples publishers from subscribers and allows for greater flexibility in the type of data that subscribers want to receive. It also reduces the number of potential connections between publishers and subscribers.

A bulletin board is a good analogy for the pub-sub messaging pattern: people publish information in a central place without knowing who the recipients are. So, what is Kafka then? Apache Kafka is an open-source publish-subscribe messaging system, also often described as a distributed event log where all new records are immutable and appended to the end of the log.

In Kafka, messages are persisted on disk for a certain period, known as the retention policy. This is usually the main difference between Kafka and other messaging systems and makes Kafka, in some ways, a hybrid between a messaging system and a database. The main concepts behind Kafka are producers producing messages to different topics and consumers consuming those messages and maintaining their position in the stream of data.

You can think of producers as publishers or senders of messages; consumers, on the other hand, are analogous to the receivers or subscribers. Kafka aims to provide a reliable and high-throughput platform for handling real-time data streams and building data pipelines. It also provides a single place for storing and distributing events, which can be fed into multiple downstream systems, helping to tackle the ever-growing problem of integration complexity.

Besides all of that, Kafka can also be used to build modern and scalable ETL, change data capture (CDC), or big data ingestion systems. Kafka is used across multiple industries by companies such as Twitter, Netflix, Goldman Sachs, and PayPal. It was originally developed by LinkedIn and open-sourced in 2011.

Now, let's dive a little bit deeper into the Kafka architecture. At a high level, a typical Kafka architecture consists of a Kafka cluster, producers, and consumers. A single Kafka server within a cluster is called a broker. A Kafka cluster usually consists of at least three brokers to provide a sufficient level of redundancy. The broker is responsible for receiving messages from producers, assigning offsets, and committing messages to disk. It is also responsible for responding to consumers' fetch requests and serving messages. In Kafka, when messages are sent to a broker, they are sent to a particular topic. Topics provide a way of categorising the data being sent and can be further broken down into a number of partitions.

For example, a system might have separate topics for processing new users and for processing metrics. Each partition acts as a separate commit log, and the order of messages is guaranteed only within the same partition. Being able to split a topic into multiple partitions makes scaling easy, as each partition can be read by a separate consumer. This allows for high throughput, as both partitions and consumers can be spread across multiple servers. A minimal sketch of creating such topics is shown below.
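As a rough illustration, here is a minimal sketch of creating those two example topics with the Java AdminClient. The topic names, partition counts, replication factor, and broker address are assumptions made for the example, not anything prescribed by Kafka itself.

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Address of at least one broker in the cluster (assumed to be local here).
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Two separate topics, each split into three partitions and
            // replicated to two brokers for redundancy.
            NewTopic newUsers = new NewTopic("new-users", 3, (short) 2);
            NewTopic metrics = new NewTopic("metrics", 3, (short) 2);
            admin.createTopics(List.of(newUsers, metrics)).all().get();
        }
    }
}
```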

Producers are usually other applications producing data; this can be, for example, our application producing metrics and sending them to our Kafka cluster. Similarly, consumers are usually other applications consuming data from Kafka. As mentioned before, Kafka often acts like a central hub for all the events in a system, which makes it a perfect place to connect to if we are interested in a particular type of data. A good example would be a database that consumes and persists messages, or an Elasticsearch cluster that consumes certain events and provides full-text search capabilities for other applications. Now that we've gone through the general overview of Kafka, let's jump into the nitty-gritty details.

In Kafka, a message is a single unit of data that can be sent or received. As far as Kafka is concerned, a message is just a byte array, so the data doesn't have any special meaning to Kafka. A message can also have an optional key, also a byte array, that can be used to write data in a more controlled way to multiple partitions within the same topic.

As an example, let's assume we want to write our data to multiple partitions, as it will make the system easier to scale later. We then realise that certain messages, say all the messages for a given user, have to be written in order. If our topic has multiple partitions, there is no guarantee which messages will be written to which partitions; most likely, new messages will be written to partitions in a round-robin fashion. To avoid that situation, we can define a consistent way of choosing the same partition based on a message key. One way of doing that is as simple as using the user ID modulo the number of partitions, which always assigns the same partition to the same user, as in the sketch below.
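A hedged sketch of this idea with the Java producer follows. The topic name, partition count, and broker address are assumptions made for the example; the second send shows the simpler option of providing only a key and letting the default partitioner hash it to a consistent partition.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        int numPartitions = 3;   // assumed partition count of the topic
        long userId = 42L;       // hypothetical user id

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "User id modulo number of partitions" keeps all of a user's messages
            // in the same partition, so their relative order is preserved.
            int partition = (int) (userId % numPartitions);
            producer.send(new ProducerRecord<>(
                    "user-events", partition, String.valueOf(userId), "user updated"));

            // Alternatively, sending with only a key lets the default partitioner
            // hash the key and pick a consistent partition for us.
            producer.send(new ProducerRecord<>(
                    "user-events", String.valueOf(userId), "user updated"));
        }
    }
}
```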

Sending single messages over the network creates a lot of overhead. That's why messages are written into Kafka in batches; a batch is a collection of messages produced for the same topic and partition. Sending messages in batches provides a trade-off between latency and throughput and can be controlled by adjusting a few producer settings, as sketched below. Additionally, batches can be compressed, which provides even more efficient data transfer. Even though we already established that Kafka messages are just simple byte arrays, in most cases it makes sense to give the message content additional structure. There are multiple schema options available; the most popular ones are JSON, XML, Avro, and Protobuf.
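The batching behaviour is controlled through producer configuration. The following is a minimal sketch with the Java producer; the concrete values are illustrative only and should be tuned for the actual workload.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Collect up to 64 KB of records for the same partition into one batch...
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        // ...and wait up to 20 ms for a batch to fill before sending it.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);
        // Compress whole batches on the wire to reduce network and disk usage.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        return new KafkaProducer<>(props);
    }
}
```

Larger batch.size and linger.ms values generally improve throughput at the cost of latency, which is exactly the trade-off described above.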

We already described what topics and partitions are, but let's emphasise again that there are no guarantees about message timing or ordering across multiple partitions of the same topic. The only way to achieve ordering for all messages is to have only one partition. By doing that, we can be sure that events are always ordered by the time they were written into Kafka.

Another important concept when it comes to partitions is that each partition can be hosted on a different server. This means that a single topic can be scaled horizontally across multiple servers to improve throughput. A Kafka cluster wouldn't be very useful without its clients: the producers and consumers of messages.

Producers create new messages and send them to a specific topic. If a partition is not specified and the topic has multiple partitions, messages are written across the partitions evenly. This can be controlled further by using a consistent message key, as we described earlier. Consumers, on the other hand, subscribe to one or more topics and read messages in the order they were produced within each partition. The consumer keeps track of its position in the stream of data by remembering which offset was already consumed. Offsets are assigned at the time a message is written to Kafka, and each offset corresponds to a specific message in a specific partition. Each partition has its own sequence of offsets, and it's up to the consumer to remember the offset it has reached in each partition. By storing offsets in ZooKeeper or in Kafka itself, a consumer can stop and restart without losing its position in the stream of data. Consumers always belong to a specific consumer group. Consumers within a consumer group work together to consume a topic, and the group makes sure that each partition is only consumed by one member of the group.

This way, consumers can scale horizontally to consume topics with a large number of messages. Additionally, if a single consumer fails, the remaining members of the group rebalance the partitions to make up for the missing member. If we want to consume the same messages multiple times, we have to make sure the consumers belong to different consumer groups. A minimal consumer sketch follows below.
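Here is a minimal sketch of such a consumer with the Java client; the group id, topic name, and broker address are assumptions made for the example. Running several copies of this process with the same group id spreads the topic's partitions across them.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MetricsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // All consumers sharing this group id split the topic's partitions between them.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "metrics-processor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit offsets manually after processing instead of on a timer.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("metrics"));
            while (true) {
                // Pull-based: the consumer asks for data at its own pace.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Store the consumed offsets in Kafka so a restart resumes where we left off.
                consumer.commitSync();
            }
        }
    }
}
```

Because the consumer decides when to call poll(), it naturally applies back pressure: a slow consumer simply falls behind rather than being overwhelmed.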

Using separate consumer groups is useful if we have multiple applications that have to process the same data independently.

As mentioned before, a Kafka cluster consists of multiple servers called brokers. Depending on the specific hardware, a single broker can easily handle thousands of partitions and millions of messages per second. Kafka brokers are designed to operate as part of a cluster.

Within a cluster of brokers, one broker also acts as the cluster controller. The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. A partition is always owned by a single broker in the cluster, which is called the leader of the partition.

The partition may be assigned to multiple brokers, which will result in the partition being replicated. This provides redundancy of messages in the partition, such that another broker can take over leadership if there is a broker failure. However, all consumers and producers operating on that partition must connect to the leader.
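One way to see which broker currently leads each partition, and which replicas are in sync, is to describe the topic with the AdminClient. A small sketch, reusing the assumed metrics topic and broker address from the earlier examples:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowPartitionLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description =
                    admin.describeTopics(List.of("metrics")).all().get().get("metrics");
            for (TopicPartitionInfo partition : description.partitions()) {
                // Each partition has exactly one leader broker; the other replicas follow it.
                System.out.printf("partition=%d leader=%s replicas=%s in-sync=%s%n",
                        partition.partition(), partition.leader(),
                        partition.replicas(), partition.isr());
            }
        }
    }
}
```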

One of the key features of Kafka is retention, which provides durable storage of messages for some period of time. Kafka brokers are configured with a default retention setting for a topic: either retaining messages for some period of time (seven days by default) or until the topic reaches a certain size in bytes, for example one gigabyte. Once these limits are reached, the oldest messages expire and are deleted, so the retention configuration defines the minimum amount of data available at any time.

Individual topics can also configure their own retention settings. For example, a topic for storing metrics might have a very short retention of a few hours, while a topic containing bank transfers might have a retention policy of a few months, as in the sketch below.
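Per-topic retention is just topic configuration. As a rough sketch, it can be overridden with the Java AdminClient; the hypothetical metrics topic from earlier is used here, and the same settings can also be applied with Kafka's command-line tooling.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource metricsTopic = new ConfigResource(ConfigResource.Type.TOPIC, "metrics");
            Collection<AlterConfigOp> ops = List.of(
                    // Keep metrics for only six hours...
                    new AlterConfigOp(new ConfigEntry("retention.ms",
                            String.valueOf(6L * 60 * 60 * 1000)), AlterConfigOp.OpType.SET),
                    // ...or until a partition grows past roughly one gigabyte.
                    new AlterConfigOp(new ConfigEntry("retention.bytes",
                            String.valueOf(1024L * 1024 * 1024)), AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(metricsTopic, ops)).all().get();
        }
    }
}
```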

Reliability is often discussed in terms of guarantees: certain behaviours of a system that should be preserved under different circumstances. Understanding those guarantees is critical for anyone trying to build reliable applications on top of Kafka.

These are the most common Kafka reliability guarantees. Kafka guarantees the order of messages in a partition: if message A was written before message B, using the same producer in the same partition, then Kafka guarantees that the offset of message B will be higher than that of message A. This means that consumers will read message A before message B.

Messages are considered committed when they are written to the leader and all in-sync replicas. The number of required in-sync replicas and the number of acknowledgements the producer waits for can be configured, as sketched below. Committed messages won't be lost as long as at least one replica remains alive and the retention policy holds. Consumers can only read committed messages.
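On the producer side, this mostly comes down to a couple of settings. A hedged sketch is shown below; the matching topic- or broker-level min.insync.replicas setting is only mentioned in a comment, since it is configured on the cluster rather than in the client.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Wait until the leader and all in-sync replicas have the message
        // before treating the send as successful.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Keep retrying transient failures rather than silently dropping messages.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

        // On the topic (or broker) side, min.insync.replicas=2 would additionally
        // reject writes when fewer than two replicas are in sync.
        return new KafkaProducer<>(props);
    }
}
```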

Kafka provides at-least-once message delivery semantics; it doesn't prevent duplicate messages from being produced. The important thing to note is that, even though Kafka provides at-least-once delivery, it does not, by itself, provide exactly-once semantics. To achieve that, we have to either rely on an external system with some support for unique keys, or use Kafka Streams, as in the sketch below.
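As a rough illustration of the Kafka Streams route, enabling stronger processing guarantees is largely a configuration switch. The topic names and application id below are assumptions made for the example, and the exactly_once_v2 guarantee assumes a reasonably recent Kafka version.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Ask Kafka Streams to use transactions so each input record affects the
        // output exactly once, even across failures and restarts.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        // Trivial topology: copy records from an input topic to an output topic.
        builder.stream("payments").to("payments-processed");

        new KafkaStreams(builder.build(), props).start();
    }
}
```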

Also remember that, even though these basic guarantees can be used to build a reliable system, there is much more to it. In Kafka, there are a lot of trade-offs involved in building a reliable system. The usual trade-offs are reliability and consistency versus availability, high throughput, and low latency.

Let's review both the pros and cons of choosing Kafka.

Pros:

  • Tackles integration complexity.
  • Great tool for ETL or CDC.
  • Great for big data ingestion.
  • High throughput.
  • Disk-based retention.
  • Supports multiple producers and consumers.
  • Highly scalable, fault-tolerant.
  • Fairly low latency.
  • Highly configurable.
  • Provides back pressure.

Cons:

  • Requires a fair amount of time to understand, and to avoid accidentally shooting yourself in the foot.
  • Kafka might not be the best solution for hard real-time systems.

A lot of things in Kafka were purposely named to resemble a JMS-like messaging system. This makes people wonder what the actual differences are between Kafka and traditional message brokers such as RabbitMQ or ActiveMQ. First of all, the main difference is that Kafka consumers pull messages from the brokers, which allows messages to be buffered for as long as the retention period holds. Most other messaging systems push messages to the consumers instead, which makes things like back pressure really hard to achieve. Kafka also makes replaying events easy, as messages are stored on disk and can be replayed at any time, as the sketch below shows.
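Replaying is a matter of moving a consumer's offsets back. A minimal sketch with the Java consumer, reusing the assumed metrics topic and broker address from earlier, rewinds to the oldest messages that are still retained:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "metrics-replayer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("metrics"));
            // poll() once so the group co-ordinator assigns us partitions...
            consumer.poll(Duration.ofSeconds(1));
            // ...then rewind every assigned partition to the oldest retained offset.
            consumer.seekToBeginning(consumer.assignment());

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
            }
        }
    }
}
```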

Besides that, Kafka guarantees the ordering of messages within one partition, and it provides an easy way of building scalable and fault-tolerant systems. Time for a quick summary: in the era of ever-growing data volumes and integration complexity, having a reliable and high-throughput messaging system that can be easily scaled is a must. Kafka is one of the best available options that meet those criteria, and it has been battle-tested for years by some of the biggest companies in the world. We have to remember, though, that Kafka is a fairly complex messaging system, and there is a lot to learn to use it to its full potential without shooting ourselves in the foot.

Kafka's Architecture

Kafka's architecture is built around a cluster of brokers, together with the producers and consumers that connect to it. A single Kafka server within a cluster is called a broker. A Kafka cluster usually consists of multiple brokers to ensure redundancy and high availability.

Messages in Kafka are categorised into topics, which can be further divided into partitions. Each partition acts as a separate commit log, and the order of messages is guaranteed only within the same partition. This partitioning allows Kafka to scale horizontally as each partition can be hosted on a different server.

You might find useful

Leadership Election Among Brokers

Kafka uses ZooKeeper to handle leadership election among brokers. Each partition has a single leader broker that handles all reads and writes, ensuring consistency and durability of data. The leader election process is crucial for fault tolerance and is managed through a consensus algorithm in ZooKeeper, ensuring that only one leader is active at any time.

Handling Back Pressure

Kafka manages back pressure effectively by allowing consumers to pull data at their own pace rather than pushing it to them. This pull-based model prevents consumers from being overwhelmed by data flow, which is particularly important in systems with highly variable load.

Serialisation: Avro vs. Protobuf

Serialisation frameworks like Avro and Protobuf are crucial in Kafka for efficient data encoding and decoding. Avro supports schema evolution, which allows producers and consumers to understand the data as schemas evolve. Protobuf, known for its efficiency and speed, is more compact and faster than Avro, making it suitable for environments where performance is critical.

Considerations

Kafka's design includes advanced features such as replication, fault tolerance, and flexible consumer groups. Replication across brokers ensures that data is not lost even if a broker fails, while consumer groups allow multiple consumers to work together to process data more efficiently. The configuration of these features can significantly affect Kafka's performance and reliability.

Questions

Here are 40 flashcard-style questions and answers covering some of the confusing and important topics related to Apache Kafka.

  1. What is Apache Kafka?
    • Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java.
  2. What type of messaging system is Kafka described as?
    • Kafka is often described as a distributed event log.
  3. How does Kafka differ from traditional messaging systems?
    • Kafka differs by persisting messages on disk with a configurable retention policy, acting as a hybrid between a messaging system and a database.
  4. What are the main components of Kafka's architecture?
    • Producers, consumers, brokers, topics, and partitions.
  5. What role do producers play in Kafka?
    • Producers send messages to Kafka topics.
  6. What is the function of a Kafka consumer?
    • Consumers read messages from topics they are subscribed to and maintain their position in the stream of data.
  7. What is a Kafka Broker?
    • A Kafka Broker is a server in the Kafka cluster that stores data and serves clients.
  8. What is a topic in Kafka?
    • A topic is a category or feed name to which records are published by producers.
  9. Explain partitions within a Kafka topic.
    • Partitions allow a topic's log to be scaled by splitting the data across multiple brokers. Each partition can be hosted on a different server.
  10. How does Kafka achieve message durability?
    • Kafka persists all messages on disk and allows the configuration of replication factors for topics to ensure data is not lost.
  11. What is a retention policy in Kafka?
    • A retention policy in Kafka determines how long messages are kept before they are deleted.
  12. What does it mean that Kafka guarantees "at least once" delivery?
    • Kafka ensures that messages are delivered at least once to a consumer, but this may result in duplicate messages.
  13. How can message order be guaranteed in Kafka?
    • Message order is guaranteed within a single partition, but not across multiple partitions.
  14. What is the role of ZooKeeper in Kafka?
    • ZooKeeper manages brokers, maintains their list, and helps in leader election for partitions.
  15. How does Kafka handle failure of a broker?
    • Kafka handles broker failures through its replication mechanism; if a leader broker fails, another broker with a replica of the same data can take over as the leader.
  16. What is the significance of the offset in Kafka?
    • An offset is a sequential id number of a record within a partition that uniquely identifies each record within that partition.
  17. How does Kafka support high throughput?
    • Kafka supports high throughput by batching messages and writing them to disk sequentially, reducing the number of disk seeks.
  18. What is back pressure in Kafka, and how is it handled?
    • Back pressure occurs when consumers cannot keep up with the rate of messages being produced. Kafka handles this by allowing consumers to pull messages at their own pace rather than pushing messages to them.
  19. What are consumer groups in Kafka?
    • Consumer groups allow multiple consumers to co-ordinate consumption from multiple partitions, ensuring each partition is only consumed by one consumer from the group.
  20. How does Kafka support schema evolution?
    • Kafka supports schema evolution through serialisation frameworks like Avro, which allows schemas to evolve without breaking existing applications.
  21. What is the purpose of Kafka's partitioning feature?
    • Kafka's partitioning feature increases scalability by spreading data across multiple brokers and allowing parallel consumption.
  22. Can you define what a Kafka Cluster is?
    • A Kafka cluster is a group of one or more brokers working together to distribute data load and ensure high availability.
  23. What mechanism does Kafka use for fault tolerance?
    • Kafka uses data replication across multiple brokers to ensure fault tolerance and data durability.
  24. What is a leader partition in Kafka?
    • In each partition, one broker is elected as the leader that handles all reads and writes for that partition, ensuring data consistency.
  25. How does Kafka manage data consistency?
    • Kafka ensures data consistency by using a leader for each partition, where all data writes and reads are processed, and followers replicate the leader’s data.
  26. What is the role of replicas in Kafka?
    • Replicas are copies of partitions that exist on different brokers to ensure high availability and resilience against broker failures.
  27. Explain the significance of the offset in Kafka's data handling.
    • The offset is a unique identifier for each record in a partition, allowing consumers to track their position and ensure they read each message once.
  28. What does Kafka's "exactly once" semantics mean?
    • "Exactly once" semantics ensures that messages are delivered once and only once, preventing data duplication in fault-tolerant settings.
  29. How does Kafka handle large data volumes?
    • Kafka handles large data volumes by storing messages in a compressed, sequential log format which facilitates efficient data writing and reading.
  30. What is a Kafka Connector?
    • Kafka Connectors are ready-to-use components that simplify integrating Kafka with external systems like databases, key-value stores, search indexes, and file systems.
  31. Describe a Kafka Consumer Group.
    • A Kafka Consumer Group is a group of consumers that cooperatively consume data from one or more topics, where each partition is consumed by only one member of the group.
  32. What is the default retention time for messages in Kafka?
    • By default, Kafka retains messages for seven days, but this can be configured per topic based on time or storage space.
  33. How can you recover from a Kafka broker failure?
    • Recovery from a broker failure is managed by Kafka’s replication mechanism, where another broker with a replica of the same data can automatically take over as the leader.
  34. What types of data can Kafka handle?
    • Kafka can handle any type of data, as messages are essentially arrays of bytes.
  35. How does Kafka's log compaction feature work?
    • Log compaction retains at least the last known value for each key within the log, thus cleaning up obsolete records and preserving state over time.
  36. What are Kafka Streams?
    • Kafka Streams is a client library for building applications and microservices where the input and output data are stored in Kafka clusters.
  37. What is the impact of having multiple partitions in a Kafka topic?
    • Multiple partitions in a Kafka topic allow for increased parallelism, higher throughput, and better load distribution across the Kafka cluster.
  38. How does Kafka ensure message ordering within a partition?
    • Kafka ensures message ordering within a partition by assigning a sequential, immutable offset to each message as it is appended to the log.
  39. Can Kafka be used for real-time processing?
    • Yes, Kafka is highly suitable for real-time data processing due to its high throughput and low latency capabilities.
  40. What is the benefit of Kafka's immutability of data?
    • The immutability of data in Kafka ensures that once data is written, it cannot be changed, thereby simplifying concurrency and making data reliable and consistent.

I hope this helps a bit.