Apache Kafka
This is an extension of my 2025 Learning Log.
⚠️ This post is currently a work-in-progress as I am still going through the materials
I started learning Apache Kafka. I wanted to study Flink actually but since it comes downstream of Kafka, I figured I might as well learn a bit more about Kafka first.
I am learning through this course The Complete Apache Kafka Practical Guide. The version used in the videos is a bit older and still uses Zookeeper. So I’m going back and forth between this course and other materials I find in the official docs, the interwebs, and/or YouTube.
Kafka has been moving away from using Zookeeper and has introduced the KRaft protocol (Kafka Raft Metadata mode). This eliminates the need for Zookeeper, making Kafka simpler to deploy and manage.
Introduction
Here is an introduction to Kafka from the official docs.
event
- an indication in time that something took place
- e.g. it represents a thing that happened in the business
- stored in topics
log
- structure that stores events (events are not stored in databases)
- ordered sequence of events
- easy to build at scale, easier than databases
Apache Kafka
- is a system for managing these logs
- think of events first, things second
topics
- ordered sequence of events stored in a durable way (stored on disk, replicated)
- can be stored for a short or a long time
- topics can be small or enormous
Programs can talk to each other through Kafka topics. Each program consumes a message from a Kafka topic, does computations (e.g. grouping, filtering, enriching the data from another topic), then produces that message into another separate Kafka topic to be likewise durably stored and processed - e.g. by a new service that performs real-time analysis of the data.
Kafka Connect
- enables Kafka to connect with other systems using connectors
Kafka Streams
- API that handles all the framework and infrastructure to process data in Kafka in a scalable and fault-tolerant way
Installation
There are two ways to install Kafka - using the binary, and from Docker. Using Docker is fairly straightforward, and there is no need to manually install or manage a compatible Java version. I tried the steps in this article, which uses a Docker Compose file.
I also tried using the binary, which requires a prior installation of Java version 17 and up. My machine defaults to version 11, so I had to spend some time linking the appropriate version that already exists (in my case, version 23).
Here are the rough steps.
Install and unpack
curl https://dlcdn.apache.org/kafka/4.0.0/kafka_2.13-4.0.0.tgz -o ~/Downloads/kafka.tgz
mkdir kafka
cd kafka
tar -xvzf ~/Downloads/kafka.tgz --strip 1
Install Java
# check java version or if java is present
java -version
# to install
# use apt install
sudo apt install openjdk-17-jdk
# or use brew install
brew install openjdk@23
For an existing installation, make sure it's the one being linked when calling the java binary from the terminal. I used this Stack Overflow troubleshooting as a reference.
sudo ln -sfn /opt/homebrew/opt/openjdk\@23/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk
export JAVA_HOME=`/usr/libexec/java_home -v 23`
echo 'export PATH="/opt/homebrew/opt/openjdk/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
java -version
Then I followed the rest of the steps in the installation/quick start docs.
Generate a Cluster UUID
$ KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
Format Log Directories
$ bin/kafka-storage.sh format --standalone -t $KAFKA_CLUSTER_ID -c config/server.properties
Start the Kafka Server
$ bin/kafka-server-start.sh config/server.properties
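This runs the broker in the foreground and ties up the terminal. As a small sketch (not from the course), the same start script also accepts a -daemon flag to run in the background, and the installation ships with a matching stop script:

# start the broker in the background
$ bin/kafka-server-start.sh -daemon config/server.properties
# stop the broker when done
$ bin/kafka-server-stop.sh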
Creating a Kafka topic
Use the binaries located in the installation folder, e.g. /opt/kafka/bin. To create a topic, use the script ./kafka-topics.sh. Both the --bootstrap-server and --topic flags are required, as well as the --create flag to denote the action. Since I'm not using Zookeeper, I no longer specify the flag for it and instead use --bootstrap-server.
./kafka-topics.sh --create --bootstrap-server localhost:9092 --topic cities
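The command above creates the topic with the defaults: a single partition and a replication factor of 1. As a sketch, the same script also accepts --partitions and --replication-factor to override those defaults (the topic name below is just an example; on my single-broker setup the replication factor can only stay at 1):

./kafka-topics.sh --create --bootstrap-server localhost:9092 --topic cities-by-region --partitions 3 --replication-factor 1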
To list existing topics, use the flag --list
./kafka-topics.sh --list --bootstrap-server localhost:9092
To show details of a topic, use the flag --describe
./kafka-topics.sh --describe --bootstrap-server localhost:9092
output
Topic: cities TopicId: aQjPyaLhTU-Toc1VxCugYw PartitionCount: 1 ReplicationFactor: 1 Configs:
Topic: cities Partition: 0 Leader: 1 Replicas: 1 Isr: 1 Elr: LastKnownElr:
ReplicationFactor=1
- only one broker will store the topic's data
To delete a topic, use the flag --delete and specify the topic name using --topic
./kafka-topics.sh --delete --topic my-topic --bootstrap-server localhost:9092
Producing and consuming messages
The Kafka installation comes with a producer called kafka-console-producer.sh which can be used to send messages to a topic
./kafka-console-producer.sh --topic cities --bootstrap-server localhost:9092
There will be a prompt for entering messages, for example names of cities
>New York
>Berlin
>Paris
>London
>Sydney
Run a consumer. To do so, open another terminal and use the built-in consumer script kafka-console-consumer.sh to consume messages from the topic cities
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic cities
Adding Delhi in the producer should return Delhi to the consumer.
Delhi
Adding Dubai in the producer should likewise return Dubai to the consumer.
Dubai
By default, the consumer only returns messages produced after it starts. To show earlier messages, use the flag --from-beginning. This will show the previous cities, including the two newly added.
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic cities --from-beginning
results
New York
Berlin
Paris
London
Sydney
Delhi
Dubai
Run multiple consumers
An Apache Kafka cluster stores messages even if they were already consumed by one of the consumers. The same messages may be read multiple times by different consumers at different moments in time.
Launch another consumer and add Amsterdam in the producer; Amsterdam should show up in both consumers.
Multiple consumers and multiple producers can exchange messages via a single centralized storage point - the Kafka cluster.
Run multiple producers
Launch a new producer and write Barcelona; consumer 1 and consumer 2 should also show Barcelona.
Producers and consumers don't know about each other; they may appear and disappear. Kafka doesn't care about that. Its job is to store messages and receive or send them on demand.
Stopping one of the producers will not affect the rest. Likewise, stopping a consumer will not affect the others.
Where does Kafka store messages
To find out where Kafka stores these messages, find the log path in the server.properties file
cat /opt/kafka/config/server.properties
...
log.dirs=/tmp/kraft-combined-logs
Access the logs directory
cd /tmp/kraft-combined-logs
This directory contains 50 folders named __consumer_offsets-<number>, from 0 to 49. These are the 50 partitions of the system topic called __consumer_offsets. It also shows a folder corresponding to the previously defined topic: cities-0, the single partition of that topic.
cd cities-0/
ls
cat 00000000000000000000.log
Kafka stores messages in this *.log file as text
Kafka doesn't store all messages forever; after a specific amount of time (or when the size of the log exceeds the configured maximum size), messages are deleted.
The default log retention period is 7 days (168 hours).
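The retention can also be overridden per topic. A minimal sketch using kafka-configs.sh against the cities topic from earlier (retention.ms is the per-topic setting, in milliseconds; the one-day value is just an example):

# show current per-topic overrides
./kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name cities --describe
# keep messages for 24 hours (86400000 ms) instead of the 7-day default
./kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name cities --alter --add-config retention.ms=86400000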
To show the application logs, go to the logs directory; there will be dated logs inside, e.g. /opt/kafka/logs/server.log.2025-03-24-11
Partitions are spread among all available brokers. Every consumer must belong to a consumer group. Every message inside a topic has a unique number called an "offset". The first message in each partition has offset 0. Consumers start reading messages from a specific offset, e.g.
message --------- offset
>New York -------- 0
>Berlin ---------- 1
>Paris ----------- 2
>London ---------- 3
>Sydney ---------- 4
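To see consumer groups and offsets together, the console consumer can join a named group, and kafka-consumer-groups.sh then reports, per partition, how far that group has read. A rough sketch (the group name cities-group is just an example):

# consume as part of a named consumer group
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic cities --group cities-group --from-beginning
# in another terminal: show the current offset and lag per partition for that group
./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group cities-group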
Apache Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It stores messages created by Producers and makes them available to Consumers. Producers and Consumers operate independently of one another.
Distributed means it is a fault-tolerant, resilient system with the ability to create large clusters with many different servers; when one or more servers fail, other servers continue operation to serve publishers and consumers, and when everything is set up correctly, not even one message will be lost.
Broker
Producers produce messages to Kafka brokers. Consumers consume messages from Kafka brokers
flowchart LR;
Producer --> Broker --> Consumer
Broker responsibilities include
- receiving messages from Producers
- storing those messages
- giving consumers the ability to consume those messages
Brokers store messages in files; Producers append messages to those files, and Consumers read from those files.
It's possible to have multiple Producers and multiple Consumers. Multiple Producers are able to produce messages to the Broker, and multiple Consumers are able to consume messages from the Broker.
flowchart LR
Producer1 --> Broker --> Consumer1
Producer2 --> Broker --> Consumer2
Producer3 --> Broker --> Consumer3
Messages may be produced and consumed asynchronously at different moments of time.
For a single broker system, if the broker fails nothing will be able to produce or consume messages - that’s why it’s best to create broker clusters.
A single Kafka server can run multiple Kafka brokers; a group of brokers working together is called a broker cluster.
flowchart LR
subgraph BrokerCluster
B["Broker"]
C["Broker"]
D["Broker"]
E["Broker"]
end
Producer1 --> BrokerCluster --> Consumer1
Producer2 --> BrokerCluster --> Consumer2
Producer3 --> BrokerCluster --> Consumer3
Different Producers and different Consumers are able to interact with different Brokers in the cluster. Every Producer can send messages to different Kafka brokers, and each Kafka broker will store part of the messages - all messages from the Producers will be spread across different servers. Kafka consumers might read messages from different Kafka brokers. If one of the brokers fails, other brokers will take over and continue operation of the cluster.
Zookeeper is responsible for broker synchronization.
Zookeeper
Also used in Apache Hadoop, Apache Solr
flowchart LR
subgraph BrokerCluster
B["Broker"]
C["Broker"]
D["Broker"]
E["Broker"]
end
Zookeeper <--> BrokerCluster
Producer1 --> BrokerCluster --> Consumer1
Producer2 --> BrokerCluster --> Consumer2
Producer3 --> BrokerCluster --> Consumer3
Zookeeper maintains a list of active brokers. It knows, at a given moment, which brokers are active, and which have failed.
It elects a controller among brokers in a cluster; there’s only one controller in each cluster.
Lastly, it manages the configuration of topics and partitions. When a topic is created in the cluster, it is registered in Zookeeper, and Zookeeper distributes this configuration to all brokers in the cluster.
If the Zookeeper in a single Zookeeper system fails, the whole system will be down. Hence, it is also possible to create a cluster of Zookeepers.
Zookeeper ensemble
flowchart LR
subgraph BrokerCluster
B["Broker"]
C["Broker"]
D["Broker"]
E["Broker"]
end
subgraph ZookeeperEnsemble
F["Zookeeper"]
G["Zookeeper"]
H["Zookeeper"]
end
ZookeeperEnsemble <--> BrokerCluster
Producer1 --> BrokerCluster --> Consumer1
Producer2 --> BrokerCluster --> Consumer2
Producer3 --> BrokerCluster --> Consumer3
A cluster of Zookeepers is called a Zookeeper ensemble.
In every Zookeeper cluster, a quorum should be set. The quorum is the minimum number of servers that must be running in order to have an operational cluster. Otherwise, the Zookeeper cluster will be down, along with all connected brokers, i.e. the entire Apache Kafka cluster will be down as well.
It is recommended to have an odd number of servers in the Zookeeper ensemble (e.g. 1, 3, 5, 7, etc.) with the quorum set to (n+1)/2, where n is the number of servers (e.g. for n=5, the quorum is 3). This prevents exactly half of the servers from continuing to manage an actually broken cluster (for example, with 4 servers and a quorum of 2, the 2 servers hosted in a different zone would keep working even if the other half of the cluster is already inactive, which leads to inconsistent messages).
Multiple Kafka clusters
It is possible to set up several clusters of Apache Kafka. To facilitate synchronization (e.g. when clusters are hosted in different zones), set up mirroring between the different clusters. Without mirroring, the clusters in different regions remain completely independent.
Default ports of Zookeeper and Broker
The default ports are the following:
- Zookeeper localhost:2181
- Kafka server (broker) localhost:9092
If launching multiple Zookeepers or broker servers on the same computer, assign different ports to each by creating different configuration files with different ports; it's also a good idea to create separate local folders for every instance
e.g. Zookeeper localhost:2181, :2182, :2183; brokers localhost:9092, :9093, :9094
If running Zookeepers and Brokers on different computers, there is no need to change ports
If Brokers should be publicly accessible, the advertised.listeners property in the Broker config should be adjusted, since localhost is not accessible from an outside network.
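As a sketch of what that could look like for a second broker on the same machine in the Zookeeper-based setup described here (the id, ports, hostname, and paths below are illustrative, not taken from my config):

# server-1.properties - second broker on the same host
broker.id=1
listeners=PLAINTEXT://0.0.0.0:9093
advertised.listeners=PLAINTEXT://broker1.example.com:9093
log.dirs=/tmp/kafka-logs-1
zookeeper.connect=localhost:2181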
Kafka Topic
Every topic has its own unique name.
Messages are contained in each topic, and each message has its own unique number called offset.
A producer may only append to the end of the log. Every log record is immutable.
The Broker may delete messages from the other end (the oldest ones). By default the retention period is 168 hours, or 7 days.
Message structure
- Timestamp
- can be assigned by Kafka broker or producer (configurable)
- Offset number
- unique only within a partition of a specific topic
- Key (optional)
- used as additional grouping mechanism of messages inside a topic
- created on producers and sent to Kafka brokers
- if several messages have the same key, they will be sent to the same partition
- Value
- sequence of bytes
- the message body
Keep messages as small as possible to achieve maximum efficiency of Kafka cluster
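The optional key can be tried out with the console tools. A small sketch, assuming the cities topic: parse.key and key.separator make the producer read each line as key:value, while print.key and print.partition make the consumer show which key and partition each message ended up with.

# produce keyed messages, entered as key:value, e.g. europe:Berlin
./kafka-console-producer.sh --bootstrap-server localhost:9092 --topic cities --property parse.key=true --property key.separator=:
# consume and print the key and partition alongside each value
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic cities --from-beginning --property print.key=true --property print.partition=true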
Topics and partitions
The same topic can be located on different brokers for fault tolerance. E.g. given topic A on broker 0 and broker 1: if broker 0 fails, topic A is still present on broker 1 and can still accept new messages from producers and new read requests from consumers.
Topics can have one or more partitions, which can be located on one or more brokers. Spreading partitions across multiple brokers (computers) lets writes happen in parallel, which is quicker.
Spreading messages across partitions
If a topic (e.g. cities) is created with the default configuration (single partition), the broker will create a single folder for that partition, e.g. cities-0.
Every partition is a separate folder with files. If multiple partitions are spread across different brokers, each broker may have one folder per partition it hosts, e.g. cities-0, cities-1, cities-2,…
If there are multiple partitions on multiple computers, producers may write messages into different partitions.
Partition numbering starts from 0.
Offsets are also given to messages in each partition (0,1,2,3,…)
Each partition has a unique number within a topic, even if that topic is spread across multiple brokers. Offset numbers can repeat, as long as they are in different partitions.
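Since offsets are per partition, the console consumer can also start from an explicit offset in a specific partition, which makes the per-partition numbering visible. A sketch against the cities topic, which only has partition 0:

# read partition 0 starting from offset 2, i.e. skip the first two messages
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic cities --partition 0 --offset 2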
If Broker 1 containing Partition 1 fails, the messages inside that partition will be lost and can no longer be consumed - unless replication is set up on different brokers.
Partition Leader and Followers
When creating replicas, a leader partition and follower partitions are created.
Followers are partitions that get messages from the Leader partition and replicate them; they don’t accept read or write requests from consumers or producers.
If the Leader partition fails, one of the followers will become the new leader, and messages in the failed partition will not be lost.
Plan resources of the brokers accordingly.
It is recommended to create at least 2 replicas (3 brokers). The replication factor is configured on a per-topic basis; by default it is set to 1 (every message is stored only once, in one broker). In production it is recommended to set it to 3.
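As a sketch of what that looks like on a cluster with three brokers (not my single-broker setup; the topic name is just an example), the describe output then lists the Leader, Replicas, and in-sync replicas (Isr) for each partition:

./kafka-topics.sh --create --bootstrap-server localhost:9092 --topic cities-replicated --partitions 3 --replication-factor 3
./kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic cities-replicated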
Fallback partitions can also be configured; these are passive partitions that take over in case an active follower fails.