Kafka - Introduction

[Figure: Kafka architecture overview]

Introduction

Apache Kafka is an open-source, distributed event streaming platform that is widely used for building real-time data pipelines and streaming applications. It allows for the storage and processing of large amounts of data in a fault-tolerant and scalable way.

Kafka is a publish-subscribe system, meaning it separates data producers from data consumers. Producers write data to topics, and consumers read from those topics. This decoupling allows for a high degree of flexibility and scalability.
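
To make this decoupling concrete, here is a minimal sketch using the third-party kafka-python client; the client library, broker address, and topic name are assumptions for illustration, not details from this article.

```python
# pip install kafka-python
from kafka import KafkaProducer, KafkaConsumer

# Producer: writes messages to a topic without knowing who will read them.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", value=b"page_view:/home")
producer.flush()  # block until buffered messages are actually sent

# Consumer: reads the same topic, entirely independent of the producer.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the log
    consumer_timeout_ms=5000,      # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.offset, message.value)
```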

One of the key features of Kafka is its ability to handle large amounts of data in real time. It achieves this through a distributed architecture in which data is replicated across multiple nodes for fault tolerance and scalability, enabling high throughput at low latency.

Another important feature of Kafka is its fault tolerance when handling data streams. If a node in a Kafka cluster goes down, the data remains available on replicas hosted by other nodes and can be replayed from there. This makes Kafka well suited for mission-critical applications that require high availability.

Kafka also provides a variety of built-in tools for data management and processing, such as Kafka Connect and Kafka Streams. These tools simplify integration with other systems and the processing of data streams.

Kafka is used by many companies and organizations for a variety of use cases, such as real-time analytics, data integration, and event-driven applications. Some examples of companies that use Kafka include LinkedIn, Netflix, and Uber.

Overall, Kafka is a powerful and flexible platform for building real-time data pipelines and streaming applications. Its ability to handle large amounts of data in real-time, its fault-tolerance, and its built-in data management and processing tools make it a valuable tool for many different use cases.

Kafka Architecture

Within Kafka, each unit of data in the stream is called a message. Messages could be clickstream data from a web app, point-of-sale data for a retail store, user data from a smart device, or any other events that underlie your business. Applications that send the message stream into Kafka are called producers. Kafka servers, called brokers, receive the stream and write the messages sequentially to immutable log files.

Messages with similar traits may be categorized into groups called topics. Applications called consumers subscribe to topics and process the messages. You might be familiar with some of this terminology if you’ve used a traditional messaging system or publish-subscribe system.

The key components of Kafka include:

  • Producer: Producers publish (write) messages to a Kafka topic.
  • Consumer: Consumers/subscribers subscribe to topics and process (read) the feed of published messages.
  • Broker: Brokers are Kafka servers that store data and serve clients. Multiple brokers form a cluster.
  • Topic: A topic is a named feed or category of messages to which producers publish.
  • Partition: Topics can be further divided into partitions. This increases parallelism and scalability, allowing consumers to read different partitions at the same time (see the sketch after this list).
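
To sketch how topics and partitions are declared, the example below uses kafka-python's admin client; the topic name, partition count, and replication factor are illustrative assumptions.

```python
# pip install kafka-python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# A "sales" topic split into 4 partitions; each partition is stored on
# 3 brokers (one leader plus two followers) for fault tolerance.
admin.create_topics(
    new_topics=[NewTopic(name="sales", num_partitions=4, replication_factor=3)]
)
admin.close()
```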

What are the Advantages of Kafka?

Kafka has some key advantages, primarily in its reliability, scalability, and speed. Below, we will explore each of these advantages.

Kafka Reliability

In a traditional messaging or pub-sub system, the producer sends a message to a queue where it waits for a consumer service to read it. The message is then removed from the queue. This design has some shortcomings. For example, there’s no way to recover messages if the consumer service fails. By contrast, Kafka is a persistent log-based message queue. In this type of system, brokers store incoming messages in the order they are received.

Kafka uses an offset to bookmark a consumer’s progress through the data, so if the service fails, it can pick up where it left off without duplicating any work. Messages remain on disk after they are read (subject to the topic’s retention policy), so they stay available for later analysis or for other consumers. Each consumer has its own offset, so multiple consumers can independently read from the same data stream. Kafka also uses a unique system of partitioning and replicating messages to support highly reliable data streaming.
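
As a minimal sketch of offset tracking (kafka-python; the group id, topic name, and processing function are assumptions), committing the offset only after processing means a restarted consumer resumes exactly where the previous run left off.

```python
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    # Hypothetical stand-in for real business logic.
    print("processing", payload)

consumer = KafkaConsumer(
    "sales",
    bootstrap_servers="localhost:9092",
    group_id="analytics",      # offsets are tracked per consumer group
    enable_auto_commit=False,  # commit explicitly, after processing succeeds
    auto_offset_reset="earliest",
)

for message in consumer:
    process(message.value)
    consumer.commit()  # bookmark progress; a restarted consumer resumes here
```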

Let’s look at a real-world example: a sales data stream for a retailer. The retailer’s point-of-sale machines generate an event for each transaction. Producer applications continuously feed these sales messages into Kafka. A Kafka broker writes the messages to a topic called “sales,” but instead of every sale being written to a single log file, the topic is partitioned based on a key such as the city in which the sale occurred. The retailer configures the producer to route messages according to this key, which ensures that the brokers write sales from the same city to a specific partition. Each partition is replicated across multiple Kafka brokers in the cluster. Within each partition, one broker acts as the leader and the remaining brokers are followers. The leader handles all the read and write requests for the partition, but if the leader goes down, a follower automatically takes over as the leader. With this fail-safe mechanism, you can reliably stream and store data without having to worry about routine outages.
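
A sketch of the keyed routing described above, again with kafka-python; the record fields and serialization choices are assumptions. Because Kafka hashes the key to pick a partition, every sale from the same city lands in the same partition, in order.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

sale = {"store_id": 17, "amount": 42.50, "city": "Austin"}

# The partitioner hashes the key, so all of a city's sales are routed
# to the same partition of the "sales" topic.
producer.send("sales", key=sale["city"], value=sale)
producer.flush()
```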

The consumer side of Kafka also has a fail-safe mechanism. You can create multiple instances of a consumer application to read messages from the same topic. Together, these instances make up a consumer group. Each partition is assigned to one consumer in the group. In our example, one consumer might process sales from City A and another consumer might process sales from City B. Businesses can add extra consumers to a group. These excess consumers sit idle but will seamlessly take over data processing if any of the active consumers fails. Thanks to the offset, a fallback consumer knows where to start reading incoming messages.
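
A sketch of that consumer-group behavior (the group name is an assumption): run the same script in several processes, and Kafka assigns each partition to exactly one of them; if one process dies, its partitions are rebalanced to the survivors, which resume from the committed offsets.

```python
from kafka import KafkaConsumer

# Run this same script N times; all instances share one group id.
consumer = KafkaConsumer(
    "sales",
    bootstrap_servers="localhost:9092",
    group_id="sales-processors",  # membership in the same consumer group
)

# Each instance is assigned a subset of the topic's partitions. If this
# process crashes, the group rebalances and another instance takes over,
# resuming from the last committed offset.
for message in consumer:
    print(f"partition={message.partition} offset={message.offset}")
```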

Kafka Scalability

Distributing a topic’s partitions across many brokers allows the topic to scale well beyond any single host. One cluster of Kafka brokers can host multiple topics, allowing you to scale several unique data streams. Developers can specify the number of partitions in each topic, and Kafka will automatically assign the partitions to existing brokers in the cluster, allowing easy scalability.

Kafka also enables consumer applications to process data at scale. Adding consumer instances to a group increases your processing capacity. Kafka brokers will automatically load-balance partitions among the consumer group, so a topic can be processed at scale. In addition, since multiple Kafka consumers can read data in parallel, you can quickly get different types of business insights. In our example from before, the retailer could use inventory management software and CRM software to process the same sales data at the same time.

Kafka Performance

Another advantage of Kafka is speed. Kafka can stream, store, and process millions of reads and writes every second. Kafka was designed for low latency and can be optimized for throughput by batching and compressing messages. Additionally, the Kafka fail-safe mechanism mentioned before helps keep data pipelines running smoothly.
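
As a sketch of the batching and compression knobs mentioned above, these are real kafka-python producer parameters, but the specific values are illustrative assumptions rather than recommendations.

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    compression_type="gzip",  # compress whole batches on the wire and on disk
    batch_size=32_768,        # accumulate up to 32 KB per partition batch
    linger_ms=20,             # wait up to 20 ms for a batch to fill
    acks="all",               # trade a little latency for durability
)
```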

How to Monitor Kafka in Production

Kafka lends a good deal of flexibility to developers. For example, after a consumer application processes streams of data, you can feed that data back into Kafka for consumption by other applications. In other words, the consumer of one data stream becomes the producer of another data stream. Hundreds or even thousands of derivative data streams can build on each other.
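
A sketch of that consume-transform-produce pattern (kafka-python; the topic names and the enrichment step are assumptions): the application reads one stream and emits a derivative stream for downstream consumers.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "sales",
    bootstrap_servers="localhost:9092",
    group_id="enricher",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The consumer of "sales" becomes the producer of "sales-enriched".
for message in consumer:
    enriched = dict(message.value, processed=True)  # hypothetical enrichment
    producer.send("sales-enriched", value=enriched)
```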

If your business generates large volumes of data, you can use Kafka to unlock interesting real-time business insights from your data with very little overhead. Companies that adopt Kafka often end up creating complex data pipelines to connect multiple streams of data together. That is the power of Kafka, but this complexity can make a Kafka deployment challenging to manage and monitor. If Kafka is the backbone of your business’s mission-critical, data-driven applications (as it often is), you need to continuously monitor your Kafka deployment to catch issues before your users are impacted.

[Figure: Kafka broker diagram]
