Mastering Data Streams: A Guide to Running Kafka Seamlessly on Kubernetes

August 13, 2023 by Dhawal

What is Kafka?

Apache Kafka is an open-source distributed event-streaming powerhouse. It drives seamless message exchange while powering real-time streaming pipelines and applications, propelling data from source to destination with unmatched agility.


Apache Kafka’s Core Advantages, Simplified

  1. Low Latency: Kafka delivers low end-to-end latency, even for vast data streams, ensuring quick data retrieval and seamless consumption thanks to message decoupling.
  2. Streamlined Real-Time Messaging: Kafka’s knack for message decoupling and effective storage enables real-time data publishing, subscription, and processing.
  3. Scalability Reinvented: Kafka’s distributed architecture effortlessly handles surges in data speed and volume, ensuring smooth scaling with no downtime.
  4. Reliability Beyond Compare: Kafka’s data duplication and distribution strategy guarantees data resilience, making it fault-tolerant and reliable.
  5. Integration Prowess: Kafka harmonizes with various data-processing frameworks and platforms like Spark, Storm, Hadoop, and AWS, enriching real-time data pipelines with seamless integration.

Apache Kafka – Cluster Architecture


Kafka Cluster Architecture Components:

Broker: The Communication Hub: A broker acts as a server hosting Kafka, facilitating communication between various services. Multiple brokers collaborate to form a Kafka cluster.

Event: Bytes of Wisdom: Events are the messages flowing in and out of the Kafka broker. These messages are stored as bytes in the broker’s disk storage.

Producer and Consumer: The Messengers: Producers create events in Kafka, while consumers retrieve them. Interestingly, a service might handle both roles.

Topic: The Organized Repository: Topics categorize events, akin to folders in a file system. Each holds messages of a specific type, like “payment-details” or “user-details.”

Partition: Dividing the Load: A topic can be divided into partitions, optimizing throughput. Partitions are the smallest storage units, containing subsets of data.

Replication Factor: Backup Brigade: Replicas are backups of partitions. A topic’s replication factor determines the number of copies of each partition. For instance, a topic with one partition and a replication factor of 2 results in two identical partition copies stored in the cluster.
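To make partitions and replication concrete, here is how a topic with three partitions and a replication factor of 2 could be created with the CLI that ships with Kafka (the topic name and bootstrap address are illustrative):

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic payment-details --partitions 3 --replication-factor 2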

Offset: The Consumption Bookmark: An offset is Kafka’s consumption marker, like a bookmark in a story. It tracks consumed events, helping consumers pick up where they left off, even after interruptions.
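Offsets can be inspected per consumer group with the stock CLI. A quick sketch, assuming a hypothetical group named payment-processors:

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group payment-processors

The output lists the current offset, log-end offset, and lag for each partition, showing exactly where each consumer left off.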

Zookeeper: Kafka’s Co-Pilot and Protector: An integral companion to Kafka, Zookeeper manages cluster ACLs, stores offsets, monitors broker status, and oversees client quotas, ensuring Kafka’s harmony and security.

Consumer Group: Unity for Efficient Consumption: In Kafka’s world, consumers unite as a consumer group to collectively consume messages from designated topics. Grouped consumers each handle distinct partitions, preventing message repetition. This collaboration enhances consumption speed, especially when multiple consumers tackle shared topics.
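Joining a consumer group is just a matter of passing the same group name. For example, running the command below in two terminals splits the topic’s partitions between the two consumers (the group and topic names are illustrative):

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic payment-details --group payment-processors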

Installation

K8s Deployment Steps.
  1. Make a project directory with any name and go inside it; for example, we are using kafka-deployment:

mkdir kafka-deployment

cd kafka-deployment

2. Download the Strimzi operator 0.29.0 zip:

wget https://github.com/strimzi/strimzi-kafka-operator/releases/download/0.29.0/strimzi-0.29.0.zip

3. Unzip the Strimzi operator file and go inside the folder:

unzip strimzi-0.29.0.zip

cd strimzi-0.29.0/

4. Edit the Strimzi installation files to use the namespace the Cluster Operator is going to be installed into. For example, in this procedure the Cluster Operator is installed into the namespace kafka:

sed -i 's/namespace: .*/namespace: kafka/' install/cluster-operator/*RoleBinding*.yaml
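You can optionally confirm the substitution took effect (a quick sanity check, not part of the official procedure):

grep "namespace:" install/cluster-operator/*RoleBinding*.yaml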

5. Edit the install/cluster-operator/060-Deployment-strimzi-cluster-operator.yaml file and set the STRIMZI_NAMESPACE environment variable to all namespaces (*) to give the Cluster Operator permission to watch all namespaces:

# ...
env:
- name: STRIMZI_NAMESPACE
  value: "*"
# ...

6. Create ClusterRoleBindings that grant cluster-wide access for all namespaces to the Cluster Operator.

kubectl create clusterrolebinding strimzi-cluster-operator-namespaced --clusterrole=strimzi-cluster-operator-namespaced --serviceaccount kafka:strimzi-cluster-operator

kubectl create clusterrolebinding strimzi-cluster-operator-watched --clusterrole=strimzi-cluster-operator-watched --serviceaccount kafka:strimzi-cluster-operator

kubectl create clusterrolebinding strimzi-cluster-operator-entity-operator-delegation --clusterrole=strimzi-entity-operator --serviceaccount kafka:strimzi-cluster-operator
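To optionally confirm the bindings were created (an extra sanity check, not part of the original procedure):

kubectl get clusterrolebinding | grep strimzi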

7. Deploy the Cluster Operator to your Kubernetes cluster.

Go into the strimzi-0.29.0 directory and run the command below:

kubectl create -f install/cluster-operator/ -n kafka
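Before moving on, it is worth checking that the operator deployment came up. Two optional commands, assuming the kafka namespace used above:

kubectl get deployments -n kafka

kubectl logs deployment/strimzi-cluster-operator -n kafka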

8. Create the Kafka deployment file (kafka-sample.yaml, shown below) and edit it as per your requirements (replicas, memory, advertisedHost/advertisedPort). Then apply it with the command below:

kubectl apply -f kafka-sample.yaml -n <namespace>

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafkaExporter:
    topicRegex: ".*"
    groupRegex: ".*"
  kafka:
    config:
      default.replication.factor: 3
      fetch.message.max.bytes: 10485760
      inter.broker.protocol.version: "3.0"
      max.message.bytes: 10485760
      message.max.bytes: 10485760
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      replica.fetch.max.bytes: 10485760
      transaction.state.log.min.isr: 2
      transaction.state.log.replication.factor: 3
    listeners:
      - name: external
        port: 9094
        tls: false
        type: nodeport
        configuration:
          brokers:
            - broker: 0
              advertisedHost: example.com
              advertisedPort: 5002
            - broker: 1
              advertisedHost: example.com
              advertisedPort: 5003
            - broker: 2
              advertisedHost: example.com
              advertisedPort: 5004
    replicas: 4
    resources:
      requests:
        memory: 500Mi
        cpu: 400m
      limits:
        memory: 800Mi
        cpu: 600m
    storage:
      type: jbod
      volumes:
        - deleteClaim: false
          id: 0
          size: 5Gi
          type: persistent-claim
    version: 3.0.0
  zookeeper:
    replicas: 3
    resources:
      requests:
        memory: 200Mi
        cpu: 200m
      limits:
        memory: 400Mi
        cpu: 400m
    storage:
      deleteClaim: false
      size: 1Gi
      type: persistent-claim

9. Verify that all the pods in the kafka namespace are running:

kubectl get all -n kafka
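Because the manifest above enables the Topic Operator, topics can also be created declaratively instead of through the CLI. A minimal sketch, assuming the cluster name kafka from the manifest and a hypothetical topic and file name (my-topic, kafka-topic.yaml):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic
  labels:
    strimzi.io/cluster: kafka
spec:
  partitions: 3
  replicas: 3

kubectl apply -f kafka-topic.yaml -n kafka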

Connecting to Kafka

Producer connection:

1. Execute the following command to create a producer and send text messages:

kubectl run -n <namespace> kafka-producer -ti --image=quay.io/strimzi/kafka:0.27.0-kafka-2.8.0 --rm=true --restart=Never -- bin/kafka-console-producer.sh --bootstrap-server kafka-kafka-external-bootstrap.<namespace>.svc.cluster.local:9094 --topic <topic-name>

Consumer connection: 

1. In another terminal, run the command below to create a consumer that will receive messages from the topic:

kubectl run -n <namespace> kafka-consumer -ti --image=quay.io/strimzi/kafka:0.27.0-kafka-2.8.0 --rm=true --restart=Never -- bin/kafka-console-consumer.sh --bootstrap-server kafka-kafka-external-bootstrap.<namespace>.svc.cluster.local:9094 --topic <topic-name> --from-beginning

Finally, we get live streaming messages: whenever the producer sends a message, the consumer receives it at the same time.