If you want to ingest huge quantities of data from an outside feed, Apache Kafka's a great choice. But you won't get optimum performance out of the box. So let's write some code, hook up to a third-party data source, and take a look at performance-tuning it.
For this video, the data source we're looking at is the GitHub firehose: a real-time stream of all the public events going through GitHub's API. It won't stress Kafka too much, but it will easily put pressure on the default configuration.
We'll walk through the Python code needed to connect GitHub's Server-Sent Events (SSE) feed to a Kafka topic, and then look at the most important parameters to tune, what they actually mean, and how to choose some sensible values. Most importantly, we'll look at how to measure and understand the results, because with the right understanding and the right values, you can easily improve your Kafka producers' performance by a couple of orders of magnitude.
These techniques apply whether you're using our Quix Streams library, the Confluent Kafka Python library, or any Kafka library that's built on top of librdkafka.
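As a rough illustration of the three settings the video tunes, here is a minimal sketch of a producer configuration. The property names (linger.ms, batch.size, compression.type) are standard librdkafka configuration keys. The helper name tuned_producer_config, the localhost:9092 broker address, and every value shown are assumptions for illustration, not figures taken from the video.

```python
# Sketch only: the keys are standard librdkafka properties, but the helper
# name, broker address, and values are illustrative assumptions.

VALID_COMPRESSION = {"none", "gzip", "snappy", "lz4", "zstd"}


def tuned_producer_config(
    bootstrap_servers: str = "localhost:9092",  # assumed local broker
    linger_ms: int = 500,
    batch_size: int = 1_000_000,
    compression_type: str = "gzip",
) -> dict:
    """Build a librdkafka-style producer config dict (hypothetical helper)."""
    if compression_type not in VALID_COMPRESSION:
        raise ValueError(f"unknown compression type: {compression_type}")
    return {
        "bootstrap.servers": bootstrap_servers,
        # How long the producer waits for more messages before sending a batch.
        "linger.ms": linger_ms,
        # Upper bound, in bytes, on the size of a single batch.
        "batch.size": batch_size,
        # Codec applied to each batch before it is sent to the broker.
        "compression.type": compression_type,
    }


config = tuned_producer_config()
print(config)
```

Because the dict uses plain librdkafka keys, it should work with any librdkafka-based client. With the Confluent library it could be passed straight to confluent_kafka.Producer(config). With Quix Streams, the tuning keys would most likely go in the Application's producer_extra_config argument, but check the docs for your installed version.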
--
Source Code:
https://github.com/quixio/simple-kafka-python/tree/main/github-firehose
The GitHub Firehose:
https://github-firehose.libraries.io/
Quix Streams Docs:
https://quix.io/docs/quix-streams/introduction.html
Requests SSE on PyPI:
https://pypi.org/project/requests-sse/
Quix Streams on PyPI:
https://pypi.org/project/quixstreams/
Quix CLI:
https://github.com/quixio/quix-cli
--
0:00 Intro
1:06 Project Setup
2:59 Reading GitHub Events With Python
5:09 Improving Python's JSON Output
6:06 Ignoring KeyboardInterrupt
6:33 Understanding the GitHub Event Structure
7:07 The Easy Way to Set Up a Local Kafka Instance
8:05 Connecting Python to Kafka
10:44 Verifying Kafka Data with kcat
11:13 Measuring Kafka Producer Performance
14:05 Kafka's Batching Strategy
15:20 Low-Level Producer Debugging
16:13 Improving Batching with linger.ms
19:23 Tuning Batching with batch.size
21:49 Enabling Compression
24:14 Choosing the Compression Algorithm
26:06 Summarising Kafka Producer Tuning
26:47 Outro