Kafka’s concept of the high watermark.
The high watermark is an offset representing a record in a partition.
Most simply said - the high watermark is the newest offset of the SLOWEST in-sync follower replica for your partition. 🐢👈
It’s an important concept that denotes how quickly a new message will be available to your consumer clients.
A consumer can never read messages newer than the high watermark, because otherwise it would read unreplicated messages and risk ghost reads.
In other words - the HWM is the lowest offset that is replicated across all N in-sync replicas.
Since it’s only out of the *in-sync* replicas, the high watermark cannot be millions of records behind.
✅ in-sync replica reminder:
A follower gets kicked out of the in-sync replica set if it hasn’t consumed up to the log end offset of the leader in the last replica_lag_time_max_ms seconds. (30 seconds by default)
Or, more strictly, if it loses zookeeper connection for zookeeper_session_timeout_ms. (18 seconds by default)
So, it is expected that the follower reads the latest offset pretty much all the time.
But. It still may not be instant to catch up, especially in cases where the cluster's throughput limit is pushed.
In performance critical cases, the high watermark catch up may add milliseconds to your e2e latency that you'd want to optimize.
An example config you could tweak to minimize the log end offset and high watermark delta could be increasing the number of replica fetcher threads.
A subtle, but important note is that the high watermark gets bumped only on the second fetch request by the follower. 💡
i.e if in request 1 the follower reads offset X, it’s only on receipt of request 2 that the leader bumps the high watermark to X.
Why?
Because that’s the only way for certain to know that the follower received the response from 1 and persisted it successfully.
The leader needs to see the subsequent fetch request requesting records from offset X+.
That’s more or less it!
You just learned an important, core Kafka concept in under 2 minutes!