
Kafka Messages and Data Consistency



With its scalability and fault-tolerance features, Kafka has become increasingly popular in large-scale, real-time enterprise applications. Kafka messages are published to partitions, which are usually located on different nodes and consumed by multiple consumers, each of which reads messages from a single partition. Multiple partitions and consumers raise a data consistency issue. For example, if a security in a trading system is modified twice within a very short time, the two messages could be published to two different partitions. They would then be processed by two different consumers, and there is no guarantee that the last message is the one that ends up in your application or your data storage. How can this issue be resolved?


Kafka Key With Single-Threaded Consumer

A Kafka message is published with a key and a payload. Messages with the same key are published to the same partition and are therefore consumed by the same consumer. In this case, a single-threaded consumer is guaranteed to process the earlier message before the later one. The important point is that the consumer must be single-threaded; otherwise the earlier message could be persisted in your data storage after the later one. How you design the message key also matters, because a poor key can create a hot partition (one partition receives lots of messages while the other partitions get very few) in a Kafka topic. Suppose your system publishes messages of user information; a possible key could be the user id plus the email. This helps distribute your messages more evenly among Kafka partitions.
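As an illustration, key-to-partition assignment can be sketched as a hash of the key modulo the partition count. This is a simplification (the real Java client uses a murmur2 hash), and the partition count and key format below are hypothetical:

```python
import hashlib

NUM_PARTITIONS = 6  # hypothetical partition count for the topic

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for Kafka's default partitioner: hash the key bytes
    # and take the result modulo the number of partitions.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# A composite key (user id plus email) spreads users across partitions,
# while both updates for the same user still land on the same partition.
key = "42:user42@example.com"
first = partition_for(key)   # partition chosen for the first update
second = partition_for(key)  # partition chosen for the second update
assert first == second
```

Because both updates map to the same partition, a single consumer sees them in publish order.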


Kafka Key, Transaction ID/Event Time and Multi-Threaded Consumer

This key-based, single-threaded solution works well for small and medium-sized applications, but it does not scale to a large application, especially at peak time. In an application that must handle a large number of messages, the consumers have to be multi-threaded (usually through a thread pool). With a multi-threaded consumer, there is still a chance that the first message is persisted in your database after the second one. This can be solved with the key plus a transaction id, such as a version id or a message timestamp. The message timestamp here is the time at which the message is generated, not the time at which it is processed.
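The idea can be sketched in memory before moving to SQL. In this sketch (all names hypothetical), worker threads apply a user message only when its version id is newer than the stored one, so out-of-order processing cannot overwrite the latest state:

```python
import threading

store = {}                # userId -> (userName, versionId)
lock = threading.Lock()   # stands in for the database's atomicity

def upsert_user(user_id: int, user_name: str, version_id: int) -> None:
    # Persist only if this message is newer than what is stored, so a
    # stale message processed late by another thread is simply ignored.
    with lock:
        current = store.get(user_id)
        if current is None or current[1] < version_id:
            store[user_id] = (user_name, version_id)

# Simulate the second message being processed before the first.
upsert_user(7, "new name", 2)
upsert_user(7, "old name", 1)  # late arrival, must not win
assert store[7] == ("new name", 2)
```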

For example, a user message is composed of a payload (userId, userName, versionId), and the Kafka message key is userId. When the user modifies his info twice in a very short time, two messages are written to the Kafka log with versionId 1 and versionId 2. Suppose the user table is defined as follows,

      User( userId int, userName char(64), versionId int).

When the multi-threaded consumer persists the message to the table, the update can be written as a conditional update as follows (note that versionId must be set as well, so the stored version advances),

    Update User set userName = #userName#, versionId = #versionId# where userId = #userId# and versionId < #versionId#

Since the SQL update is an atomic operation, data consistency is guaranteed, with versionId 2 ending up in the data storage.
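The conditional update can be tried end to end with an in-memory SQLite database (the table and values are the ones from the example above; the `#userId#`-style placeholders become `?` parameters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE User (userId INTEGER PRIMARY KEY, userName TEXT, versionId INTEGER)"
)
conn.execute("INSERT INTO User VALUES (1, 'original name', 0)")

def apply_message(user_id: int, user_name: str, version_id: int) -> None:
    # A stale message (versionId not greater than the stored one)
    # matches no row, so the UPDATE is a no-op.
    conn.execute(
        "UPDATE User SET userName = ?, versionId = ? "
        "WHERE userId = ? AND versionId < ?",
        (user_name, version_id, user_id, version_id),
    )

apply_message(1, "new name", 2)    # newer message wins
apply_message(1, "stale name", 1)  # older message arrives late: no-op
row = conn.execute(
    "SELECT userName, versionId FROM User WHERE userId = 1"
).fetchone()
assert row == ("new name", 2)
```

Whichever order the two messages are processed in, the row ends up with the data from versionId 2.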

The same result can be achieved with event time. In this case, the table looks like

User( userId int, userName char(64), lastUpdatedTime datetime)

And the SQL statement looks like

Update User set userName = #userName#, lastUpdatedTime = #lastUpdatedTime# where userId = #userId# and lastUpdatedTime < #lastUpdatedTime#

Kafka Key, Compaction, and Data Consistency in Data Recovery

A Kafka topic can be configured with compaction. When two or more messages with the same key are written to the same partition of a compacted topic, the earlier messages are eventually deleted and only the latest message per key is retained. This is very useful for data recovery in case your Kafka consumers crash or the data consumed from Kafka is corrupted. Because the earlier messages have been removed from the Kafka log, there is no data consistency concern when re-consuming the messages from the compacted topic.
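Compaction is enabled per topic through the cleanup.policy setting. For example, with the stock Kafka CLI (the topic name, partition count, replication factor, and broker address below are placeholders):

```shell
# Create a compacted topic: Kafka retains at least the latest
# message for each key and eventually deletes older ones.
kafka-topics.sh --create \
  --topic user-topic \
  --partitions 6 \
  --replication-factor 3 \
  --config cleanup.policy=compact \
  --bootstrap-server localhost:9092
```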

Conclusion

Maintaining data consistency is a very common issue in applications that involve message production and consumption. Kafka's keys and log compaction, combined with a version id or event time checked on the consumer side, help solve the issue.
