Explore Kafka data with kcat, sqlite, and Datasette

til datasette sqlite kcat kafka dev data

I have been playing with Datasette and sqlite for a while, trying to collect and expose data efficiently for others to analyze. Recently I started finding use-cases for getting data out of Apache Kafka and exposing it quickly for analysis. Why not use Datasette?

Today I learned that kcat is able to consume a set of messages based on timestamps. This means we can ask: “get me all the data produced until now”:

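# -C consumes the topic; -o e@<ts> stops at the given epoch-millisecond
# timestamp ("date +%s000" appends 000 to epoch seconds to get milliseconds)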
kafkacat -b ${CCLOUD_BOOTSTRAP_SERVER} \
  -C -t confluent-audit-log-events \
  -o e@$(date +%s000)
...
% Reached stop timestamp for topic confluent-audit-log-events [0] at offset 80957
% Reached stop timestamp for topic confluent-audit-log-events [1] at offset 80407
% Reached stop timestamp for topic confluent-audit-log-events [2] at offset 80810: exiting
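
kcat also takes a start timestamp via -o s@<ts>, so both ends of the window can be pinned. For example, a sketch that consumes only the last hour (assuming GNU date, whose -d flag handles relative times):

kafkacat -b ${CCLOUD_BOOTSTRAP_SERVER} \
  -C -t confluent-audit-log-events \
  -o s@$(date -d '1 hour ago' +%s000) \
  -o e@$(date +%s000)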

Datasette has an awesome companion tool, sqlite-utils, to turn JSON into a sqlite table: https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-newline-delimited-json. The JSON output from kcat can be piped into sqlite-utils to produce a sqlite database, with the pipeline completing when the end timestamp is reached:

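# Same consumer as above, now piping one JSON message per line into
# sqlite-utils: "-" reads from stdin, --nl expects newline-delimited JSON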
kafkacat -b ${CCLOUD_BOOTSTRAP_SERVER} \
  -C -t confluent-audit-log-events \
  -o e@$(date +%s000) | \
  sqlite-utils insert auditlog-v1.db audit-log - --nl
% Reached stop timestamp for topic confluent-audit-log-events [0] at offset 80957
% Reached stop timestamp for topic confluent-audit-log-events [1] at offset 80407
% Reached stop timestamp for topic confluent-audit-log-events [2] at offset 80810: exiting
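
Before serving the database, it is worth a quick sanity check of what landed in it; both subcommands are part of sqlite-utils:

sqlite-utils tables auditlog-v1.db --counts
sqlite-utils schema auditlog-v1.db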

Once the data is available in your sqlite database file, use datasette to explore it with SQL:

datasette auditlog-v1.db
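
datasette can also open the browser for you and serve on a specific port, via its -o/--open and -p/--port options:

datasette auditlog-v1.db -o -p 8002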

sqlite-utils also has a bunch of commands, like transform, extract, and create-index, to optimize and tweak the table structure and simplify querying.
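
A minimal sketch of that kind of clean-up; the column names here (id, time) are hypothetical and depend on your actual payload:

# Rename a hypothetical "id" column, then index the "time" column
sqlite-utils transform auditlog-v1.db audit-log --rename id event_id
sqlite-utils create-index auditlog-v1.db audit-log time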

And remember that sqlite already supports JSON functions and path expressions to query nested fields: https://www.sqlite.org/json1.html#jptr
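
For instance, a minimal sketch counting events per nested field, straight from the sqlite-utils CLI (the data column and the $.methodName path are assumptions about the audit-log payload; adjust them to yours):

# Group events by a nested JSON field and print the result as a table
sqlite-utils auditlog-v1.db \
  "select json_extract(data, '$.methodName') as method, count(*) as events
   from [audit-log] group by method order by events desc" -t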