notes

2024

Impressions from Kafka Summit London 2024

I have been lucky to attend Kafka Summit London this year (thanks, Aiven!) and wanted to share some notes on topics that caught my attention from the sessions I was able to attend:
Read more
kafka-summit

2022

On Anarchism book

On Anarchism by Noam Chomsky
Read more
ideas chomsky

Ingesting log files to sqlite

I was recently looking into how to analyze multiple related log files (e.g. application logs from multiple instances), and found that sqlite may be enough :) The first step is ingesting content from log files into sqlite tables. sqlite-utils to the rescue! I was initially happy with having each line as a row and adding full-text search support on the log column to query events. However, a Java log may span multiple lines, and the output may not be ideal — the timestamp could be in one line, and the stack trace root cause in another one.
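The per-line ingestion idea can be sketched with just the stdlib sqlite3 module (the post uses sqlite-utils; the sample log lines and table names below are made up):

```python
import sqlite3

# Made-up sample log lines from two application instances.
logs = {
    "app-1.log": [
        "2024-01-01T10:00:00 INFO Starting",
        "2024-01-01T10:00:05 ERROR Connection refused",
    ],
    "app-2.log": ["2024-01-01T10:00:01 INFO Starting"],
}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lines (source TEXT, line TEXT)")
# External-content FTS5 index over the log text, to query events.
db.execute("CREATE VIRTUAL TABLE lines_fts USING fts5(line, content='lines')")
for source, content in logs.items():
    db.executemany(
        "INSERT INTO lines (source, line) VALUES (?, ?)",
        [(source, line) for line in content],
    )
db.execute("INSERT INTO lines_fts (rowid, line) SELECT rowid, line FROM lines")

errors = [r[0] for r in db.execute(
    "SELECT line FROM lines_fts WHERE lines_fts MATCH 'ERROR'")]
```

As the teaser notes, this one-row-per-line model breaks down for multi-line Java stack traces, which need to be stitched into a single event before insertion.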
Read more
til sqlite datasette ops troubleshooting

Piggyback on Kafka Connect Schemas to process Kafka records in a generic way

When reading from/writing to Kafka topics, a serializer/deserializer (a.k.a. SerDes) is needed to process record key and value bytes. Typically, specific SerDes turn bytes into concrete objects (e.g. POJOs), unless a generic JSON object or Avro structure is used. Kafka Connect has to deal with generic structures to apply message transformations and to convert messages from external sources into Kafka records and vice versa. It has a SchemaAndValue composite type that includes a Connect Schema type — derived from Schema Registry or from a JSON Schema included in the payload — and a value object.
Read more
til kafka connect dev

sqlite can be used as a document and graph database

I found that the use-cases for sqlite keep increasing now that JSON is supported. This week I found the following presentation: https://www.hytradboi.com/2022/simple-graph-sqlite-as-probably-the-only-graph-database-youll-ever-need which makes the case for a simple graph schema, using SQL's out-of-the-box functionality to store graphs and execute traversal queries. That repository is in turn based on this post about sqlite's JSON support and document databases: https://dgl.cx/2020/06/sqlite-json-support
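The core idea — JSON bodies in a nodes table, an edges table, and a recursive CTE for traversal — can be sketched like this (a minimal sketch, not the simple-graph schema itself; node names are made up):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# Nodes carry arbitrary JSON bodies; edges are plain (source, target) pairs.
db.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, body TEXT)")
db.execute("CREATE TABLE edges (source TEXT, target TEXT)")
for node_id, body in [("a", {"name": "root"}),
                      ("b", {"name": "middle"}),
                      ("c", {"name": "leaf"})]:
    db.execute("INSERT INTO nodes VALUES (?, ?)", (node_id, json.dumps(body)))
db.executemany("INSERT INTO edges VALUES (?, ?)", [("a", "b"), ("b", "c")])

# Graph traversal in plain SQL: all nodes reachable from 'a'.
reachable = {row[0] for row in db.execute("""
    WITH RECURSIVE walk(id) AS (
        SELECT 'a'
        UNION
        SELECT e.target FROM edges e JOIN walk w ON e.source = w.id
    )
    SELECT id FROM walk
""")}
```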
til sqlite database

Explore Kafka data with kcat, sqlite, and Datasette

I have been playing with Datasette and sqlite for a bit, trying to collect and expose data efficiently for others to analyze. Recently I started finding use-cases to get data from Apache Kafka and expose it quickly for analysis. Why not use Datasette?
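The pipeline can be sketched as: capture messages with kcat's JSON envelope output, then load them into a sqlite table for Datasette. The envelope field names below are assumptions about kcat's output format — check your kcat version:

```python
import json
import sqlite3

# Hypothetical capture of kcat's JSON-envelope output (one message per
# line), e.g. from something like `kcat -C -t events -J`.
captured = [
    '{"topic": "events", "partition": 0, "offset": 0, "payload": "{\\"id\\": 1}"}',
    '{"topic": "events", "partition": 0, "offset": 1, "payload": "{\\"id\\": 2}"}',
]

db = sqlite3.connect(":memory:")  # use a file path to serve it with Datasette
# "partition" and "offset" are SQL keywords, hence the quoting.
db.execute('CREATE TABLE records (topic TEXT, "partition" INTEGER, '
           '"offset" INTEGER, payload TEXT)')
db.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [(m["topic"], m["partition"], m["offset"], m["payload"])
     for m in map(json.loads, captured)],
)
count = db.execute("SELECT count(*) FROM records").fetchone()[0]
```

With the data in a file-backed database, `datasette kafka.db` (path is an example) exposes it for browsing and SQL queries.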
Read more
til datasette sqlite kcat kafka dev data

Releasing OS-specific GraalVM native image binaries easier with JReleaser

Packaging and releasing Java applications (e.g. CLIs) tends to be cumbersome, and the user experience is often not the best, as users have to download a valid version of the JRE, etc.
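A rough sketch of what a JReleaser configuration for OS-specific native-image binaries can look like (project name, owner, and artifact paths are made up; check the JReleaser docs for the exact schema of your version):

```yaml
# jreleaser.yml (sketch)
project:
  name: mycli
  version: 1.0.0

release:
  github:
    owner: someuser

distributions:
  mycli:
    type: NATIVE_IMAGE
    artifacts:
      - path: "dist/mycli-{{projectVersion}}-linux-x86_64.zip"
        platform: linux-x86_64
      - path: "dist/mycli-{{projectVersion}}-osx-x86_64.zip"
        platform: osx-x86_64
```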
Read more
til java graalvm dev

Changing void returning type in Java methods breaks binary compatibility

While proposing changes to the Kafka Streams DSL, I proposed changing the return type of one method from void to KStream<KOut, VOut>. I was under the (wrong) impression that this change wouldn't affect users. I was also not considering that applications might just drop in a new version of the library without recompiling.
Read more
til java compatibility dev

Enable Certificate Revocation on Kafka clusters

Recently I got a question on how to manage revoked SSL certificates in Kafka clusters. With a proper Public Key Infrastructure, a Certificate Revocation List (CRL) can be made available for clients to check whether a certificate is still valid, regardless of its time-to-live. For instance, if a private key has been compromised, the certificate can be revoked before its expiration date.
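A sketch of the moving parts involved (paths are placeholders; revocation checking is a JVM-level setting, not a server.properties one — flag names are standard JSSE properties, but verify against your JVM version):

```properties
# server.properties (sketch): mutual TLS so client certs get validated
listeners=SSL://:9093
ssl.keystore.location=/etc/kafka/broker.keystore.jks
ssl.truststore.location=/etc/kafka/broker.truststore.jks
ssl.client.auth=required

# Revocation checking is enabled on the broker JVM, e.g.:
# export KAFKA_OPTS="-Dcom.sun.net.ssl.checkRevocation=true \
#                    -Dcom.sun.security.enableCRLDP=true"
```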
Read more
til kafka ssl ops security

The Value of Everything book

The Value of Everything by Mariana Mazzucato
Read more
ideas

Kafka Streams FK-join within the same KTable

KTable-to-KTable foreign-key joins are one of the coolest features in Kafka Streams. I was wondering whether this feature would handle FK-joins between values on the same table.
Read more
til kafka-streams streaming-joins dev

Kafka Streams abstracts access to multiple tasks state stores when reading

Kafka Streams applications can scale either horizontally (add more instances) or vertically (add more threads). When scaled vertically, multiple tasks store multiple partitions locally. An interesting question is whether Kafka Streams gives access to these stores when reading (i.e. Interactive Queries), and how it manages to abstract the access to different stores managed by multiple tasks.
Read more
til kafka-streams api dev

reload4j as drop-in replacement for log4j 1.x

TIL there is a drop-in replacement for log4j 1.x: Reload4j.
Read more
til java logging ops

Ansible has a debug mode to pause and troubleshoot

I have been running Ansible for a while now. My usual/naive way of debugging has always been adding a debug module[1] and getting the execution to run until that point. I figured out that there are better ways to deal with this[2]. By using the debug mode, tasks will stop when failing (by default) and you'll be able to inspect the task, variables, and context at the point of failure. Even better, you'll be able to re-execute the task if there was a transient error.
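The debugger can be enabled per play with the `debugger` keyword (hosts and the task below are made-up examples):

```yaml
# Sketch: drop into the task debugger when a task fails.
- hosts: app
  debugger: on_failed   # or ANSIBLE_ENABLE_TASK_DEBUGGER=True / strategy: debug
  tasks:
    - name: flaky step
      command: /usr/local/bin/deploy
```

At the debugger prompt, `p task.args` prints the failing task's arguments, variables can be edited in place, and `redo` re-runs the task — handy for transient errors — while `continue` and `quit` resume or abort the run.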
Read more
til ansible automation ops

2021

Changing Kafka Broker's rack

Kafka broker configuration includes a rack label to define the location of the broker. This is useful when placing replicas across the cluster to ensure replicas are spread across locations as evenly as possible.
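For instance, one broker per availability zone could be labeled like this (zone names are examples):

```properties
# server.properties for a broker in zone "a"
broker.id=1
broker.rack=eu-west-1a
```

With racks defined, new partitions get their replicas spread across the distinct rack values rather than across arbitrary brokers.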
Read more
til kafka deployment ops

Reducing `acks` doesn't help to reduce end-to-end latency

Kafka Producers enforce durability across replicas by setting acks=all (the default since v3.0). As enforcing this guarantee requires waiting for replicas to sync, it increases produce latency; and reducing acks tends to give the impression that latency gets reduced overall.
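A minimal producer-config sketch of the trade-off; the key point is that consumers only read records once they are replicated (committed), so a faster produce acknowledgment does not move end-to-end latency:

```properties
# Producer settings (sketch)
acks=all   # wait for the in-sync replicas: durable, slower produce ack
# acks=1   # leader-only ack: faster produce ack, but consumers still only
#          # see records after replication, so end-to-end latency is the same
```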
Read more
til kafka latency ops

Kafka Producer idempotency is enabled by default since 3.0

Since Apache Kafka 3.0, Producers come with enable.idempotence=true, which leads to acks=all along with other changes enforced by idempotency. This means that, by default, Producers are balanced between latency (no batching delay) and durability — different from previous versions, where the main goal was to reduce latency, even at the risk of durability with acks=1.
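The settings implied (and required) by idempotence look roughly like this (a sketch of the 3.0 defaults; double-check against your client version's docs):

```properties
enable.idempotence=true                    # default since Apache Kafka 3.0
acks=all                                   # required by idempotence
retries=2147483647                         # retry safely: no duplicates
max.in.flight.requests.per.connection=5    # must be <= 5 with idempotence
```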
til kafka latency performance-tuning

The Age Of Surveillance Capitalism book

The Age Of Surveillance Capitalism by Shoshana Zuboff
Read more
ideas

Scale book

Scale: The Universal Laws of Growth, Innovation, Sustainability, and the Pace of Life in Organisms, Cities, Economies, and Companies by Geoffrey B. West
Read more
ideas

Cioran books

Including pictures from books: The Trouble With Being Born and A Short History Of Decay
Read more
cioran philosophy

Use min.insync.replicas for fault-tolerance

Things to remember:
- Topic replication factor is not enough to guarantee fault-tolerance. If min.insync.replicas is not defined (i.e. it defaults to 1), then data could potentially be lost.
- acks=all forces the replica leader to wait for all brokers in the ISR, not only min.insync.replicas.
- If the available replicas equal the minimum ISR, then the topic partitions are at the edge of losing availability: if one broker becomes unavailable (e.g. while restarting), producers will fail to write data.
- Topic configuration is inherited from the server: if the broker configuration changes, it affects existing topics. For easier maintenance, keep the topic defaults unless they need to differ from the broker defaults.
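A common fault-tolerant combination, as broker defaults inherited by new topics (values are the usual example, not a universal recommendation):

```properties
# Broker defaults (sketch)
default.replication.factor=3
min.insync.replicas=2   # with producer acks=all: writes survive one broker down
```

With 3 replicas and a minimum ISR of 2, one broker can be restarted without losing either availability or acknowledged writes; losing a second replica stops producers instead of silently dropping durability.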
Read more
til kafka fault-tolerance ops

About TILs

Today I learned about Today-I-Learned posts from Simon Willison: https://til.simonwillison.net/ and found it super cool, so I decided to try it out; let's see how it goes.
til

2019

Notes on Co-evolving Tracing and Fault Injection with Box of Pain

This paper explores how related tracing and fault injection systems are, and whether they should be part of the same thing. "The space of possible executions of a distributed system is exponential in the number of communicating processes and the number of messages […] some of the most pernicious bugs in distributed programs involve mistakes on how programs handle partial failure of remote components." In order to expose these failures, fault injection mechanisms are used to cause network partitions or machine crashes.
Read more
papers distributed systems tracing fault injection peter alvaro daniel bittman ethan l miller

2018

Notes on Kafka, Samza and the Unix Philosophy of Distributed Data

From Batch to Streaming workflows. Key properties for large-scale systems: "[Large-Scale Personalized Services] should have the following properties: system scalability, organizational scalability, operational robustness." Batch jobs have been successfully used, and represent a reference model to improve from: "[Batch Map-Reduce jobs have] been a remarkably successful tool for implementing recommendation systems." Important batch benefits: multi-consumer (several jobs reading input directories without affecting each other) and visibility (a job's input and output can be inspected for tracking down the cause of an error).
Read more
notes kafka samza

Data on the Outside vs Data on the Inside

I found this paper as relevant and accurate today as it was in 2005, when it was published. It is fascinating how, even 12 years later and with new technologies in vogue, the same concepts keep applying.
Read more
papers distributed systems microservices pat helland transactions