ops

Ingesting log files to sqlite

Posted on Sep 24, 2022

I was recently looking into how to analyze multiple related log files (e.g. application logs from multiple instances), and found that sqlite may be enough :) The first step is ingesting the content of the log files into sqlite tables. sqlite-utils to the rescue! I was initially happy with having each line as a row and adding full-text support to the log column to query events. However, a Java log entry may span multiple lines, so the output may not be ideal: the timestamp could be on one line, and the root cause of the stack trace on another.
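As a rough illustration of the multi-line problem, here is a minimal sketch that merges continuation lines (such as stack traces) into the preceding event before inserting rows. It uses Python's standard `sqlite3` module rather than sqlite-utils, and it assumes that every new event starts with a `YYYY-MM-DD HH:MM:SS` timestamp. The `logs` table, its columns, and the function names are hypothetical.

```python
import re
import sqlite3

# Assumption: each new event starts with a timestamp like "2022-09-24 10:15:00".
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def group_events(lines):
    """Merge continuation lines (e.g. stack traces) into the preceding event."""
    events = []
    for line in lines:
        line = line.rstrip("\n")
        if EVENT_START.match(line) or not events:
            # A timestamp (or the very first line) opens a new event.
            events.append(line)
        else:
            # Anything else is appended to the event that is currently open.
            events[-1] += "\n" + line
    return events


def ingest(conn, source, lines):
    """Store one row per event; table and column names are hypothetical."""
    conn.execute("CREATE TABLE IF NOT EXISTS logs (source TEXT, event TEXT)")
    events = group_events(lines)
    conn.executemany(
        "INSERT INTO logs (source, event) VALUES (?, ?)",
        [(source, event) for event in events],
    )
    conn.commit()
    return len(events)
```

With this shape, a query such as `SELECT event FROM logs WHERE event LIKE '%IllegalStateException%'` returns the timestamp line together with its stack trace. A different timestamp format would need a different `EVENT_START` pattern.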

til sqlite datasette ops troubleshooting

Getting started with Kafka quotas

Posted on May 11, 2022

Kafka quotas have been around since the early versions of the project, though they are not enabled in most deployments that I have seen. This post shares some thoughts on how to start adopting quotas, along with some practical advice and a bit of the history of quotas in the Kafka project.

kafka ops

Enable Certificate Revocation on Kafka clusters

Posted on Feb 9, 2022

Recently I got a question on how to manage revoked SSL certificates in Kafka clusters. With a proper Public Key Infrastructure, a Certificate Revocation List (CRL) can be made available so clients can check whether a certificate is still valid, regardless of its time-to-live. For instance, if a private key has been compromised, the certificate can be revoked before its expiry date.

til kafka ssl ops security

reload4j as drop-in replacement for log4j 1.x

Posted on Jan 25, 2022

TIL there is a drop-in replacement for log4j 1.x: Reload4j.

til java logging ops

Ansible has a debug mode to pause and troubleshoot

Posted on Jan 21, 2022

I have been running Ansible for a while now. My usual, naive way of debugging has always been to add a debug module[1] and run the execution until that point. I found that there are better ways to deal with this[2]. With debug mode enabled, tasks stop when they fail (by default) and you can inspect the task, variables, and context at the point of failure. Even better, you can re-execute the task if the error was transient.

til ansible automation ops

Changing Kafka Broker's rack

Posted on Dec 10, 2021

Kafka broker configuration includes a rack label to define the location of the broker. This is useful when placing replicas across the cluster to ensure replicas are spread across locations as evenly as possible.

til kafka deployment ops

Kafka data loss scenarios

Posted on Dec 10, 2021

Kafka topic partitions are replicated across brokers. Data loss happens when the brokers where replicas are located are unavailable or have fully failed. The worst scenario, and one where there is not much to do, is when all the brokers fail; then no remediation is possible. Replication increases redundancy so that this scenario is less likely to happen. The following scenarios show different trade-offs that could increase the risk of losing data:

kafka durability ops

Reducing `acks` doesn't help to reduce end-to-end latency

Posted on Dec 9, 2021

Kafka Producers enforce durability across replicas by setting acks=all (the default since v3.0). Enforcing this guarantee requires waiting for replicas to sync, which increases latency, so reducing acks tends to give the impression that latency gets reduced overall.

til kafka latency ops

Use min.insync.replicas for fault-tolerance

Posted on Dec 2, 2021

Things to remember: Topic replication factor alone is not enough to guarantee fault-tolerance. If min.insync.replicas is not set (i.e. it keeps the default of 1), data could potentially be lost. acks=all forces the partition leader to wait for all replicas in the ISR, not only min.insync.replicas of them. If the number of available replicas equals min.insync.replicas, the topic partitions are at the edge of losing availability: if one more broker becomes unavailable (e.g. restarting), producers using acks=all will fail to write data. Topic configuration is inherited from the broker, so a change to the broker configuration affects existing topics. For easier maintenance, keep the broker defaults on topics unless a topic needs a different value.
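The availability rule can be sketched as a small model. This is a deliberate simplification that only captures the relationship between ISR size, min.insync.replicas, and acks; it is not Kafka's actual implementation, and the function names are hypothetical.

```python
def can_accept_write(isr_size, min_insync_replicas, acks="all"):
    """Simplified model of whether a partition accepts a produce request.

    With acks=all, the write is rejected when the ISR has fewer members
    than min.insync.replicas. With other acks settings, this model only
    requires the leader (an ISR of at least 1) to be available.
    """
    if acks != "all":
        return isr_size >= 1
    return isr_size >= min_insync_replicas


def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Replicas that can be lost before acks=all writes start failing."""
    return max(replication_factor - min_insync_replicas, 0)


# Example: replication factor 3 with min.insync.replicas=2
# tolerates one unavailable replica for acks=all producers.
print(tolerated_broker_failures(3, 2))
print(can_accept_write(isr_size=2, min_insync_replicas=2))
print(can_accept_write(isr_size=1, min_insync_replicas=2))
```

Under this model, setting min.insync.replicas equal to the replication factor leaves no room for a broker restart, while leaving it at 1 lets a write succeed with only the leader in the ISR.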

til kafka fault-tolerance ops

@jeqo
