Ingesting log files to sqlite

Tags: til, sqlite, datasette, ops, troubleshooting

I was recently looking into how to analyze multiple related log files (e.g. application logs from multiple instances), and found that SQLite may be enough :)

The first step is ingesting content from log files into SQLite tables. sqlite-utils to the rescue! I was initially happy with having each line as a row and adding full-text search support to the log column to query events. However, a Java log entry may span multiple lines, so the results are not ideal: the timestamp could be on one line and the stack trace root cause on another.
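For reference, that first line-per-row approach was roughly the following (the table name is just an example; --lines inserts each line of the file as its own row, in a line column):

sqlite-utils insert /tmp/kafka-logs.db loglines server.log.2022-09-24-21 --lines
sqlite-utils enable-fts /tmp/kafka-logs.db loglines line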

I found (thanks @simonw!) that sqlite-utils supports adding “convert” functions when inserting data from a file into SQLite, which makes it possible to apply custom parsing to either individual lines or the whole text file: https://sqlite-utils.datasette.io/en/stable/cli.html#applying-conversions-while-inserting-data

The next challenge was: how to parse the log file content with regular expressions? After some trial and error, I ran into the problem of matching an expression only if the next line starts (or does not start) with some character, but without “consuming” that character/line. That is how I got to learn about regex negative (and positive) lookahead (and lookbehind!) expressions, which do just that!

From regex101.com:

(?!...): Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.
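A tiny Python sketch (with a made-up log fragment) of what that means in practice: the lookahead checks the start of the next line without consuming it, so stack-trace continuation lines stay attached to their entry while the “[” remains available for the next match.

import re

fragment = "ERROR Failed to send\n\tat kafka.example.Foo.bar(Foo.java:42)\n[2022-09-24 21:00:01,456] INFO ok"
# (?!\[) asserts the next line does not start with '[' but consumes nothing.
print(re.findall(r'\n(?!\[).+', fragment))
# ['\n\tat kafka.example.Foo.bar(Foo.java:42)']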

So, I ended up with the following expression:

(?m)^\[(?P<datetime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] (?P<level>\w+) (?P<log>(.+$(\n(?!\[).+|)+))

Explanation: https://regex101.com/r/2AxwfE/1
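To make it concrete, here is a small self-contained sketch of what the pattern extracts; the sample entries are invented for illustration:

import re

pattern = re.compile(r'(?m)^\[(?P<datetime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] (?P<level>\w+) (?P<log>(.+$(\n(?!\[).+|)+))')

sample = (
    "[2022-09-24 21:00:00,123] ERROR Failed to process request\n"
    "java.lang.RuntimeException: boom\n"
    "\tat com.example.Handler.handle(Handler.java:42)\n"
    "[2022-09-24 21:00:01,456] INFO Request completed\n"
)

# One dict per log entry; the stack trace stays inside the first entry's log group.
for m in pattern.finditer(sample):
    print(m.groupdict())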

And the final script to ingest log files into a SQLite database looks like this:

sqlite-utils insert /tmp/kafka-logs.db logs server.log.2022-09-24-21 --text --convert '
import re

# One entry = timestamp, level, then a message that may continue over several
# lines (e.g. a stack trace); (?!\[) keeps those continuation lines attached
# to the entry without consuming the "[" that starts the next one.
r = re.compile(r"^\[(?P<datetime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] (?P<level>\w+) (?P<log>(.+(\n(?!\[).+|)+))", re.MULTILINE)

def convert(text):
    rows = [m.groupdict() for m in r.finditer(text)]
    for row in rows:
        # Label each row so logs from different hosts/components can be told apart.
        row.update({"server": "localhost"})
        row.update({"component": "broker"})
    return rows
'

The row.update calls let me label each row, since I’m planning to ingest logs from different hosts and potentially different components.
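Those labels make it easy to slice the data later, for example (the query itself is just an illustration):

sqlite-utils query /tmp/kafka-logs.db "select server, component, level, count(*) from logs group by server, component, level"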

Additionally, full-text search support can be added to the log column to allow custom searches:

sqlite-utils enable-fts /tmp/kafka-logs.db logs log
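
And once FTS is enabled, sqlite-utils can run searches directly against the logs table (the search term is just an example):

sqlite-utils search /tmp/kafka-logs.db logs "exception"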