flatjsonl

A tool to flatten JSONL into CSV or SQL.

flatjsonl renders structured logs as tables.

Why?

Logs structured as JSON Lines (and sometimes prefixed with a non-JSON message) are a very common source of information for ad-hoc analytics and investigations.

They can be processed with jq and grepped for a variety of data checks; however, there are more powerful and convenient tools that operate on rows and columns rather than hierarchical structures.

This tool converts structured logs into tabular data (CSV, SQLite, PostgreSQL dump) with flexible mapping options.

Performance

Logs of busy systems tend to be large, so performance is important if you want the job done in a reasonable time.

Thanks to github.com/valyala/fastjson, github.com/puzpuzpuz/xsync and concurrency-friendly design, flatjsonl can leverage multicore machines to a large extent and crunch data at high speed.

vearutop@bigassbox ~ $ time ~/flatjsonl -pg-dump ~/events.pg.sql.gz -input ~/events.log -sql-table events -progress-interval 1m
scanning keys...
scanning keys: 100.0% bytes read, 11396506 lines processed, 200806.2 l/s, 902.3 MB/s, elapsed 56.75s, remaining 0s, heap 44 MB
lines: 11396506 , keys: 310
flattening data...
flattening data: 20.7% bytes read, 2363192 lines processed, 39385.9 l/s, 177.0 MB/s, elapsed 1m0s, remaining 3m49s, heap 569 MB
flattening data: 41.7% bytes read, 4750006 lines processed, 39583.1 l/s, 177.9 MB/s, elapsed 2m0s, remaining 2m47s, heap 485 MB
flattening data: 62.7% bytes read, 7140289 lines processed, 39668.1 l/s, 178.3 MB/s, elapsed 3m0s, remaining 1m47s, heap 610 MB
flattening data: 83.6% bytes read, 9528709 lines processed, 39702.9 l/s, 178.4 MB/s, elapsed 4m0s, remaining 47s, heap 572 MB
flattening data: 100.0% bytes read, 11396506 lines processed, 39692.4 l/s, 178.4 MB/s, elapsed 4m47.12s, remaining 0s, heap 508 MB
lines: 11396506 , keys: 310

real    5m44.002s
user    53m24.841s
sys     1m1.772s
51G  events.log
3.6G events.pg.sql.gz

How it works

In the simplest case, this tool makes two passes over the log file: the first pass collects all available keys, and the second pass fills the table with the already known keys (columns).

During each pass, each line is decoded and traversed recursively. Keys for nested elements are declared with dot-separated syntax (same as in jq); array indexes are enclosed in brackets, e.g. .deeper.subProperty.[0].foo.
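
For example, a hypothetical log line like this (names and values are illustrative):

{"name":"hello","deeper":{"subProperty":[{"foo":1},{"foo":2}]}}

would be flattened into keys along these lines:

.name
.deeper.subProperty.[0].foo
.deeper.subProperty.[1].foo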

With the -extract-strings flag, string values are checked for JSON content and are also traversed if JSON is found.
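
For example, a hypothetical line where payload holds serialized JSON as a string:

{"payload":"{\"user\":{\"id\":42}}"}

With -extract-strings, the embedded JSON is parsed too, so its nested keys (presumably along the lines of .payload.user.id; exact naming may differ) become available as columns.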

If includeKeys is not empty in the configuration file, the first pass is skipped.
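
For example, a minimal sketch of a config (hypothetical file names) that lists the wanted columns explicitly, letting flatjsonl go straight to the flattening pass:

includeKeys:
  - ".key1"
  - ".key2"

flatjsonl -config keys.yaml -csv out.csv input.jsonl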

Install

go install github.com/vearutop/flatjsonl@latest
$(go env GOPATH)/bin/flatjsonl --help

Or download a binary from releases.

Linux AMD64

wget https://github.com/vearutop/flatjsonl/releases/latest/download/linux_amd64.tar.gz && tar xf linux_amd64.tar.gz && rm linux_amd64.tar.gz
./flatjsonl -version

macOS Intel

wget https://github.com/vearutop/flatjsonl/releases/latest/download/darwin_amd64.tar.gz && tar xf darwin_amd64.tar.gz && rm darwin_amd64.tar.gz
codesign -s - ./flatjsonl
./flatjsonl -version

macOS Apple Silicon (M1, etc.)

wget https://github.com/vearutop/flatjsonl/releases/latest/download/darwin_arm64.tar.gz && tar xf darwin_arm64.tar.gz && rm darwin_arm64.tar.gz
codesign -s - ./flatjsonl
./flatjsonl -version

Usage

flatjsonl -help
Usage of flatjsonl:
  -add-sequence
        Add auto incremented sequence number.
  -buf-size int
        Buffer size (max length of file line) in bytes. (default 10000000)
  -case-sensitive-keys
        Use case-sensitive keys (can fail for SQLite).
  -concurrency int
        Number of concurrent routines in reader. (default 24)
  -config string
        Configuration JSON value, path to JSON5 or YAML file.
  -csv string
        Output to CSV file (gzip encoded if ends with .gz).
  -dbg-cpu-prof string
        Write CPU profile to file.
  -dbg-loop-input-size int
        (benchmark) Repeat input until total target size reached, bytes.
  -dbg-mem-prof string
        Write mem profile to file.
  -extract-strings
        Check string values for JSON content and extract when available.
  -field-limit int
        Max length of field value, exceeding tail is truncated, 0 for unlimited.
  -get-key string
        Add a single key to list of included keys.
  -input string
        Input from JSONL files, comma-separated.
  -key-limit int
        Max length of key, exceeding tail is truncated, 0 for unlimited.
  -match-line-prefix string
        Regular expression to capture parts of line prefix (preceding JSON).
  -max-lines int
        Max number of lines to process.
  -max-lines-keys int
        Max number of lines to process when scanning keys.
  -mem-limit int
        Heap in use soft limit, in MB. (default 1000)
  -offset-lines int
        Skip a number of first lines.
  -output string
        Output to a file (default <input>.csv).
  -pg-dump string
        Output to PostgreSQL dump file.
  -progress-interval duration
        Progress update interval. (default 5s)
  -raw string
        Output to RAW file (column values are written as is without escaping, gzip encoded if ends with .gz).
  -raw-delim string
        RAW file column delimiter.
  -replace-keys
        Use unique tail segment converted to snake_case as key.
  -show-keys-flat
        Show all available keys as flat list.
  -show-keys-hier
        Show all available keys as hierarchy.
  -show-keys-info
        Show keys, their replaces and types.
  -skip-zero-cols
        Skip columns with zero values.
  -sql-max-cols int
        Maximum columns in single SQL table (SQLite will fail with more than 2000). (default 500)
  -sql-table string
        Table name. (default "flatjsonl")
  -sqlite string
        Output to SQLite file.
  -verbosity int
        Show progress in STDERR, 0 disables status, 2 adds more metrics. (default 1)
  -version
        Show version and exit.

Configuration file

includeKeys:
  - ".key1"
  - ".key2"
  - "const:my-value"
  - ".keyGroup.[0].key3"
includeKeysRegex:
  - ".keyGroup.[1].*"
excludeKeys:
  - ".keyGroup.[1].notNeeded"
excludeKeysRegex:
  - ".keyGroup.*.notNeeded"
replaceKeys:
  ".key1": key1
  ".key2": created_at
parseTime:
  "._prefix.[1]": 2006/01/02 15:04:05.99999
outputTimeFormat: '2006-01-02 15:04:05'
outputTZ: UTC
concatDelimiter: "::"
extractValuesRegex:
  ".foo.link": "URL"
  ".*.nested": "JSON"

parseTime is a map of original keys to time patterns. See https://pkg.go.dev/time#pkg-constants for pattern rules.

outputTimeFormat is used to write parsed timestamps.
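
As a sketch of how these settings combine (values are illustrative), an input value under ._prefix.[1] from the configuration above would be handled like this:

input:  2023/01/02 15:04:05.1    (parsed with the parseTime pattern 2006/01/02 15:04:05.99999)
output: 2023-01-02 15:04:05      (written with outputTimeFormat, in outputTZ)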

The includeKeys list can also declare columns with constant values in the form "const:<value>"; <value> is used as the column value.

The configuration file can also have regexp replaces (replaceKeysRegex): a map with regular expressions as keys and replacement patterns as values.

It is also possible to use a simplified syntax with *, where * stands for a key segment (which can not start with a digit) between two dots.

{
  "replaceKeysRegex": {
    "^\\.foo\\.([^.]+)$": "f00_${1} VARCHAR(255)",
    ".foo.*.*": "f00_${2}_${1} VARCHAR(255)"
  }
}

This example would produce the following transformations:

.foo.bar => f00_bar VARCHAR(255)
.foo.baz.qux => f00_qux_baz VARCHAR(255)

Regular expression replaces are applied to keys that have no matches in replaceKeys.

Regular expressions are checked in no particular order; when the replaced key differs from the original, checking stops and the replaced key is used.

Multiple regular expressions could match and replace a key, which can lead to undefined behavior. To avoid it, it is recommended to use mutually exclusive expressions and to match against the full key by putting ^ and $ at the edges of the expression.

If multiple keys are replaced with the same key, a coalesce function is used for the resulting column value, or, if concatDelimiter is defined, the values are concatenated.
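
For illustration (hypothetical keys and values), if two keys are replaced with the same name column:

replaceKeys:
  ".first_name": name
  ".firstName": name

{"first_name":"John"}                        => name: John
{"first_name":"John","firstName":"Johnny"}   => name: John::Johnny  (with concatDelimiter "::")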

Transposing data

For dynamic arrays or objects, you may want to transpose the values into rows of separate tables instead of columns of the main table.

This is possible with the transpose configuration file field (example): it accepts a map of key prefixes to transposed table names. During processing, values found under the prefixed keys are moved into rows of the transposed table.
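
A minimal sketch (hypothetical key and table names), following the map-of-prefix-to-table-name shape described above:

transpose:
  ".tags": "tags"

With this, values found under .tags.[0], .tags.[1], etc. would become separate rows of a tags table instead of an ever-growing set of columns in the main table.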

Extracting data from strings

With the extractValuesRegex config parameter, you can set a map of regexps matching key names to value formats. Currently URL and JSON are supported as formats. String values in the matching keys are decoded and exposed as JSON.
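
For example (hypothetical data), with the extractValuesRegex configuration shown earlier, a line like

{"foo":{"link":"https://example.com/a/b?q=1"}}

would have the URL string in .foo.link decoded and exposed as JSON, so its parts (path, query parameters, etc.; exact key names depend on the tool) can be flattened into columns like any other nested values.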

Examples

Import data from events.jsonl, with columns described in the events.json config file, into the events table of the SQLite file report.sqlite.

flatjsonl -sqlite report.sqlite -sql-table events -config events.json events.jsonl

Show a flat list of keys found in the first 100 (or fewer) lines of events.jsonl.

flatjsonl -max-lines 100 -show-keys-flat events.jsonl

Import data from part1.log, part2.log, part3.log into part1.log.csv, with keys replaced by their unique tail segments converted to snake_case, and with columns captured from the line prefix (for lines formatted as <prefix> {<json>}).

flatjsonl -match-line-prefix '([\w\d-]+) [\w\d]+ ([\d/]+\s[\d:\.]+)' -replace-keys part1.log part2.log part3.log
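
For example (hypothetical log format), this prefix regex would match lines like:

host-1 abc123 2023/01/02 15:04:05.123 {"foo":"bar"}

capturing host-1 and 2023/01/02 15:04:05.123; judging by the ._prefix.[1] key in the configuration example above, capture groups are exposed as ._prefix.[x] keys.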

Extract a single column from a JSONL log (equivalent to cat huge.log | jq .foo.bar.baz > entries.log). flatjsonl is optimized for multi-core processors, so it can bring a performance improvement compared to single-threaded jq.

flatjsonl -input huge.log -raw entries.log -get-key ".foo.bar.baz"

License

MIT License