Reading log files into DuckDB to query them using SQL
init schema based on a log file
import data from nginx log files; if run without parameters it imports all log files, and if a single log file is specified, only that file is inserted
pip install duckdb geoip2 numpy
dpavlin@zamd:/zamd/dpavlin/duckdb-logs$
. /zamd/dpavlin/duckdb/venv/bin/activate
python3 -i geo.py
(venv) dpavlin@zamd:/zamd/dpavlin/duckdb$ python geo.py reload
Read the MaxMind database CSV and import it into DuckDB, based on duckdb/duckdb#10303
DuckDB Python UDF functions which use the geoip2 module and the MaxMind .mmdb database
dpavlin@zamd:/zamd/dpavlin/duckdb-logs$ python -i geo2.py logs.duckdb rebuild
dpavlin@zamd:/zamd/dpavlin/duckdb-logs$ python -i geo2.py logs.duckdb
import Zeek (https://zeek.org/) conn.log files
./zeek-pull.sh
./zeek2duckdb.sh
./zeek-attach.sh
dpavlin@zamd:~/duckdb-logs$ ./duckdb --init zeek-attach.sql
-- Loading resources from zeek-attach.sql
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D select time_bucket('300 seconds',ts) as t,orig_h, resp_h, count(*) as c, sum(orig_ip_bytes) as o_b, sum(resp_ip_bytes) as r_b, from c where ts > '2024-03-10 05:00:00' and ts < '2024-03-10 06:00:00' and orig_ip_bytes > 10000 group by t,orig_h,resp_h order by t;
┌──────────────────────────┬─────────────┬───────────────┬───────┬────────┬────────┐
│            t             │   orig_h    │    resp_h     │   c   │  o_b   │  r_b   │
│ timestamp with time zone │   varchar   │    varchar    │ int64 │ int128 │ int128 │
├──────────────────────────┼─────────────┼───────────────┼───────┼────────┼────────┤
│ 2024-03-10 05:50:00+01   │ 10.60.1.163 │ 20.250.77.142 │     1 │ 136508 │ 132468 │
└──────────────────────────┴─────────────┴───────────────┴───────┴────────┴────────┘
re-run the last SQL query from ~/.duckdb_history using less as a pager
very nice information on how to use DuckDB, if this repository is not enough for you