Reading log files into DuckDB to query them using SQL
init schema based on a log file
import data from nginx log files; if run without parameters it imports all log files, and if a single log file is specified, only that file is inserted
pip install duckdb geoip2 numpy
dpavlin@zamd:/zamd/dpavlin/duckdb-logs$
. /zamd/dpavlin/duckdb/venv/bin/activate
python3 -i geo.py
(venv) dpavlin@zamd:/zamd/dpavlin/duckdb$ python geo.py reload
Read the MaxMind database CSV and import it into DuckDB, based on duckdb/duckdb#10303
DuckDB Python UDF functions which use the geoip2 module and the MaxMind .mmdb database
dpavlin@zamd:/zamd/dpavlin/duckdb-logs$ python -i geo2.py logs.duckdb rebuild
dpavlin@zamd:/zamd/dpavlin/duckdb-logs$ python -i geo2.py logs.duckdb
import Zeek (https://zeek.org/) conn.log files
./zeek-pull.sh
./zeek2duckdb.sh
./zeek-attach.sh
dpavlin@zamd:~/duckdb-logs$ ./duckdb --init zeek-attach.sql
-- Loading resources from zeek-attach.sql
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D select time_bucket('300 seconds',ts) as t,orig_h, resp_h, count(*) as c, sum(orig_ip_bytes) as o_b, sum(resp_ip_bytes) as r_b, from c where ts > '2024-03-10 05:00:00' and ts < '2024-03-10 06:00:00' and orig_ip_bytes > 10000 group by t,orig_h,resp_h order by t;
┌──────────────────────────┬─────────────┬───────────────┬───────┬────────┬────────┐
│            t             │   orig_h    │    resp_h     │   c   │  o_b   │  r_b   │
│ timestamp with time zone │   varchar   │    varchar    │ int64 │ int128 │ int128 │
├──────────────────────────┼─────────────┼───────────────┼───────┼────────┼────────┤
│ 2024-03-10 05:50:00+01   │ 10.60.1.163 │ 20.250.77.142 │     1 │ 136508 │ 132468 │
└──────────────────────────┴─────────────┴───────────────┴───────┴────────┴────────┘
re-run the last SQL query from ~/.duckdb_history using less as a pager
very nice information on how to use DuckDB, if this repository is not enough for you