maxmouchet / mmdb-to-clickhouse

Import MMDB files into ClickHouse.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

mmdb-to-clickhouse

Release Test

Import MMDB files into ClickHouse.

This repository contains a dummy example.mmdb file for testing purpose. To get real data, check out the free Country & ASN database from IPinfo IPinfo which is also supported by this tool.

Features

  • Automatically infers the table schema from the MMDB file
  • Supports nested records by flattening them
  • Stores data in a partitioned table to keep track of history
  • Creates an IP trie dictionary for fast lookups

Current limitations

  • The schema is inferred from the first record only. If subsequent records have additional fields, those will be ignored.
  • The network and partition names are reserved and must not be present in the MMDB file.
  • The username and password will be embedded in the dictionary source definition (see ClickHouse/ClickHouse#38991).
    • To avoid storing critical credentials in the dictionary definition, you can create a dedicated user that has only access to the MMDB tables.

Installation

Download the latest release for your operating system and run:

./mmdb-to-clickhouse -h

You can also run it through Docker with:

docker run --rm -it ghcr.io/maxmouchet/mmdb-to-clickhouse -h

Example usage

First start a ClickHouse instance:

docker run --name clickhouse --rm -d -p 9000:9000 clickhouse/clickhouse-server

Then download the example MMDB file:

wget https://github.com/maxmouchet/mmdb-to-clickhouse/raw/main/example.mmdb

And run mmdb-to-clickhouse:

./mmdb-to-clickhouse -dsn clickhouse://localhost:9000 -mmdb example.mmdb -name example_mmdb -test

(See clickhouse-go documentation for the DSN format and allowed values)

The output should look like the following:

2024/07/04 13:51:07 Net schema: `network` String, `pointer` UInt64, `partition` Date
2024/07/04 13:51:07 Val schema: `pointer` UInt32, `country` String, `partition` Date
2024/07/04 13:51:07 Creating table example_mmdb_net_history
2024/07/04 13:51:07 Creating table example_mmdb_val_history
2024/07/04 13:51:07 Creating dictionary example_mmdb_net
2024/07/04 13:51:07 Creating dictionary example_mmdb_val
2024/07/04 13:51:07 Dropping partition 2024-07-04
2024/07/04 13:51:07 Dropping partition 2024-07-04
2024/07/04 13:51:07 Inserting data
2024/07/04 13:51:07 Inserted 1 networks and 1 values
2024/07/04 13:51:07 Creating function example_mmdb
2024/07/04 13:51:07 Running test query: SELECT example_mmdb('1.1.1.1', 'country')
2024/07/04 13:51:07 This may take some time as the dictionnary gets loaded in memory
2024/07/04 13:51:07 Test query result: WW

This will create:

  • Two partitioned tables (30 days of history by default, see the -ttl option):
    • example_mmdb_net_history: IP networks and pointers to distinct values
    • example_mmdb_val_history: pointers and associated values
  • Two dictionaries:
    • example_mmdb_net: an in-memory IP trie which always uses the latest partition from example_mmdb_net_history. This dictionary enables very fast IP lookups.
    • example_mmdb_val: an in-memory KV mapping which always uses the latest partition from example_mmdb_val_history.
  • One function:
    • example_mmdb(ip, attrs): this function first looks up the pointer in example_mmdb_net and then retrieves the value in example_mmdb_val.

Open a REPL and inspect the tables:

docker exec -it clickhouse clickhouse client
SHOW TABLES
-- ┌─name─────────────────────┐
-- │ example_mmdb_net         │
-- │ example_mmdb_net_history │
-- │ example_mmdb_val         │
-- │ example_mmdb_val_history │
-- └──────────────────────────┘

SELECT * FROM example_mmdb_net
-- ┌─network───┬─pointer─┬──partition─┐
-- │ 0.0.0.0/0 │       0 │ 2024-07-04 │
-- └───────────┴─────────┴────────────┘

SELECT * FROM example_mmdb_val
-- ┌─pointer─┬─country─┬──partition─┐
-- │       0 │ WW      │ 2024-07-04 │
-- └─────────┴─────────┴────────────┘

SELECT example_mmdb('1.1.1.1', 'country') AS country
-- ┌─country─┐
-- │ WW      │
-- └─────────┘

To clean up just remove the ClickHouse instance:

docker rm -f clickhouse

Or, to cleanup tables manually:

DROP FUNCTION example_mmdb;
DROP DICTIONARY example_mmdb_net;
DROP DICTIONARY example_mmdb_val;
DROP TABLE example_mmdb_net_history;
DROP TABLE example_mmdb_val_history;

Performance

Tests performed with ClickHouse 24.6.1 and mmdb-to-clickhouse 1.2.1 on a VM with 4 vCPUs (i5-12600H) and 32GB of memory.

Database MMDB size Networks Values Insertion time Dict cold load time Dict mem usage Lookup
IPinfo Privacy 293MB 12M 387 30 seconds 30 seconds 558MB 40M rows/s
IPinfo Location 1.6GB 205M 358k 9 minutes 8 minutes 9GB 40M rows/s

About

Import MMDB files into ClickHouse.

License:MIT License


Languages

Language:Go 100.0%