Mempool Dumpster 🗑️♻️


Dump mempool transactions from EL nodes and archive them in Parquet and CSV formats.

  • Parquet: Transaction metadata (timestamp in millis, hash, attributes; about 100MB / day)
  • CSV: Raw transactions (RLP hex + timestamp in millis + tx hash; about 1GB / day zipped)
  • This project is under active development, although relatively stable and ready to use
  • Observing about 30k - 100k mempool transactions per hour (1M - 1.5M transactions per day)
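
For illustration, a single raw-transaction CSV row could look like this (made-up, truncated values; field order as in the collector description below: timestamp in millis, tx hash, raw RLP hex):

1691402400001,0x7a4e...c3f1,0x02f874...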

System architecture

  1. Mempool Collector: Connects to EL nodes and writes new mempool transactions to CSV files. Multiple collector instances can run without colliding.
  2. Summarizer: Takes collector CSV files as input, dedupes them, sorts by timestamp, and writes CSV + Parquet output files.

Getting started

Mempool Collector

  1. Connects to one or more EL nodes via websocket
  2. Listens for new pending transactions
  3. Writes timestamp + hash + rawTx to CSV file (one file per hour by default)

Default filename:

  • Schema: <out_dir>/<date>/transactions/txs_<date>_<uid>.csv (the <date> in the filename includes the hour, since a new file is started every hour by default)
  • Example: out/2023-08-07/transactions/txs_2023-08-07-10-00_collector1.csv
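
A minimal sketch of how such an hourly filename can be derived with Go's reference-time formatting (hypothetical helper, not the actual collector code):

package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// csvFilename builds the hourly output path: the filename's date component
// is the start of the current hour, matching the example above.
func csvFilename(outDir, uid string, t time.Time) string {
	hour := t.UTC().Truncate(time.Hour)
	return filepath.Join(
		outDir,
		hour.Format("2006-01-02"), // date directory
		"transactions",
		fmt.Sprintf("txs_%s_%s.csv", hour.Format("2006-01-02-15-04"), uid),
	)
}

func main() {
	t := time.Date(2023, 8, 7, 10, 42, 0, 0, time.UTC)
	fmt.Println(csvFilename("out", "collector1", t))
	// -> out/2023-08-07/transactions/txs_2023-08-07-10-00_collector1.csv
}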

Running the mempool collector:

# Connect to ws://localhost:8546 and write CSVs into ./out
go run cmd/collector/main.go -out ./out

# Connect to multiple nodes
go run cmd/collector/main.go -out ./out -nodes ws://server1.com:8546,ws://server2.com:8546
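
For orientation, here is a minimal, self-contained sketch of the collector loop using go-ethereum (assuming the EL node supports full pending-transaction subscriptions over websocket; the real implementation under cmd/collector/ also handles reconnects, multiple nodes, and hourly file rotation):

package main

import (
	"context"
	"encoding/csv"
	"encoding/hex"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethclient/gethclient"
	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	rpcClient, err := rpc.Dial("ws://localhost:8546")
	if err != nil {
		log.Fatal(err)
	}

	// Subscribe to full pending transactions (not just hashes)
	txs := make(chan *types.Transaction)
	sub, err := gethclient.New(rpcClient).SubscribeFullPendingTransactions(context.Background(), txs)
	if err != nil {
		log.Fatal(err)
	}
	defer sub.Unsubscribe()

	f, err := os.Create("txs.csv")
	if err != nil {
		log.Fatal(err)
	}
	w := csv.NewWriter(f)

	for tx := range txs {
		raw, err := tx.MarshalBinary() // RLP / typed-tx encoding
		if err != nil {
			continue
		}
		_ = w.Write([]string{
			fmt.Sprint(time.Now().UnixMilli()),
			tx.Hash().Hex(),
			hex.EncodeToString(raw),
		})
		w.Flush() // flush per row so the file is usable while collecting
	}
}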

Summarizer

  • Iterates over the collector output directory / CSV files
  • Creates a summary file in Parquet format with key transaction attributes
  • TODO: create the archive from the output of multiple collectors
    • Take several files/directories as input

Running the summarizer:

go run cmd/summarizer/main.go -h

go run cmd/summarizer/main.go -out /mnt/data/mempool-dumpster/2023-08-12/ --out-date 2023-08-12 /mnt/data/mempool-dumpster/2023-08-12/2023-08-12_transactions/*.csv
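
A sketch of the summarizer's dedupe-and-sort step using only the standard library (the row layout timestamp, hash, rawTx is an assumption here; the actual summarizer also emits the Parquet metadata file):

package main

import (
	"encoding/csv"
	"log"
	"os"
	"sort"
	"strconv"
)

type row struct {
	ts  int64
	rec []string
}

// summarize merges the given collector CSVs, keeps the earliest sighting of
// each transaction hash, and returns the rows sorted by timestamp.
func summarize(paths []string) ([][]string, error) {
	earliest := map[string]row{} // tx hash -> row with the lowest timestamp
	for _, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			return nil, err
		}
		recs, err := csv.NewReader(f).ReadAll()
		f.Close()
		if err != nil {
			return nil, err
		}
		for _, rec := range recs {
			if len(rec) < 3 {
				continue // skip malformed rows
			}
			ts, err := strconv.ParseInt(rec[0], 10, 64)
			if err != nil {
				continue
			}
			hash := rec[1]
			if old, ok := earliest[hash]; !ok || ts < old.ts {
				earliest[hash] = row{ts, rec}
			}
		}
	}
	rows := make([]row, 0, len(earliest))
	for _, r := range earliest {
		rows = append(rows, r)
	}
	sort.Slice(rows, func(i, j int) bool { return rows[i].ts < rows[j].ts })
	out := make([][]string, len(rows))
	for i, r := range rows {
		out[i] = r.rec
	}
	return out, nil
}

func main() {
	rows, err := summarize(os.Args[1:])
	if err != nil {
		log.Fatal(err)
	}
	_ = csv.NewWriter(os.Stdout).WriteAll(rows)
}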

Architecture

General design goals

  • Keep it simple, stupid (KISS)
  • Vendor-agnostic (main flow should work on any server, independent of a cloud provider)
  • Downtime resilience, to minimize gaps in the archive
  • Multiple collector instances can run concurrently without getting in each other's way
  • The summarizer produces the final archive, based on the output of multiple collectors
  • The final archive:
    • Includes (1) a Parquet file with transaction metadata, and (2) a compressed file of the raw-transaction CSV files
    • Compatible with ClickHouse and S3 Select (Parquet with gzip compression)
    • Easily distributable as torrent

Mempool Collector

  • NodeConnection
    • One for each EL connection
    • New pending transactions are sent to TxProcessor via a channel
  • TxProcessor
    • Checks whether it has already processed a given transaction
    • Stores new transactions in the output directory
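
A rough sketch of that fan-in (TxIn, txProcessor, and writeToCSV are hypothetical names for illustration, not the actual source):

package main

import (
	"log"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
)

// TxIn pairs a pending transaction with the node connection that sent it.
type TxIn struct {
	Source string
	Tx     *types.Transaction
}

// txProcessor drains the shared channel that all NodeConnections write to,
// skipping transactions it has already seen.
func txProcessor(in <-chan TxIn) {
	seen := make(map[common.Hash]bool)
	for msg := range in {
		h := msg.Tx.Hash()
		if seen[h] {
			continue // duplicate delivered by another node
		}
		seen[h] = true
		writeToCSV(msg)
	}
}

// writeToCSV is a stand-in for appending the row to the current hourly CSV.
func writeToCSV(msg TxIn) {
	log.Printf("tx %s first seen via %s", msg.Tx.Hash().Hex(), msg.Source)
}

func main() {
	in := make(chan TxIn, 2)
	tx := types.NewTx(&types.LegacyTx{Nonce: 1})
	in <- TxIn{Source: "node1", Tx: tx}
	in <- TxIn{Source: "node2", Tx: tx} // duplicate, will be skipped
	close(in)
	txProcessor(in)
}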

Summarizer


Contributing

Install dependencies

go install mvdan.cc/gofumpt@latest
go install honnef.co/go/tools/cmd/staticcheck@latest
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
go install github.com/daixiang0/gci@latest

Lint, test, format

make lint
make test
make fmt

TODO

Lots, this is WIP

should:

  • Collector: support multiple -node CLI args (like mev-boost)

could:

  • Stats about which node saw how many transactions first
  • HTTP server to add/remove nodes, see stats, pprof?

Further notes


License

MIT

