Dump mempool transactions from EL nodes and archive them in Parquet and CSV formats.
- Parquet: Transaction metadata (timestamp in millis, hash, attributes; about 100MB / day)
- CSV: Raw transactions (RLP hex + timestamp in millis + tx hash; about 1GB / day zipped)
- This project is under active development, although relatively stable and ready to use
- Observing about 30k - 100k mempool transactions per hour (1M - 1.5M transactions per day)
- Mempool Collector: Connects to EL nodes and writes new mempool transactions to CSV files. Multiple collector instances can run without colliding.
- Summarizer: Takes collector CSV files as input, dedupes, sorts by timestamp and writes to CSV + Parquet output files
- Connects to one or more EL nodes via websocket
- Listens for new pending transactions
- Writes timestamp + hash + rawTx to a CSV file (one file per hour by default)
- Default filename:
  - Schema: <out_dir>/<date>/transactions/txs_<date>_<uid>.csv
  - Example: out/2023-08-07/transactions/txs_2023-08-07-10-00_collector1.csv
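The per-transaction CSV row can be sketched like this (the helper and field names here are illustrative, not the project's actual code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// formatTxRow builds one CSV row in the (timestamp_ms, hash, rawTx)
// layout described above. Hypothetical helper for illustration.
func formatTxRow(tsMilli int64, hash, rawTxHex string) string {
	return strings.Join([]string{
		strconv.FormatInt(tsMilli, 10), // timestamp in milliseconds
		hash,                           // transaction hash
		rawTxHex,                       // RLP-encoded tx as hex
	}, ",")
}

func main() {
	// Made-up example values
	fmt.Println(formatTxRow(1691402400000, "0xaa11", "0xf86b"))
}
```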
Running the mempool collector:
# Connect to ws://localhost:8546 and write CSVs into ./out
go run cmd/collector/main.go -out ./out
# Connect to multiple nodes
go run cmd/collector/main.go -out ./out -nodes ws://server1.com:8546,ws://server2.com:8546
- Iterates over collector output directory / CSV files
- Creates summary file in Parquet format with key transaction attributes
- TODO: create archive from output of multiple collectors
- Takes several files/directories as input
go run cmd/summarizer/main.go -h
go run cmd/summarizer/main.go -out /mnt/data/mempool-dumpster/2023-08-12/ --out-date 2023-08-12 /mnt/data/mempool-dumpster/2023-08-12/2023-08-12_transactions/*.csv
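The dedupe-and-sort step the summarizer performs can be sketched as follows (the record type and function are assumptions for illustration, not the project's code):

```go
package main

import (
	"fmt"
	"sort"
)

// txRecord is a minimal stand-in for one collector CSV row.
type txRecord struct {
	TimestampMs int64
	Hash        string
}

// summarize dedupes records by hash, keeping the earliest timestamp
// (multiple collectors may have seen the same tx), and returns the
// result sorted by timestamp.
func summarize(in []txRecord) []txRecord {
	earliest := make(map[string]int64)
	for _, r := range in {
		if t, ok := earliest[r.Hash]; !ok || r.TimestampMs < t {
			earliest[r.Hash] = r.TimestampMs
		}
	}
	out := make([]txRecord, 0, len(earliest))
	for h, t := range earliest {
		out = append(out, txRecord{TimestampMs: t, Hash: h})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].TimestampMs < out[j].TimestampMs })
	return out
}

func main() {
	rows := []txRecord{{300, "0xbb"}, {100, "0xaa"}, {200, "0xaa"}} // 0xaa seen twice
	for _, r := range summarize(rows) {
		fmt.Println(r.TimestampMs, r.Hash)
	}
}
```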
- Keep it simple and stupid
- Vendor-agnostic (main flow should work on any server, independent of a cloud provider)
- Downtime resilience, to minimize gaps in the archive
- Multiple collector instances can run concurrently without getting in each other's way
- Summarizer script produces the final archive (based on the input of multiple collector outputs)
- The final archive:
- Includes (1) a Parquet file with transaction metadata, and (2) a compressed archive of the raw transaction CSV files
- Compatible with Clickhouse and S3 Select (Parquet using gzip compression)
- Easily distributable as torrent
NodeConnection
- One instance per EL connection
- New pending transactions are sent to TxProcessor via a channel
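The fan-in from multiple node connections into a single processor can be sketched with a shared channel (all names and types here are illustrative, not the project's actual code):

```go
package main

import "fmt"

// pendingTx is a minimal stand-in for what a NodeConnection forwards;
// the real collector also carries the raw RLP bytes.
type pendingTx struct {
	Hash   string
	Source string // which EL node saw the tx
}

// runNodeConnection simulates one EL websocket subscription pushing
// new pending transactions into the shared channel.
func runNodeConnection(source string, hashes []string, out chan<- pendingTx) {
	for _, h := range hashes {
		out <- pendingTx{Hash: h, Source: source}
	}
}

func main() {
	txC := make(chan pendingTx, 16)
	go runNodeConnection("ws://server1.com:8546", []string{"0xaa", "0xbb"}, txC)
	// A TxProcessor-style loop would range over txC; here we just
	// drain the two example transactions.
	for i := 0; i < 2; i++ {
		tx := <-txC
		fmt.Println(tx.Source, tx.Hash)
	}
}
```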
TxProcessor
- Checks if it has already processed a given transaction
- Stores it in the output directory
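The already-processed check can be as simple as a hash set owned by the processor (a sketch under that assumption; the real code may differ):

```go
package main

import "fmt"

// seenTxs tracks already-processed transaction hashes so each tx is
// written to the output directory only once.
type seenTxs map[string]bool

// firstSeen reports whether hash is new, and marks it as processed.
func (s seenTxs) firstSeen(hash string) bool {
	if s[hash] {
		return false
	}
	s[hash] = true
	return true
}

func main() {
	seen := make(seenTxs)
	fmt.Println(seen.firstSeen("0xaa")) // new tx: write it
	fmt.Println(seen.firstSeen("0xaa")) // duplicate: skip it
}
```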
- Uses https://github.com/xitongsys/parquet-go to write Parquet format
Install dependencies
go install mvdan.cc/gofumpt@latest
go install honnef.co/go/tools/cmd/staticcheck@latest
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
go install github.com/daixiang0/gci@latest
Lint, test, format
make lint
make test
make fmt
Lots, this is WIP
should:
- collector: support multiple -node CLI args (like mev-boost)
could:
- stats about which node saw how many tx first
- http server to add/remove nodes, see stats, pprof?
- See also: discussion about compression and storage
MIT