paradigmxyz / reth

Modular, contributor-friendly and blazing-fast implementation of the Ethereum protocol, in Rust

Home Page: https://reth.rs/

Crawler and node tracker using reth p2p

Rjected opened this issue · comments

Motivation

Ethernodes is a public-facing node tracker for Ethereum nodes, and even serves as one of the data sources for execution-client diversity. It surfaces information such as:

  • Client version (user agent, e.g. geth/v1.11.6-stable/linux-amd64/go1.20.3)
  • Best block
  • Current difficulty
  • Time since "last seen"

It's definitely an interesting data source, but because it is not open source, it's hard to know the exact methodology used to produce the data. Given the metadata it has about nodes, it is probably using a p2p crawler underneath, and at least gaining information from the eth Status handshake message.

There is also a node crawler embedded in geth, devp2p crawl, which is useful as a simple reference implementation of an Ethereum crawler.

It would be interesting to build a crawler that is:

  • at least as good as devp2p crawl
  • open source, making the methodology more transparent than ethernodes

It would then be very easy to do research on the p2p network in Rust, and to better understand node usage trends.

How ethereum node crawlers work

Crawlers do a few main things:

  • Crawl a node discovery network
    • In ethereum: discv4, dnsdisc (implemented in reth), discv5 (not implemented in reth yet)
  • Establish sessions with discovered nodes
  • Perform some sort of validation on the discovered nodes
    • Lots of data can be easily faked or sybil-ed, and connections are easy to spin up! This is mostly an inherent property of permissionless p2p networks, like DHTs.
  • Persist metadata obtained during the above steps

The first step is to crawl a discovery network to obtain peer IDs and IPs to establish connections with. Ethereum can be challenging here because discv4 contains nodes not only from Ethereum, but from other networks as well (BSC, testnets, etc.). Reth already implements two methods to mitigate this concern:

  • dnsdisc, or EIP-1459, which uses public endpoints populated with network-curated peers
  • EIP-2124, which adds the node's fork ID to discv4 records, making it possible to filter out nodes from different networks before establishing a connection. Trying to establish connections with nodes that don't share a network wastes valuable time that can be spent crawling instead.
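As a sketch, fork-ID filtering before dialing might look like the following. The `ForkId` type and the fork-hash values here are illustrative stand-ins, not reth's actual API or real network values:

```rust
/// A minimal stand-in for an EIP-2124 fork ID: a 4-byte hash (a CRC32 of
/// the genesis hash and past fork blocks) plus the next scheduled fork.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ForkId {
    hash: [u8; 4],
    next: u64,
}

/// Accept a peer only if its fork hash matches ours. Peers from other
/// networks (BSC, testnets, ...) advertise a different hash, so they can
/// be skipped before we waste a connection attempt on them.
fn is_same_network(ours: ForkId, theirs: Option<ForkId>) -> bool {
    match theirs {
        // No fork ID in the discv4 record: can't filter, dial optimistically.
        None => true,
        Some(t) => t.hash == ours.hash,
    }
}

fn main() {
    // Illustrative fork hashes, not real mainnet values.
    let ours = ForkId { hash: [0xaa, 0xbb, 0xcc, 0xdd], next: 0 };
    let same = ForkId { hash: [0xaa, 0xbb, 0xcc, 0xdd], next: 1_150_000 };
    let other = ForkId { hash: [0x11, 0x22, 0x33, 0x44], next: 0 };
    assert!(is_same_network(ours, Some(same)));
    assert!(!is_same_network(ours, Some(other)));
    assert!(is_same_network(ours, None));
    println!("fork-id filtering ok");
}
```

Note that only the fork hash is compared here; a fuller implementation would also use the `next` field to reject peers that missed a scheduled fork.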

There are three steps to establishing a connection once a peer is found: the ECIES key-exchange handshake, the RLPx Hello exchange, and the eth Status exchange.

Note

Everything up to this point has already been implemented in reth, and serves as an introduction to ethereum networking.
The tools to discover peers, filter those peers by network, and establish connections, can be accomplished by using existing reth networking libraries.

Once a crawler successfully establishes a connection with another node, it can perform extra validation specific to the protocol, to make it more difficult to sybil the crawler. On ethereum, this might be something like sampling some random set of block hashes. No method makes it entirely impossible to sybil the crawler, but it's worth thinking about simple strategies that can distinguish a "real" node from a fake peer on the network.
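A minimal sketch of this block-hash sampling idea, with the canonical chain and the peers simulated in memory (the function names and shapes are assumptions for illustration, not reth APIs):

```rust
use std::collections::HashMap;

type BlockHash = [u8; 32];

/// Stand-in for a real block hash: deterministic filler bytes per height.
fn hash_for(n: u64) -> BlockHash {
    let mut h = [0u8; 32];
    h[..8].copy_from_slice(&n.to_be_bytes());
    h
}

/// A peer passes validation only if it returns our canonical hash for
/// every sampled height; a wrong or missing answer marks it as fake.
fn validate_peer(
    canonical: &HashMap<u64, BlockHash>,
    sample_heights: &[u64],
    peer_answers: &dyn Fn(u64) -> Option<BlockHash>,
) -> bool {
    sample_heights.iter().all(|h| match (canonical.get(h), peer_answers(*h)) {
        (Some(ours), Some(theirs)) => *ours == theirs,
        _ => false,
    })
}

fn main() {
    let canonical: HashMap<u64, BlockHash> = (0..100).map(|n| (n, hash_for(n))).collect();
    // An honest peer answers from the same chain; a liar answers junk.
    let honest = |n: u64| canonical.get(&n).copied();
    let liar = |_n: u64| Some(hash_for(999));
    assert!(validate_peer(&canonical, &[3, 50, 97], &honest));
    assert!(!validate_peer(&canonical, &[3, 50, 97], &liar));
    println!("block-hash sampling ok");
}
```

In a real crawler the heights would be chosen randomly per session, so a sybil node cannot precompute answers for a fixed sample.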

The metadata produced from the handshake can then be tracked and eventually persisted:

  • Timestamps
  • IPs
  • The Hello handshake message
  • The remote public key / peer ID
  • The Status handshake message
  • Generally, the transcript of the entire session.
  • maybe more??
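Put together, the per-peer record might look something like this hypothetical struct; all field names are illustrative, not an existing reth type:

```rust
use std::net::IpAddr;

/// Hypothetical shape for the record a crawler might persist per peer,
/// collecting the metadata listed above.
#[derive(Debug, Clone)]
struct PeerMetadata {
    peer_id: [u8; 64],               // remote public key / peer ID
    ip: IpAddr,
    first_seen: u64,                 // unix timestamps
    last_seen: u64,
    client_version: String,          // from Hello, e.g. "reth/v0.1.0"
    capabilities: Vec<(String, u8)>, // protocol name + version, from Hello
    best_block: [u8; 32],            // from Status
    total_difficulty: u128,          // from Status
    fork_id_hash: [u8; 4],           // EIP-2124 fork hash
}

fn main() {
    let peer = PeerMetadata {
        peer_id: [0u8; 64],
        ip: "127.0.0.1".parse().unwrap(),
        first_seen: 1_695_000_000,
        last_seen: 1_695_003_600,
        client_version: "reth/v0.1.0".to_string(),
        capabilities: vec![("eth".to_string(), 68)],
        best_block: [0u8; 32],
        total_difficulty: 0,
        fork_id_hash: [0u8; 4],
    };
    assert!(peer.last_seen >= peer.first_seen);
    println!("{}", peer.client_version);
}
```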

Basic crawler MVP

Ultimately the crawler is supposed to produce a list of peers that could be considered "real" based on some set of heuristics, and allow access to the metadata in a useful way. After obtaining a stream or list of peers, the open problems are:

  • How should this data be structured?
    • Should information be indexed somehow? For example metadata could be indexed by peer ID, or by IP.
  • Where should the data be stored?
    • A prototype should probably store everything in-memory, but devp2p crawl for example saves to a JSON file. It's also worth thinking about how this could use a database.
  • How should users access this data?
    • Will users primarily access the data once it's persisted, like with devp2p crawl, or should there be a network API?
    • If there is an API, what will it look like? What endpoints are the most important or useful?
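One way to prototype the indexing question above is an in-memory store with a peer-ID primary key plus a secondary index from IP, so both access patterns are cheap. This is a sketch; all names are hypothetical:

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// In-memory prototype store: metadata keyed by peer ID, with a
/// secondary index mapping an IP to the peer IDs seen at it.
#[derive(Default)]
struct CrawlStore {
    by_peer: HashMap<String, String>,    // peer_id -> serialized metadata
    by_ip: HashMap<IpAddr, Vec<String>>, // ip -> peer ids seen at that ip
}

impl CrawlStore {
    fn insert(&mut self, peer_id: &str, ip: IpAddr, metadata: &str) {
        self.by_peer.insert(peer_id.to_string(), metadata.to_string());
        // Keep the secondary index duplicate-free.
        let ids = self.by_ip.entry(ip).or_default();
        if !ids.iter().any(|id| id == peer_id) {
            ids.push(peer_id.to_string());
        }
    }

    fn get(&self, peer_id: &str) -> Option<&String> {
        self.by_peer.get(peer_id)
    }

    fn peers_at(&self, ip: IpAddr) -> &[String] {
        self.by_ip.get(&ip).map(Vec::as_slice).unwrap_or(&[])
    }
}

fn main() {
    let mut store = CrawlStore::default();
    let ip: IpAddr = "10.0.0.1".parse().unwrap();
    store.insert("peer-a", ip, "reth/v0.1.0");
    store.insert("peer-b", ip, "geth/v1.11.6");
    assert_eq!(store.get("peer-a").unwrap(), "reth/v0.1.0");
    assert_eq!(store.peers_at(ip).len(), 2);
    println!("store ok");
}
```

Swapping the `String` metadata payload for a real struct, and the maps for database tables, is the natural next step once the data model settles.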

Timeseries tracking

It should be possible to specify a crawling interval, so network data can be stored as a timeseries. Bitnodes, for example, tracks user agents / client versions over time.

This would be useful for tracking whether users are upgrading to new versions of reth, and for identifying trends in overall reth usage. This feature would also mean the crawler surpasses all existing node trackers in functionality.
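A sketch of the interval bucketing this implies, assuming each crawl snapshot records client-version counts keyed by the interval's start timestamp (names hypothetical):

```rust
use std::collections::BTreeMap;

/// interval start timestamp -> (client version -> count)
type Series = BTreeMap<u64, BTreeMap<String, u64>>;

/// Round a timestamp down to the start of its crawl interval.
fn bucket(timestamp: u64, interval_secs: u64) -> u64 {
    timestamp - (timestamp % interval_secs)
}

/// Count one sighting of `client_version` in the interval containing
/// `timestamp`, so version share can be charted over time.
fn record(series: &mut Series, timestamp: u64, interval_secs: u64, client_version: &str) {
    let slot = bucket(timestamp, interval_secs);
    *series
        .entry(slot)
        .or_default()
        .entry(client_version.to_string())
        .or_default() += 1;
}

fn main() {
    let mut series = Series::new();
    record(&mut series, 1_000, 3_600, "reth/v0.1.0");
    record(&mut series, 2_000, 3_600, "reth/v0.1.0");
    record(&mut series, 4_000, 3_600, "geth/v1.11.6");
    assert_eq!(series[&0]["reth/v0.1.0"], 2);
    assert_eq!(series[&3_600]["geth/v1.11.6"], 1);
    println!("timeseries ok");
}
```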

Examples of p2p code in reth

This article is also a great introduction to the reth p2p stack: Diving into the Reth p2p stack

Other node trackers

Hey @Rjected,
@alessandromazza98 and I are going to work on this.


hey @chirag-bgh/ @alessandromazza98 - I started digging into the design for this a bit earlier as well. Happy to help you both contribute, maybe we can start a TG groupchat?

sounds good! let me set up the tg


9/28 meeting notes - @alessandromazza98 / @0xprames (will catch @chirag-bgh async)

  • Discussion of high-level design and the issue's open questions
  • Discussed task breakdown (prototype vs. design + longer-term impl work)

The end state for this issue eventually looks like ethernodes etc.: a service that periodically (asynchronously) crawls, updates a datastore with the relevant data, plus an API service layer that serves both programmatic requests for the data and a frontend (web client) view. The prototype will initially be a Rust/reth rewrite of geth's devp2p crawl.

Open Questions to answer with a design doc:

  • Figure out data model - what data do we want to store, and how do we want to represent it i.e from the original issue:

The metadata produced from the handshake can then be tracked and eventually persisted:
* Timestamps
* IPs
* The Hello handshake message
* The remote public key / peer ID
* The Status handshake message
* Generally, the transcript of the entire session.
* maybe more??
  • Data indexing question: “Should information be indexed somehow? For example metadata could be indexed by peer ID, or by IP” (from the original issue). This is effectively asking what our “primary key” is when we store the data: do we want two tables, one indexed by a peer-ID primary key and one by IP, and what is the cost of that?

  • How do we want to persist the data? If we work backwards from an ethernodes-type solution, we most probably need this data persisted in a database of some sort and served via a service layer. We discussed:
    • Do we re-use the reth DB and add a table? Probably not; we don't want to couple this with the reth DB.
    • Which external store do we want (DynamoDB, Scylla, Postgres, SQLite, etc.)? Choosing between SQL and NoSQL, it may make sense to lean towards a K/V setup, but this needs to be thought through. Relational may not make sense here; periodic scans of a K/V table can be done without incurring too much cost (for example, needing to reconcile data with an async reconciler).
    • We should also potentially select an implementation that allows for CDC: do we need to propagate changes for a single record downstream? That may help propagate ("stream") the most up-to-date data to the FE view and other potential consumers.

  • For a prototype (being worked on by Alessandro/Chirag): as the original issue mentions, we can store the data in memory or as a JSON file similar to geth, but that is an initial solution and probably doesn't make sense for an ethernodes-like service/application.

  • How should users access this data? Probably by wrapping our datastore with a service layer: users can access it via a programmatic API, and via a frontend similar to ethernodes/bitnodes etc. The API design needs to be thought through (RESTful or not, etc.).

  • Sybil resistance: what to do here? This needs to be thought through and answered in the design doc. Alessandro: randomly sampling block hashes and other data begins to protect against fake nodes/peers that are effectively sending bogus spam data. We should figure out an acceptable probabilistic threshold, given the data sampling, for distinguishing a “real” peer from a “fake” peer. Also look into the geth crawler and other open-source crawlers to see how they do it.
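As a sketch of the K/V direction raised in the persistence question above, a composite peer-ID/timestamp key makes one peer's history a single contiguous range scan, while still allowing full-table scans. All names here are hypothetical, and string keys stand in for byte keys:

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Composite key "peer_id/timestamp". The timestamp is zero-padded so
/// lexicographic order matches numeric order.
fn key(peer_id: &str, ts: u64) -> String {
    format!("{peer_id}/{ts:020}")
}

/// Range-scan all records for one peer. '0' is the first ASCII byte
/// after '/', so ["peer/", "peer0") exactly bounds the prefix.
fn history<'a>(kv: &'a BTreeMap<String, String>, peer_id: &str) -> Vec<(&'a String, &'a String)> {
    let lo = format!("{peer_id}/");
    let hi = format!("{peer_id}0");
    kv.range((Bound::Included(lo), Bound::Excluded(hi))).collect()
}

fn main() {
    let mut kv = BTreeMap::new();
    kv.insert(key("peer-a", 100), "status@100".to_string());
    kv.insert(key("peer-a", 200), "status@200".to_string());
    kv.insert(key("peer-b", 100), "status@100".to_string());
    let hist = history(&kv, "peer-a");
    assert_eq!(hist.len(), 2);
    assert!(hist[0].0.ends_with("100")); // oldest record first
    println!("kv layout ok");
}
```

The same key shape carries over to real K/V stores that support ordered range or prefix scans.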

Action Items:

  • @alessandromazza98 / @chirag-bgh will drive the initial impl of a prototype reth p2p crawl (similar to geth's devp2p crawl), focused on storing to JSON or an in-memory store.

  • @0xprames / @alessandromazza98 / @chirag-bgh will drive a collaborative design document (probably a 1-3 pager) for an eventual crawler “service” that will serve as the backend to a new, open-source, ethernodes-like frontend, and that users can plug into programmatically via API calls. This service should:

  • periodically crawl and persist relevant data to a datastore (this allows for timeseries data as discussed in issue)
  • allow for consumers (FE application will be one such consumer) to read data via a service api

First Draft ETA (for public review on this issue): Oct 6th. Will review on TG chat internally throughout as design doc is being worked on.

Also may make sense to add @Rjected to TG chat so he can help shepherd the progress on this from the reth team side.

RE: DB choice, I would not overthink it. I would primary-key nodes by IP (or another identifier), put each field in a column, probably on Postgres, and call it a day. For aggregate queries you can just query the SQL directly; the rows will number in the thousands / tens of thousands, so it'll be instantly queryable, and you get all the nice auto API generators on top of Postgres if needed. The rest looks good to me! Excited.


Yeah makes sense - that definitely will work. I want to make sure we can support timeseries data charting with pg, but otherwise just keeping it simple DB wise is fine.

I've shared the draft design doc on our TG with the HLD, and a few low level design decisions as well. I think should be good to start implementing the service.

Open to sharing a copy here as well - i could merge it into the repo as an MD and share a link.

Working on refactoring the current crawler code now (right now the codebase we have just subscribes to discovery updates and persists to a file, but we aren't "crawling" like geth does), and then going to code out the API server / DB wrapper.

updating here to make sure this issue doesn't go stale: we've got ~10k unique peers from less than a day of running our crawler (forcing "rotation" via restarts). This was run ~4-5 days ago.

(screenshot: crawler peer-count stats, 2023-10-18)

the apiserver also seems to work from local testing cc @alessandromazza98 who has been working on that!

Hoping to ship this soon (possibly by EOW, 2 weeks max). Currently working on ensuring the crawler runs without any slowdowns/perf issues (and matches the crawler impl here: https://github.com/ethereum/node-crawler/).

Excellent! We would be happy to host this under diversity.paradigm.xyz or something similar. Whether it's a raw dataset that we semi-regularly update, or a proper dashboard.

This issue is stale because it has been open for 21 days with no activity.

hey friends

rethcrawler.xyz is the current (mvp) launch!

probably needs a few things - but i think we achieved the spirit of the goal from the initial issue.

Feel free to work with @alessandromazza98 on anything else (domain migration, code cleanup etc)

there has been some discussion on extending this to track CL stats, and i think it's a good idea, but maybe out of scope for this specific issue.

there is the outstanding question of infra: we definitely want to migrate off of AWS. it was relatively quick to set up for this initial MVP, but it's a bit pricey.

I think @0xSmit had a good point w/ digital ocean. I've opened Keep-Reth-Strange/reth-crawler#70 to track this, and as discussed he may be taking this forward. it should be easy enough to replicate this setup with a cheaper infra provider!