altair code4rena github-api jupyter-notebook python scraping

Important

This repository is archived due to significant changes to both Code4rena's website and repos since the project started over a year ago. Additionally, Code4rena has since started providing data through their community resources. A new repo, code4rena-stats, is available for the charts and insights.

Archived README.md

code4rena-scraper

Scraping Code4rena contest audit reports for stats, fun (and profit ?).

For accurate prize money numbers check the Code4rena leaderboard directly.

Why ?

To play around with the Github API and work on my Python scripting skills. It also gave me the chance to work with data analysis tools such as Jupyter notebooks, Pandas for manipulating the data, and Altair, a visualization framework for generating charts.

In the beginning, I was curious because I had found out that the audit report repos contain the address of each participant for sending their prize money (see here for example, in the .json files). I thought it would be interesting to try and track the flow of funds (which could be an issue if certain people want to stay anonymous on this platform). However, this part is currently left out, and the project quickly evolved into extracting data and building statistics from the Code4rena contests.

Also, I realized after a week of working on this project that the website repo of Code4rena already contains data for contests, findings and handles but hey, I learned a lot about the scraping process !

What ?

Data is scraped from the Code4rena published audit repos using the Github API, as well as directly from the leaderboard and contest entries of the Code4rena website, and is parsed into CSV files. The original CSV files can also be used directly from the Code4rena repo in the contests/ and findings/ folders.

Part of the extracted data can be used to link ETH/Polygon addresses to contest participants. Tools like polygonscan, etherscan or Bitquery make it possible to look at the flow of funds from and to those wallets (this part hasn't been implemented or explored much yet).

Is it useful ? Probably not.

Worth the time ? I'd say yes, as it gave me insights into how to track funds across different chains (Polygon, Ethereum mainnet, etc.).

Also, the extracted data makes it possible to see who might be the most efficient, who writes the most duplicates, the percentage of invalid submissions, etc.

Jupyter notebooks

Notebooks can be found in the charts_data folder to visualize the data. A link is provided below each chart for a static view of each notebook. For an interactive lab, you could set up your own locally or run one online.

You can also run non-interactive notebooks through nbviewer or view the static generated html at https://krow10.github.io/code4rena-scraper/.

How ?

Install all requirements with pip install -r requirements.txt and set up your own Github access token in the .env file.

Then use main.py [leaderboard|contests|github|all] to fetch and parse the latest data in CSV files. A Github action is available for updating the CSV files in this repo directly.
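As a rough illustration of what the access token is for, the sketch below builds authenticated headers and an issues URL for the GitHub REST API. The environment variable name GITHUB_TOKEN and the helper names are assumptions made for this example, not necessarily what main.py uses; the token is expected to have been loaded from the .env file into the environment beforehand.

```python
import os

GITHUB_API = "https://api.github.com"


def github_headers(token_var: str = "GITHUB_TOKEN") -> dict:
    """Build GitHub REST API headers from an environment variable.

    The variable name is an assumption; use whatever key your .env defines.
    """
    token = os.environ.get(token_var)
    if not token:
        raise RuntimeError(f"Set {token_var} (for example via the .env file) first")
    return {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }


def issues_url(owner: str, repo: str) -> str:
    """URL of the 'list repository issues' endpoint for one repo."""
    return f"{GITHUB_API}/repos/{owner}/{repo}/issues"
```

These headers can be passed to any HTTP client. Authenticated requests get a much higher rate limit than anonymous ones, which matters when walking through every issue of every audit repo.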

Currently, the extracted data from the Github API (github_code4rena.csv) looks like this:

contest_id	handle	address	risk	title	issueId	issueUrl	contest_sponsor	date	tags	issueCreation
Identifies the contest	Name of the warden	Polygon address	Characterizes the submission severity (0 to 3, G for gas optimization, Q for QA)	Title of the submission	Github issue number	Github issue URL (unused)	Contest sponsor extracted from the repo's name	Contest running date extracted from the repo's name	Tags associated with the issue (further characterize the submission)	Creation time of the issue

So each line in the CSV file corresponds to one submission (identified by the issueId) from a warden (identified by their (handle, address) pair) for a given contest (identified by the contest_id).
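The contest_sponsor and date columns are derived from the repo's name. A minimal sketch of that extraction is shown here, assuming repos are named like `<YYYY>-<MM>-<sponsor>-findings`; both the naming pattern and the `parse_repo_name` helper are assumptions for illustration, and the actual parsing in this project may differ.

```python
import re

# Assumed naming convention: "<YYYY>-<MM>-<sponsor>-findings"
REPO_NAME = re.compile(r"^(?P<date>\d{4}-\d{2})-(?P<sponsor>.+)-findings$")


def parse_repo_name(name: str):
    """Return (sponsor, date) for a findings repo, or None if the name does not match."""
    match = REPO_NAME.match(name)
    if match is None:
        return None
    return match.group("sponsor"), match.group("date")
```

Returning None for non-matching names makes it easy to skip repos in the organization that are not contest reports.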

The data can then be imported into a Jupyter notebook (or anywhere else you want to parse it) for easy processing and visualization, like so:

import pandas as pd
import altair as alt

alt.data_transformers.disable_max_rows() # Disable 5_000 rows limit
data = pd.read_csv("../github_code4rena.csv") # Set path accordingly

# Visualize whatever (see https://altair-viz.github.io)
alt.Chart(...)
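As an example of the kind of statistic mentioned earlier (share of duplicate or invalid submissions per warden), here is a small sketch that uses only the standard library so it also runs outside a notebook. The sample rows are made up, and the tag labels ("duplicate", "invalid") and the ";" separator are assumptions; check the actual content of the tags column before relying on them.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the shape of github_code4rena.csv (subset of columns)
SAMPLE = """contest_id,handle,risk,tags
1,alice,2,bug
1,alice,G,duplicate
1,bob,3,invalid
2,bob,1,duplicate;invalid
"""


def tag_share(rows, tag, sep=";"):
    """Fraction of each warden's submissions that carry the given tag."""
    totals = defaultdict(int)
    tagged = defaultdict(int)
    for row in rows:
        totals[row["handle"]] += 1
        if tag in row["tags"].split(sep):
            tagged[row["handle"]] += 1
    return {handle: tagged[handle] / totals[handle] for handle in totals}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
duplicate_share = tag_share(rows, "duplicate")
invalid_share = tag_share(rows, "invalid")
```

The same computation can be done in the notebook with a Pandas groupby on handle, and the resulting shares can be fed straight into an Altair bar chart.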

For the leaderboard (leaderboard_code4rena.csv), the data looks like this:

period	handle	is_team	prize_money	total_reports	high_all	high_solo	med_all	med_solo	gas_all
The period the data covers	Name of the warden	Boolean indicating if the handle refers to a team or not	Total earnings for the period (in $USD)	Total accepted reports for the period	High severity issues found with others	High severity issues found alone	Medium severity issues found with others	Medium severity issues found alone	Gas optimization reports submitted

And for the contests (contests_code4rena.csv), the data looks like this:

contest_report_repo	contest_sponsor	contest_desc	start	end	prize_pool	handle	prize_money	total_reports	high_all	high_solo	med_all	med_solo	gas_all
The name of the Github repo for the contest audit report or empty if not published yet	Name of the contest sponsor (lowercase, stripped)	Description of the contest sponsor	Starting date of the contest	Ending date of the contest	Total prize pool (calculated from the sum of the wardens' prize money)	Name of the warden	Total earnings for the contest (in $USD)	Total accepted reports for the contest	High severity issues found with others	High severity issues found alone	Medium severity issues found with others	Medium severity issues found alone	Gas optimization reports submitted
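Since prize_pool is described as the sum of the wardens' prize money, it can be recomputed from the per-warden rows. The sketch below does this on made-up sample rows; keying a contest by (contest_sponsor, start) is an assumption made here so that a sponsor running several contests is not merged into one pool.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the shape of contests_code4rena.csv (subset of columns)
SAMPLE = """contest_sponsor,start,handle,prize_money
acme,2021-05-01,alice,1000.5
acme,2021-05-01,bob,499.5
widgets,2021-06-10,alice,250
"""


def prize_pools(rows):
    """Sum prize_money per contest, keyed by (contest_sponsor, start)."""
    pools = defaultdict(float)
    for row in rows:
        key = (row["contest_sponsor"], row["start"])
        pools[key] += float(row["prize_money"])
    return dict(pools)


pools = prize_pools(csv.DictReader(io.StringIO(SAMPLE)))
```

Comparing these sums with the prize_pool column is a quick sanity check that no warden rows were dropped during scraping.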

Next ?
