Aggregate GitHub star counts from GH Archive data in BigQuery and publish as Parquet.
- GH Archive only records star events, not un-stars, so star counts are only approximate (slightly overstated).
- To avoid noise, we count unique users starring a repo, not raw events.
- See the query here.
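The dedup rule described above (count distinct users, not raw star events) can be sketched in Python. The event tuples, repo names, and function below are illustrative only — they are not the repo's actual code or the real BigQuery query:

```python
# Count unique stargazers per repo rather than raw star events.
# The event data here is made up for illustration.
events = [
    ("octocat/hello-world", "alice"),
    ("octocat/hello-world", "alice"),  # same user, duplicate star event
    ("octocat/hello-world", "bob"),
]

def count_unique_stars(events):
    """Map each repo to the number of distinct users who starred it."""
    stargazers = {}
    for repo, user in events:
        stargazers.setdefault(repo, set()).add(user)
    return {repo: len(users) for repo, users in stargazers.items()}

print(count_unique_stars(events))  # {'octocat/hello-world': 2}
```

In BigQuery the same effect comes from `COUNT(DISTINCT ...)` over the actor rather than `COUNT(*)` over events.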
You can download the latest Parquet file from:
https://raw.githubusercontent.com/holdenmatt/github-stargazers/main/data/github-repos.parquet
First, install dependencies:

```sh
$ python -m venv venv
$ source venv/bin/activate
$ python -m pip install --upgrade pip
$ pip install -r src/requirements.txt
```
Then create a BigQuery service account:

- Sign in with a Google Account.
- Go to the BigQuery Console. If you have multiple Google accounts, make sure you're using the correct one.
- Create a new GCP project.
- Create a new Service Account, and download the key as JSON (e.g. follow these instructions). For roles, add:
  - BigQuery Job User
  - BigQuery Read Session User
Create a `.env` file in the root of this repo, and add these variables, copying values from your JSON file:

```
GOOGLE_PROJECT_ID=<project_id>
GOOGLE_CLIENT_EMAIL=<user>@<host>.iam.gserviceaccount.com
GOOGLE_PRIVATE_KEY=<private key>
```
Run the script to generate a new Parquet file:

```sh
$ python src/main.py
```