Aggregate GitHub star counts from GH Archive data in BigQuery and publish as Parquet.
- GH Archive only records star events, not un-stars, so star counts are only approximate (slightly overstated).
- To avoid noise, we count unique users starring a repo, not raw events.
- See the query here.
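The dedup rule described above (count distinct users, not raw star events) can be sketched in Python. The event tuples, repo names, and function below are illustrative only — they are not the repo's actual code or the real BigQuery query:

```python
# Count unique stargazers per repo rather than raw star events.
# The event data here is made up for illustration.
events = [
    ("octocat/hello-world", "alice"),
    ("octocat/hello-world", "alice"),  # same user, duplicate star event
    ("octocat/hello-world", "bob"),
]

def count_unique_stars(events):
    """Map each repo to the number of distinct users who starred it."""
    stargazers = {}
    for repo, user in events:
        stargazers.setdefault(repo, set()).add(user)
    return {repo: len(users) for repo, users in stargazers.items()}

print(count_unique_stars(events))  # {'octocat/hello-world': 2}
```

In BigQuery the same effect comes from `COUNT(DISTINCT ...)` over the actor rather than `COUNT(*)` over events.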
You can download the latest Parquet file from:
https://raw.githubusercontent.com/holdenmatt/github-stargazers/main/data/github-repos.parquet
First, install dependencies:

```sh
$ python -m venv venv
$ source venv/bin/activate
$ python -m pip install --upgrade pip
$ pip install -r src/requirements.txt
```
Then create a BigQuery service account:

- Sign in with a Google Account.
- Go to the BigQuery Console. If you have multiple Google accounts, make sure you're using the correct one.
- Create a new GCP project.
- Create a new Service Account, and download the key as JSON (e.g. follow these instructions). For roles, add:
  - BigQuery Job User
  - BigQuery Read Session User
Create a `.env` file in the root of this repo, and add these variables, copying values from your JSON file:

```
GOOGLE_PROJECT_ID=<project_id>
GOOGLE_CLIENT_EMAIL=<user>@<host>.iam.gserviceaccount.com
GOOGLE_PRIVATE_KEY=<private key>
```
Run the script to generate a new Parquet file:

```sh
$ python src/main.py
```