holdenmatt / github-stargazers

Aggregate GitHub star counts from GH Archive (via BigQuery) and publish as Parquet.



Caveats

  • GH Archive records star events but not un-stars, so star counts are approximate (slightly overstated).
  • To reduce noise, we count the unique users who starred a repo, not raw events.
  • See the query here.
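The de-duplicated count above can be sketched as a BigQuery aggregation. The SQL below is an illustrative assumption (table wildcard and column names may differ from the repo's actual query):

```python
# Sketch of the star-count aggregation over GH Archive data.
# The table name and schema here are assumptions; see the repo's
# actual query for the real version.
STAR_COUNT_SQL = """
SELECT
  repo.name AS repo_name,
  -- Count unique users, not raw WatchEvents, to reduce noise.
  COUNT(DISTINCT actor.login) AS stargazers
FROM `githubarchive.month.*`
WHERE type = 'WatchEvent'
GROUP BY repo_name
ORDER BY stargazers DESC
"""

def run_query(sql: str):
    """Run the query with the google-cloud-bigquery client.

    Requires credentials to be configured (see "Run locally" below).
    """
    from google.cloud import bigquery  # assumed installed via requirements
    client = bigquery.Client()
    return client.query(sql).to_dataframe()
```

Note that `WatchEvent` is the event type GH Archive uses for stars, a frequent source of confusion.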

Download data

You can download the latest Parquet file from:

https://raw.githubusercontent.com/holdenmatt/github-stargazers/main/data/github-repos.parquet
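For example, a minimal stdlib-only download sketch (the destination filename is arbitrary):

```python
import urllib.request

PARQUET_URL = (
    "https://raw.githubusercontent.com/holdenmatt/github-stargazers/"
    "main/data/github-repos.parquet"
)

def download_latest(dest: str = "github-repos.parquet") -> str:
    """Download the published Parquet file to a local path."""
    urllib.request.urlretrieve(PARQUET_URL, dest)
    return dest
```

The downloaded file can then be opened with any Parquet reader, e.g. `pandas.read_parquet` or `pyarrow.parquet.read_table`.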

Run locally

First, install dependencies:

$ python -m venv venv
$ source venv/bin/activate
$ python -m pip install --upgrade pip
$ pip install -r src/requirements.txt

Then create a BigQuery service account:

  1. Sign in with a Google Account.

  2. Go to the BigQuery Console. If you have multiple Google accounts, make sure you’re using the correct one.

  3. Create a new GCP project.

  4. Create a new Service Account, and download the key as JSON (e.g. follow these instructions). For roles, add:

    - BigQuery Job User
    - BigQuery Read Session User
    

Create a .env file in the root of this repo and add these variables, copying the values from your service-account JSON key:

GOOGLE_PROJECT_ID=<project_id>
GOOGLE_CLIENT_EMAIL=<user>@<host>.iam.gserviceaccount.com
GOOGLE_PRIVATE_KEY=<private key>
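A minimal sketch of turning those variables into a service-account info dict (the exact handling in the repo's script may differ; the dict shape and newline fix are common conventions, not taken from this repo). The resulting dict is what `google.oauth2.service_account.Credentials.from_service_account_info` consumes:

```python
import os

def service_account_info() -> dict:
    """Build a service-account info dict from the .env variables.

    Pass the result to
    google.oauth2.service_account.Credentials.from_service_account_info().
    """
    return {
        "type": "service_account",
        "project_id": os.environ["GOOGLE_PROJECT_ID"],
        "client_email": os.environ["GOOGLE_CLIENT_EMAIL"],
        # .env files often store the key with escaped newlines;
        # convert the literal "\n" sequences back to real newlines.
        "private_key": os.environ["GOOGLE_PRIVATE_KEY"].replace("\\n", "\n"),
        "token_uri": "https://oauth2.googleapis.com/token",
    }
```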

Run the script to generate a new Parquet file:

$ python src/main.py

License: MIT