![](https://private-user-images.githubusercontent.com/487433/243007231-91891250-821c-40b7-b9e7-8215382aeefe.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTIwNDA1MjksIm5iZiI6MTcxMjA0MDIyOSwicGF0aCI6Ii80ODc0MzMvMjQzMDA3MjMxLTkxODkxMjUwLTgyMWMtNDBiNy1iOWU3LTgyMTUzODJhZWVmZS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNDAyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDQwMlQwNjQzNDlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mNTkwOTlhMTk0ZmNmNjk1ZDU4NTdlMTE3MzVhMTAyNDU4OGIxNWYyODk0NTU2NmEzZmM1MjZiZGNiOTZhNzY2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.z-AtMuApBy5LeTbRoemHh-4dWMRvhpRYP3e77BP4O48)
This is a quick-and-dirty reference implementation for making sense of the public GitHub event data that GH Archive publishes through BigQuery public datasets.
The raw data is a bit rough: it is spread across yearly, monthly, and daily archive tables that are fairly large (terabytes). You probably want to bring only the orgs and repos you care about into a single table, and load it incrementally so it stays practical to query.
This dbt project does all this:
- brings all the archived tables into one centralized table in your own BigQuery project
- partitions by day, does incremental loads
- allows you to select just the repos you need
- rebuilds some state tables from the events table
- parses important information out of JSON blobs
- gets rid of redundant or not-so-useful-for-analytics information
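
As a rough illustration of how the steps above fit together, here is a minimal sketch of an incremental dbt model for BigQuery. It is not this project's actual model: the repo list, the `start_suffix` variable, the two-day reload window, and the output column names are all hypothetical, and it assumes the GH Archive daily tables (`githubarchive.day.YYYYMMDD`) expose `id`, `type`, `created_at`, `repo.name`, `actor.login`, and a JSON string `payload`.

```sql
-- Hypothetical model file, e.g. models/github_events.sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
        'field': 'created_at',
        'data_type': 'timestamp',
        'granularity': 'day'
    }
) }}

select
    id as event_id,
    type as event_type,
    created_at,
    repo.name as repo_name,
    actor.login as actor_login,
    -- pull a single field out of the JSON payload blob
    json_extract_scalar(payload, '$.action') as action
from `githubarchive.day.20*`  -- wildcard over daily tables; suffix looks like '240402'
where repo.name in ('my-org/repo-a', 'my-org/repo-b')  -- hypothetical repo list
{% if is_incremental() %}
  -- incremental run: rescan only the most recent daily tables
  and _table_suffix >= format_date('%y%m%d', date_sub(current_date(), interval 2 day))
{% else %}
  -- first run or full refresh: start from a configurable day (hypothetical var)
  and _table_suffix >= '{{ var("start_suffix", "240101") }}'
{% endif %}
```

With `insert_overwrite`, each incremental run replaces whole day partitions rather than appending rows, so rescanning the last couple of days does not create duplicates. Filtering on `_table_suffix` keeps BigQuery from scanning every archived table on each run.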