preset-io / dbt_github_archive_bigquery

A dbt project to make GitHub archive events useful

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

dbt reference implementation for Github Archive

This is a quick and dirty reference implementation to make sense of the GitHub public information made available by the GH Archive through BigQuery public datasets.

That data is a bit rough in a bunch of yearly/monthly/daily archived tables that are fairly large (TBs) and you probably want to bring only the orgs/repos you care about in a single table, and hopefully do some decent incremental loads to make this queryable.

This dbt project does all this:

  • brings all the archived tables in one centralized table in your local BigQuery project
  • partitions by day, does incremental loads
  • allows you to select just the repos you need
  • rebuils some state tables off of the events table
  • parses out important information out of JSON blobs
  • get rid of redundant or not-so-useful-for-analytics information

About

A dbt project to make GitHub archive events useful


Languages

Language:Python 100.0%