This project is a collection of pipelines to get insights into your Python project. It also serves an educational purpose (YouTube videos and blog posts), showing how to build data pipelines with Python, SQL & DuckDB.
The project is a series in three parts:

- Ingestion (YouTube video)
- Transformation (TODO)
- Visualization (TODO)
The project requires Python 3.11 and Poetry for dependency management. There's also a devcontainer for VS Code. Finally, a Makefile is available to run common tasks.
A `.env` file is required to run the project. You can copy the `.env.example` file and fill in the required values:
```
TABLE_NAME=pypi_file_downloads # output table name
S3_PATH=s3://my-s3-bucket # output S3 path
AWS_PROFILE=default # AWS profile to use
GCP_PROJECT=my-gcp-project # GCP project to use
START_DATE=2023-04-01 # start date of the data to ingest
END_DATE=2023-04-03 # end date of the data to ingest
PYPI_PROJECT=duckdb # PyPI project to ingest
GOOGLE_APPLICATION_CREDENTIALS=/path/to/my/creds # path to GCP credentials
motherduck_token=123123 # MotherDuck token
TIMESTAMP_COLUMN=timestamp # timestamp column name, used for partitioning on S3
DESTINATION=local,s3,md # destinations to push data to, can be one or more
```
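For context, here is a minimal sketch of how these variables could be consumed in Python, assuming the python-dotenv package. The variable names come from the `.env` above; the loading code itself is illustrative, not the project's actual implementation:

```python
# Illustrative only: one way the pipeline *might* read this configuration.
# Assumes the python-dotenv package; the actual project may load settings differently.
import os

from dotenv import load_dotenv

load_dotenv()  # read variables from the .env file into the environment

table_name = os.environ["TABLE_NAME"]
start_date = os.environ["START_DATE"]
end_date = os.environ["END_DATE"]
pypi_project = os.environ["PYPI_PROJECT"]
# DESTINATION holds a comma-separated list, e.g. "local,s3,md"
destinations = os.environ["DESTINATION"].split(",")
```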
Running the pipelines also requires:

- A GCP account
- An AWS S3 bucket (optional, to push data to S3) and AWS credentials (at the default `~/.aws/credentials` path) with write access to the bucket
- A MotherDuck account (optional, to push data to MotherDuck)
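These destinations map naturally onto DuckDB features. Below is a hedged sketch, not the project's actual code: the table, file, and bucket names are placeholders, and the S3 secret syntax assumes a recent DuckDB (0.10+). It shows a local DuckDB file, a partitioned Parquet `COPY` to S3, and a MotherDuck connection that picks up the `motherduck_token` environment variable:

```python
# Illustrative only: how the three DESTINATION targets could be written to
# with DuckDB. Names are placeholders, not taken from the project's code.
import duckdb

# "local" destination: a DuckDB database file on disk.
con = duckdb.connect("pypi.duckdb")
con.sql("""
    CREATE TABLE IF NOT EXISTS pypi_file_downloads AS
    SELECT TIMESTAMP '2023-04-01 00:00:00' AS timestamp, 'duckdb' AS project
""")

# "s3" destination: DuckDB's httpfs extension can COPY a table to S3,
# partitioned by the TIMESTAMP_COLUMN from the .env file.
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")
# Assumes DuckDB 0.10+; credential_chain picks up ~/.aws/credentials.
con.sql("CREATE SECRET (TYPE s3, PROVIDER credential_chain)")
con.sql("""
    COPY pypi_file_downloads
    TO 's3://my-s3-bucket/pypi_file_downloads'
    (FORMAT parquet, PARTITION_BY (timestamp))
""")

# "md" destination: with the motherduck_token environment variable set,
# connecting to "md:" opens a MotherDuck connection.
md_con = duckdb.connect("md:")
```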
Once you have filled in your `.env` file, you can simply use the make command:

```bash
make pypi-ingest
```
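Conceptually, the ingestion step boils down to querying the public PyPI download logs on BigQuery (the well-known `bigquery-public-data.pypi.file_downloads` table) for `PYPI_PROJECT` between `START_DATE` and `END_DATE`. The sketch below illustrates that idea with the google-cloud-bigquery client; it is an assumption about the mechanism, not the project's actual code:

```python
# Illustrative only: conceptually, ingestion queries the public BigQuery
# dataset of PyPI download logs. This is a sketch, not the project's code.
from google.cloud import bigquery  # authenticates via GOOGLE_APPLICATION_CREDENTIALS

client = bigquery.Client(project="my-gcp-project")  # GCP_PROJECT from .env

query = """
    SELECT timestamp, country_code, file.project AS project, file.version AS version
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE file.project = @pypi_project
      AND timestamp BETWEEN TIMESTAMP(@start_date) AND TIMESTAMP(@end_date)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("pypi_project", "STRING", "duckdb"),
        bigquery.ScalarQueryParameter("start_date", "STRING", "2023-04-01"),
        bigquery.ScalarQueryParameter("end_date", "STRING", "2023-04-03"),
    ]
)
rows = client.query(query, job_config=job_config).result()
```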