hercules261188 / pypi-duck-flow

e2e data engineering project to get insights from PyPI using #python and #duckdb

Pypi Duck Flow: Get insights into your Python project 🐍

This project is a collection of pipelines to get insights into your Python project. It also serves an educational purpose (YouTube videos and blog posts) for learning how to build data pipelines with Python, SQL, and DuckDB.

The project is a series in three parts:

  • Ingestion (YouTube video)
  • Transformation (TODO)
  • Visualization (TODO)

Development

Setup

The project requires Python 3.11 and Poetry for dependency management. There's also a devcontainer for VS Code. Finally, a Makefile is available to run common tasks.

Env & credentials

A .env file is required to run the project. You can copy the .env.example file and fill in the required values.

TABLE_NAME=pypi_file_downloads # output table name
S3_PATH=s3://my-s3-bucket # output s3 path
AWS_PROFILE=default # aws profile to use
GCP_PROJECT=my-gcp-project # GCP project to use
START_DATE=2023-04-01 # start date of the data to ingest
END_DATE=2023-04-03 # end date of the data to ingest
PYPI_PROJECT=duckdb # pypi project to ingest
GOOGLE_APPLICATION_CREDENTIALS=/path/to/my/creds # path to GCP credentials
motherduck_token=123123 # MotherDuck token
TIMESTAMP_COLUMN=timestamp # timestamp column name, used for partitioning on S3
DESTINATION=local,s3,md # destinations to push data to, can be one or more
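
For reference, here is a minimal sketch of how such a file is typically consumed from Python, assuming python-dotenv; the project's actual config handling may differ:

# Illustrative sketch only (assumes python-dotenv); the project's own
# config loading may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

table_name = os.environ["TABLE_NAME"]                 # e.g. "pypi_file_downloads"
destinations = os.environ["DESTINATION"].split(",")   # e.g. ["local", "s3", "md"]
start_date = os.environ["START_DATE"]                 # e.g. "2023-04-01"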

Ingestion

Requirements

  • GCP account (used to query the PyPI download data; see the sketch after this list)
  • AWS S3 bucket and AWS credentials with write access to it, at the default ~/.aws/credentials path (optional, only needed to push data to S3)
  • MotherDuck account (optional, only needed to push data to MotherDuck)
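
PyPI download logs are publicly available in BigQuery, in the bigquery-public-data.pypi.file_downloads dataset, which is why a GCP account is needed. The sketch below shows the kind of parameterized query an ingestion step might run against it; the project's actual SQL and client code may differ:

# Sketch only: the project's actual query may differ. Assumes
# google-cloud-bigquery is installed and GOOGLE_APPLICATION_CREDENTIALS
# points to valid credentials (see the .env above).
import datetime

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

sql = """
SELECT timestamp, country_code, file.project AS project, file.version AS version
FROM `bigquery-public-data.pypi.file_downloads`
WHERE file.project = @pypi_project
  AND DATE(timestamp) BETWEEN @start_date AND @end_date
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("pypi_project", "STRING", "duckdb"),
        bigquery.ScalarQueryParameter("start_date", "DATE", datetime.date(2023, 4, 1)),
        bigquery.ScalarQueryParameter("end_date", "DATE", datetime.date(2023, 4, 3)),
    ]
)
rows = client.query(sql, job_config=job_config).result()

Note that file_downloads is a very large table, so keeping the date filter tight matters for query cost.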

Run

Once you have filled in your .env file, simply run the make target: make pypi-ingest
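
After the run completes, you can sanity-check the output with DuckDB. A hypothetical example, assuming DESTINATION included md and motherduck_token is set in the environment (adjust the table name to match your TABLE_NAME):

# Hypothetical check: the table name and destination depend on your .env.
import duckdb

con = duckdb.connect("md:")  # connects to MotherDuck via the motherduck_token env var
print(con.sql("""
    SELECT date_trunc('day', timestamp) AS day, count(*) AS downloads
    FROM pypi_file_downloads
    GROUP BY day
    ORDER BY day
"""))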
