djouallah / Light_ETL_Challenge

comparing ETL performance of Python Data Engines


Light_ETL_Challenge

Note: Microsoft Fabric notebooks support two runtimes: Spark and pure Python (private preview).

This is just a personal project and does not represent the views of my employer.

Extract data from a CSV file in which the number of columns is higher than what's in the header, filter a subset of the data, and export it to Delta Lake. I started with DuckDB, Polars, Pandas, PySpark, PyArrow, and Ibis, but I expect more engines, like chDB, Raft, etc.
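To make the "more columns than the header" problem concrete, here is a minimal, engine-agnostic sketch using only the Python standard library. It is not the code from the notebooks: the function names, the `extra_` prefix, the sample data, and the `kind` column are all hypothetical, and the Delta Lake export step is left out.

```python
import csv
import io


def read_ragged_csv(text, extra_prefix="extra_"):
    """Parse CSV text whose data rows may carry more fields than the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Name surplus fields extra_1, extra_2, ... so no value is dropped.
        surplus = len(fields) - len(header)
        names = header + [f"{extra_prefix}{i}" for i in range(1, surplus + 1)]
        rows.append(dict(zip(names, fields)))
    return rows


def filter_rows(rows, column, wanted):
    """Keep only the rows whose `column` equals `wanted`."""
    return [row for row in rows if row.get(column) == wanted]


# Hypothetical sample: the header has 3 columns, some rows have 4 or 5 fields.
SAMPLE = (
    "kind,unit,value\n"
    "D,A1,10,2024-01-01\n"
    "I,A2,11\n"
    "D,A3,12,2024-01-02,x\n"
)

rows = read_ragged_csv(SAMPLE)
subset = filter_rows(rows, "kind", "D")
```

Each engine in the challenge has its own way of handling such files; the sketch only shows the shape of the task (parse ragged rows, then filter) before the result is written to Delta Lake.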

The script will download 60 files, around 4 GB uncompressed.

Please keep the local paths as follows:

for raw data: raw_landing='/lakehouse/default/Files/raw'

for Delta Python: '/lakehouse/default/Tables/'

On Fabric, the script will automatically detect Spark.
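A minimal sketch of how the two paths and a Spark check could sit together at the top of a notebook. The detection method is an assumption (the README does not say how the script detects Spark); here it simply checks whether the `pyspark` package is importable. The names `delta_root`, `spark_available`, and `engine_kind` are hypothetical.

```python
import importlib.util

# Paths taken from the README; keep them unchanged.
raw_landing = '/lakehouse/default/Files/raw'
delta_root = '/lakehouse/default/Tables/'  # hypothetical variable name


def spark_available():
    """Return True if the pyspark package can be imported in this runtime.

    Assumption: the real notebooks may use a different check.
    """
    return importlib.util.find_spec("pyspark") is not None


# Pick the runtime flavour once, then branch on it when choosing engines.
engine_kind = "spark" if spark_available() else "python"
```

With a check like this, the same notebook can run on either Fabric runtime and skip the PySpark test when only pure Python is available.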




Languages

Language:Jupyter Notebook 100.0%