press0 / csv2parquet

Apache Arrow CSV Parquet Pandas Interop

CSV <=> Parquet transform utility powered by Apache Arrow.

Apache Arrow is a columnar in-memory data format; its libraries read and write Apache Parquet files and exchange data with Apache Spark.

Motivation

Cloud data platforms rely on Parquet; data analysts rely on CSV.

Getting Started

Install the application by running the following commands:

git clone https://github.com/press0/csv2parquet.git
cd csv2parquet
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Test objective

Verify that the CSV <=> Parquet transform is reversible.

Approach:

Transform a CSV file into a Parquet file.  
Transform the Parquet file back into a second CSV file.  
Load both CSV files into Pandas DataFrames.  
Compare the two DataFrames for equality.

Run the test:

python TestCSV2Parquet.py

CSV <=> Parquet transform with AWS Glue

A PySpark script performs the CSV to Parquet transform on the AWS Glue service.

Approach:

Create an AWS Glue crawler.  
Create an AWS Glue Data Catalog table.  
Create an AWS Glue ETL job.  
Create an AWS IAM policy.  
Create an AWS S3 bucket.  
Upload the CSV file to S3.  
Run the AWS Glue crawler.  
Run the AWS Glue ETL job.  
Download the Parquet file from S3.  
Load the Parquet data into a pyarrow.Table.

About

License: Apache License 2.0

Languages

Python 100.0%