hudi emr elastic-map-reduce apache-hudi hudi-examples aws

Apache Hudi Examples

Apache Hudi examples designed to run on AWS Elastic MapReduce (EMR) via EMR Studio or EMR Notebooks.

For background on key concepts: if you are new to working with Hudi, it is worth reading about Hudi's timeline, file management, indexing, table types (copy on write and merge on read), and query types.

If you are not familiar with the core Hudi concepts or are new to Hudi, I highly recommend watching AWS re:Invent 2019: Insert, upsert, and delete data in Amazon S3 using Amazon.

Environment Setup

The samples in this repository are designed to run on EMR via EMR Notebooks or EMR Studio. To set up your environment, follow the AWS documentation for EMR Notebooks or EMR Studio.

You can upload the .ipynb files in this repository directly to the Jupyter environments provided by EMR Notebooks or EMR Studio.

Copy on Write

The notebooks in copy_on_write are the best place to start. They cover working with data through Hudi using copy-on-write tables, including:

Writing data to S3
Reading data from S3
Upserting data
Incremental querying
Point in Time querying
Deleting Data

Both Python and Scala notebooks are available.
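
As a rough orientation before opening the notebooks, the sketch below shows how the copy-on-write operations listed above typically map onto Hudi's Spark datasource options. It is a minimal sketch, not code taken from the notebooks: the table name, S3 path, and column names (trip_id, city, updated_at) are hypothetical placeholders, and the helpers that touch Spark assume a session with the Hudi bundle already loaded.

```python
# Hypothetical placeholders: replace with your own table name, bucket, and columns.
TABLE_NAME = "trips_cow"
BASE_PATH = "s3://my-bucket/hudi/trips_cow"


def write_options(operation="upsert"):
    """Hudi write options for a copy-on-write table.

    operation is one of Hudi's write operations, e.g. "insert",
    "upsert", "bulk_insert", or "delete".
    """
    return {
        "hoodie.table.name": TABLE_NAME,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": operation,
        # Assumed column names, for illustration only.
        "hoodie.datasource.write.recordkey.field": "trip_id",
        "hoodie.datasource.write.partitionpath.field": "city",
        "hoodie.datasource.write.precombine.field": "updated_at",
    }


def incremental_read_options(begin_instant, end_instant=None):
    """Options for an incremental query starting after begin_instant.

    Supplying end_instant as well bounds the range, which is how a
    point-in-time query is usually expressed (begin "000" = earliest).
    """
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts


def write_hudi(df, operation="upsert", mode="append"):
    """Write a Spark DataFrame to the table (needs a live Spark session)."""
    df.write.format("hudi").options(**write_options(operation)).mode(mode).save(BASE_PATH)


def read_incremental(spark, begin_instant, end_instant=None):
    """Read only the records committed in the given instant range."""
    opts = incremental_read_options(begin_instant, end_instant)
    return spark.read.format("hudi").options(**opts).load(BASE_PATH)
```

In a notebook, the first write would typically be write_hudi(df, "insert", mode="overwrite"), later changes write_hudi(updates_df, "upsert"), and deletes write_hudi(keys_df, "delete"), where the DataFrames are ones you have prepared yourself.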

Merge on Read

The notebooks in merge_on_read are the best next step once you understand the copy_on_write notebooks. They cover:

Writing data to S3
Upserting data
Snapshot queries
Read optimized queries
Compaction

Both Python and Scala notebooks are available.
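
The merge-on-read topics above differ from copy on write mainly in the table type, the choice between snapshot and read-optimized queries, and compaction. The following is a minimal sketch of the relevant Hudi options under assumed names: the table name, S3 path, and columns are hypothetical, inline compaction is only one of several ways to schedule compaction, and the notebooks may configure it differently.

```python
# Hypothetical placeholders: replace with your own table name, bucket, and columns.
MOR_TABLE_NAME = "trips_mor"
MOR_BASE_PATH = "s3://my-bucket/hudi/trips_mor"


def mor_write_options(operation="upsert", inline_compaction=False, max_delta_commits=5):
    """Hudi write options for a merge-on-read table.

    With inline_compaction=True, Hudi compacts log files into base files
    as part of the write once max_delta_commits delta commits accumulate.
    """
    opts = {
        "hoodie.table.name": MOR_TABLE_NAME,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": operation,
        # Assumed column names, for illustration only.
        "hoodie.datasource.write.recordkey.field": "trip_id",
        "hoodie.datasource.write.partitionpath.field": "city",
        "hoodie.datasource.write.precombine.field": "updated_at",
    }
    if inline_compaction:
        opts["hoodie.compact.inline"] = "true"
        opts["hoodie.compact.inline.max.delta.commits"] = str(max_delta_commits)
    return opts


def mor_query_options(query_type):
    """Options selecting a snapshot or read-optimized query.

    "snapshot" merges base files with pending log files at read time;
    "read_optimized" reads only the compacted base files.
    """
    if query_type not in ("snapshot", "read_optimized"):
        raise ValueError(f"unsupported query type: {query_type}")
    return {"hoodie.datasource.query.type": query_type}


def read_mor(spark, query_type="snapshot"):
    """Load the table with the chosen query type (needs a live Spark session)."""
    return spark.read.format("hudi").options(**mor_query_options(query_type)).load(MOR_BASE_PATH)
```

Comparing read_mor(spark, "snapshot") with read_mor(spark, "read_optimized") right after an upsert, and again after compaction, is one way to see the trade-off between freshness and read cost.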

Future Improvements to this Repo

Hudi SQL example(s)
Hudi time travel example(s)

