ev2900 / EMR_Studio_Hudi

Apache Hudi examples designed to be run on AWS Elastic MapReduce (EMR) via EMR Studio or EMR Notebooks


Apache Hudi Examples


Apache Hudi examples designed to be run on AWS Elastic MapReduce (EMR) via EMR Studio and/or EMR Notebooks.

Background on key concepts: if you are new to working with Hudi, it is worth reading about Hudi's timeline, file management, index, table types, query types, copy on write, and merge on read.

If you are not familiar with the core Hudi concepts or are new to Hudi, I highly recommend watching AWS re:Invent 2019: Insert, upsert, and delete data in Amazon S3 using Amazon.

Environment Set Up

The samples in this repository are designed to run on EMR via EMR Notebooks or EMR Studio. To set up your environment, follow the AWS documentation for EMR Notebooks or EMR Studio.

You can upload the .ipynb files in this repository directly to the Jupyter environments provided by EMR Notebooks / Studio.
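
A notebook session usually needs the Hudi jars on the Spark classpath before any Hudi read or write will work. The sketch below only builds the text of a sparkmagic `%%configure` cell that could be pasted as the first cell of a notebook. The jar paths are assumptions (they follow the layout used in AWS's Hudi documentation, where the jars are first copied to HDFS) and will differ by EMR release; `configure_cell` is a hypothetical helper, not something from this repository.

```python
import json

# Assumed jar locations; adjust to wherever the Hudi jars live on your cluster.
HUDI_JARS = [
    "hdfs:///apps/hudi/lib/hudi-spark-bundle.jar",
    "hdfs:///apps/hudi/lib/spark-avro.jar",
]


def configure_cell(jars=HUDI_JARS):
    """Return the text of a %%configure cell that loads the Hudi jars."""
    conf = {
        "conf": {
            "spark.jars": ",".join(jars),
            # Hudi expects Kryo serialization in the Spark session.
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            # Keep Spark from bypassing Hudi's input format for Hive tables.
            "spark.sql.hive.convertMetastoreParquet": "false",
        }
    }
    return "%%configure -f\n" + json.dumps(conf, indent=2)


# Print the cell text so it can be copied into the notebook.
print(configure_cell())
```

Recent EMR releases may not need the separate spark-avro jar, so treat the list as a starting point rather than a requirement.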

Copy on Write

The notebooks in copy_on_write are the best place to start. They cover working with data in Hudi copy on write tables, including

  • Writing data to S3
  • Reading data from S3
  • Upserting data
  • Incremental querying
  • Point in Time querying
  • Deleting Data

Both Python and Scala notebooks are available.
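
As a rough orientation before opening the notebooks, the operations listed above mostly come down to which Hudi options are passed to Spark. The following is a minimal sketch, not the notebooks' actual code: the table name, S3 path, and the column names `id`, `region`, and `updated_at` are all hypothetical, and the two helper functions are made up for illustration. The option keys themselves are standard Hudi datasource options.

```python
# Hypothetical names for illustration; none of these come from the repository.
TABLE_NAME = "example_cow_table"
BASE_PATH = "s3://<your-bucket>/hudi/example_cow_table"


def cow_write_options(operation="upsert"):
    """Build Hudi options for writing to a copy on write table."""
    allowed = {"insert", "upsert", "delete", "bulk_insert"}
    if operation not in allowed:
        raise ValueError(f"unsupported operation: {operation}")
    return {
        "hoodie.table.name": TABLE_NAME,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": operation,
        # Assumed schema: 'id' identifies a record, 'region' partitions the
        # table, and 'updated_at' decides which duplicate wins on upsert.
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.partitionpath.field": "region",
        "hoodie.datasource.write.precombine.field": "updated_at",
    }


def incremental_read_options(begin_instant, end_instant=None):
    """Build Hudi options for an incremental or point in time query.

    Passing an end instant bounds the query, which is how a point in time
    read is usually expressed (for example begin "000" plus an end instant).
    """
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts


# In a notebook with a Spark session and the Hudi bundle loaded, usage
# would look roughly like this (df is any DataFrame with the columns above):
#   df.write.format("hudi").options(**cow_write_options("upsert")) \
#       .mode("append").save(BASE_PATH)
#   spark.read.format("hudi") \
#       .options(**incremental_read_options("000", "20240101000000")) \
#       .load(BASE_PATH)
```

Keeping the options in plain dictionaries makes it easy to see that an upsert and a delete differ only in the `hoodie.datasource.write.operation` value.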

Merge on Read

The notebook in merge_on_read is the best next step once you understand the copy_on_write notebook(s). The merge_on_read notebook covers

  • Writing data to S3
  • Upserting data
  • Snapshot queries
  • Read optimized queries
  • Compaction

Both Python and Scala notebooks are available.
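
For merge on read tables, the main differences from copy on write show up in the table type, the query type used when reading, and the compaction settings. The sketch below illustrates those three knobs under assumed names: the column names `id`, `region`, and `updated_at` and both helper functions are hypothetical, and the notebooks may configure compaction differently.

```python
def mor_write_options(table_name, inline_compaction=False, max_delta_commits=5):
    """Build Hudi options for upserting into a merge on read table."""
    opts = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        # Assumed schema, for illustration only.
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.partitionpath.field": "region",
        "hoodie.datasource.write.precombine.field": "updated_at",
    }
    if inline_compaction:
        # Compact log files into base files as part of the write, once the
        # given number of delta commits has accumulated.
        opts["hoodie.compact.inline"] = "true"
        opts["hoodie.compact.inline.max.delta.commits"] = str(max_delta_commits)
    return opts


def mor_read_options(read_optimized=False):
    """Choose between a snapshot query and a read optimized query.

    A snapshot query merges base files with log files at read time; a read
    optimized query reads only the compacted base files.
    """
    query_type = "read_optimized" if read_optimized else "snapshot"
    return {"hoodie.datasource.query.type": query_type}


# In a notebook with Spark and the Hudi bundle available, roughly:
#   df.write.format("hudi") \
#       .options(**mor_write_options("example_mor_table", inline_compaction=True)) \
#       .mode("append").save("s3://<your-bucket>/hudi/example_mor_table")
#   spark.read.format("hudi").options(**mor_read_options(read_optimized=True)) \
#       .load("s3://<your-bucket>/hudi/example_mor_table")
```

Until compaction runs, a read optimized query can return older data than a snapshot query on the same table, which is the trade-off the two query types are meant to expose.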

Future Improvements to this Repo

  • Hudi SQL example(s)
  • Hudi time travel example(s)


