Apache Hudi (Incubating) (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals
.
Hudi manages the storage of large analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage).
- Upsert support with fast, pluggable indexing
- Atomically publish data with rollback support
- Snapshot isolation between writer & queries
- Savepoints for data recovery
- Manages file sizes, layout using statistics
- Async compaction of row & columnar data
- Timeline metadata to track lineage
Hudi provides the ability to query via three types of views:
- Read Optimized View - Provides excellent snapshot query performance via purely columnar storage (e.g. Parquet)
- Incremental View - Provides a change stream with records inserted or updated after a point in time.
- Real-time View - Provides snapshot queries on real-time data, using a combination of columnar & row-based storage (e.g Parquet + Avro)
Learn more about Hudi at https://hudi.apache.org
Hudi requires Java 8 to be installed on a *nix system. Check out code and normally build the maven project, from command line:
# Checkout code and build
git clone https://github.com/apache/incubator-hudi.git && cd incubator-hudi
mvn clean package -DskipTests -DskipITs
Try https://hudi.apache.org/quickstart.html to quickly explore Hudi's capabilities using spark-shell.