paypal / dione

Dione - a Spark and HDFS indexing library

Trim common HDFS prefix in index DF

shay1bz opened this issue

Try to recognize common path prefixes at runtime and trim them.
For example, files in a standard table might look like:

hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet
...

On read, before the shuffle, we can trim the common prefix to reduce the shuffle size.
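
A minimal sketch of the idea, in plain Python with no Spark dependency. This is not Dione's API: the helper names `common_dir_prefix`, `trim_paths` and `restore_path` are hypothetical, and the sketch assumes the full list of file paths is known before the shuffle. The prefix is cut at a `/` boundary so that a file or directory name is never split.

```python
def common_dir_prefix(paths):
    """Return the longest prefix shared by all paths that ends with '/'."""
    if not paths:
        return ""
    # The lexicographically smallest and largest strings differ the most,
    # so their shared prefix is the shared prefix of the whole list.
    lo, hi = min(paths), max(paths)
    i = 0
    while i < len(lo) and i < len(hi) and lo[i] == hi[i]:
        i += 1
    # Cut back to the last '/' inside the shared prefix (empty if none).
    return lo[: lo.rfind("/", 0, i) + 1]


def trim_paths(paths):
    """Split paths into (shared prefix, list of paths relative to it)."""
    prefix = common_dir_prefix(paths)
    return prefix, [p[len(prefix):] for p in paths]


def restore_path(prefix, relative):
    """Rebuild the full path after the shuffle."""
    return prefix + relative


if __name__ == "__main__":
    files = [
        "hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet",
        "hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet",
    ]
    prefix, short = trim_paths(files)
    print(prefix)  # hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/
    print(short)   # ['part-0000.parquet', 'part-0001.parquet']
```

The prefix would be stored once (for example, broadcast or kept on the driver), and only the short relative paths would travel through the shuffle. With files from several partitions the shared prefix shrinks to the table directory, so the `dt=...` part stays in each relative path.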