Trim common HDFS prefix in index DF
shay1bz opened this issue · comments
Try to recognize common path prefixes at runtime and trim them.
For example, files in a standard table might look like:
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet
...
On read, before the shuffle, we can trim the common prefix to reduce the shuffle size.
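
A rough sketch of the trimming step in plain Python, just to illustrate the idea. The helper name `split_common_prefix` is hypothetical and not part of this project; a real version would need to run where the file list is built, before the shuffle. It assumes the prefix is kept once on the side so full paths can be rebuilt later.

```python
import os


def split_common_prefix(paths):
    """Return (prefix, trimmed_paths), with the prefix ending at a '/' boundary.

    Hypothetical helper for illustration only.
    """
    if not paths:
        return "", []
    # os.path.commonprefix compares character by character, so it can stop
    # in the middle of a path component (e.g. ".../part-000").
    raw = os.path.commonprefix(paths)
    # Cut back to the last '/' so the prefix covers whole directories only.
    # If there is no '/', rfind returns -1 and the prefix becomes empty.
    prefix = raw[: raw.rfind("/") + 1]
    return prefix, [p[len(prefix):] for p in paths]


paths = [
    "hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet",
    "hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet",
]
prefix, trimmed = split_common_prefix(paths)

# The full paths can be restored by re-attaching the prefix.
restored = [prefix + t for t in trimmed]
```

Only the short `trimmed` values would need to go through the shuffle; `prefix` is a single string that can be carried separately.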