Trim common hdfs prefix in index DF

Question

shay1bz opened this issue 3 years ago

Try to recognize common path prefixes at runtime and trim them.
For example, files in a standard table might look like:

hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet
...

On read, before the shuffle, we can trim the common prefix to reduce the shuffle size.
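
A minimal sketch of the trimming step in plain Python, not the project's actual implementation. The function names `common_dir_prefix` and `trim_paths` are hypothetical. It assumes the full list of file paths is available in one place (for example on the driver) before the shuffle, and it cuts the prefix back to the last `/` so that only whole path components are removed:

```python
import posixpath


def common_dir_prefix(paths):
    """Longest common prefix of `paths` that ends on a '/' boundary.

    Hypothetical helper for illustration only.
    """
    if not paths:
        return ""
    # commonprefix compares character by character, so it may stop in the
    # middle of a path component (e.g. ".../part-000").
    prefix = posixpath.commonprefix(list(paths))
    # Cut back to the last '/' so only whole components are trimmed.
    return prefix[: prefix.rfind("/") + 1]


def trim_paths(paths):
    """Return (prefix, trimmed_paths); prefix + trimmed restores each path."""
    prefix = common_dir_prefix(paths)
    return prefix, [p[len(prefix):] for p in paths]


paths = [
    "hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet",
    "hdfs://my_cluster/foo/bar/my_table/dt=2020-01-02/part-0000.parquet",
]
prefix, trimmed = trim_paths(paths)
# The prefix is kept once; only the short suffixes need to be shuffled.
restored = [prefix + t for t in trimmed]
```

Because the prefix is stored once and each row carries only its suffix, the original path can be rebuilt by concatenation whenever it is needed.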