Support partition pruning
jychen7 opened this issue · comments
Rich commented
Background
assume we have
table/year=2022/month=03/day=20/log.parquet
table/year=2022/month=03/day=21/log.parquet
consider the query
select count(1) from table where year = '2022' and month = '03' and day = '20'
Actual
The above query will scan all parquet files, instead of only one (necessary)
currently, roapi supports reading all parquet files in a directory
tables:
- name: "blogs"
uri: "table/"
option:
format: "parquet"
use_memory_table: false
schema:
# columns: [] # can ignore if table source support schema infer, e.g. csv, parquet, etc
Expect
The above query will scan only one parquet file
however, since partition column is not in parquet schema, one idea is to improve roapi config to
tables:
- name: "blogs"
uri: "table/"
option:
format: "parquet"
use_memory_table: false
schema:
# columns: [] # can ignore if table source support schema infer, e.g. csv, parquet, etc
partitions:
- name: "year"
data_type: "Utf8"
- name: "month"
data_type: "Utf8"
- name: "day"
data_type: "Utf8"
Reference
- apache/arrow-datafusion#2061 (2022, Datafusion has file re-structure at late 2022, just for reference)
roapi/columnq/src/table/mod.rs
Lines 50 to 54 in 51e01ef
- https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/core/src/datasource/listing/table.rs#L318-L324
QP Hou commented
Definitely a good feature to add
Chitral Verma commented
@jychen7 shouldn't such partition directories be inferred automatically as Spark does instead of manually supplying them?