apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine

Home Page: https://datafusion.apache.org/ballista


Could not create or read partition table

smallzhongfeng opened this issue

Describe the bug
After a partitioned table is created, it cannot be read normally.

To Reproduce

mkdir -p tmp/year=2022 tmp/year=2021
echo "1,2" > tmp/year=2022/data.csv
echo "3,4" > tmp/year=2021/data.csv

Then run in ballista-cli:


❯ CREATE EXTERNAL TABLE t2 (a INT, b INT) STORED AS CSV PARTITIONED BY (year) LOCATION 'tmp';
ArrowError(SchemaError("Unable to get field named \"year\". Valid fields: [\"a\", \"b\"]"))

I deployed it in standalone mode.
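A possible workaround to try (this is an assumption based on the error message, not something confirmed in this thread): the planner appears to look for the partition column among the declared fields, so declaring year in the column list as well might get past the schema check:

CREATE EXTERNAL TABLE t2 (a INT, b INT, year INT) STORED AS CSV PARTITIONED BY (year) LOCATION 'tmp';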

Expected behavior
The table should be created successfully, and queries against it should return the data with year exposed as a partition column.

Additional context

[screenshot of the error output]
I deployed the latest released version, and the client is also on the latest version, 0.11.0.

@thinkharderdev @yahoNanJing @Dandandan Have you encountered similar problems? Could you give me some advice?

A similar issue: #747

use ballista::prelude::{BallistaConfig, BallistaContext, Result};
use datafusion::arrow::datatypes::DataType;
use datafusion::datasource::file_format::parquet::DEFAULT_PARQUET_EXTENSION;
use datafusion::prelude::ParquetReadOptions;

#[tokio::main]
async fn main() -> Result<()> {
    let config = BallistaConfig::builder()
        .set("ballista.shuffle.partitions", "1")
        .build()?;

    // Standalone Ballista cluster with two concurrent tasks.
    let ctx = BallistaContext::standalone(&config, 2).await?;

    // Declare "date" as a partition column; its values come from the
    // directory names (date=.../), not from the Parquet files themselves.
    let options = ParquetReadOptions {
        file_extension: DEFAULT_PARQUET_EXTENSION,
        table_partition_cols: vec![("date".to_string(), DataType::Utf8)],
        parquet_pruning: Some(false),
        skip_metadata: Some(true),
    };

    let df = ctx.read_parquet("tmp", options).await?;
    println!("{}", df.schema());

    // Selecting a data column together with the partition column.
    df.clone().select_columns(&["String", "date"]).unwrap();
    df.show().await?;
    Ok(())
}

This case also fails, so are partitioned tables currently not supported?
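For comparison, here is a minimal sketch of the same partitioned Parquet read against a plain DataFusion SessionContext (assuming the same tmp/date=... layout and a DataFusion version where table_partition_cols takes (name, type) pairs). If this works while the Ballista version above fails, the problem is in Ballista's handling of partition columns rather than in DataFusion itself:

use datafusion::arrow::datatypes::DataType;
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Same partition column declaration as in the Ballista example above.
    let options = ParquetReadOptions {
        table_partition_cols: vec![("date".to_string(), DataType::Utf8)],
        ..Default::default()
    };

    let df = ctx.read_parquet("tmp", options).await?;
    println!("{}", df.schema());
    df.show().await?;
    Ok(())
}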

Hi @smallzhongfeng, I'll take a look at this issue this week.

Thank you for your reply, @yahoNanJing. My current guess is that the partition column is treated as an ordinary column, which causes an error when the schema is matched.
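To illustrate that guess (a sketch only, not Ballista's actual code path): the data files contain only the data columns, so the table schema has to be built by appending the declared partition columns to the file schema rather than by looking them up in it, roughly like this:

use datafusion::arrow::datatypes::{DataType, Field, Schema};

// Sketch only: build the table schema from the file schema plus the declared
// partition columns. If the partition columns are instead looked up in the
// file schema, the lookup fails with exactly the kind of error reported above.
fn table_schema(file_schema: &Schema, partition_cols: &[(String, DataType)]) -> Schema {
    let mut fields: Vec<Field> = file_schema
        .fields()
        .iter()
        .map(|f| Field::new(f.name().as_str(), f.data_type().clone(), f.is_nullable()))
        .collect();
    for (name, data_type) in partition_cols {
        // Partition values come from the directory names, e.g. year=2022.
        fields.push(Field::new(name.as_str(), data_type.clone(), false));
    }
    Schema::new(fields)
}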

Any update?

It looks like the partition directories are ignored and the files inside are not loaded. Is there any update on how to deal with this?

Any update?