Feature request: Compression algorithm information

Mickael-van-der-Beek opened this issue a year ago · comments

Mickael van der Beek commented a year ago

Hello Manoj,

Very useful tool you have built!

One feature I would like to suggest is to display which compression algorithm was used on each column.
Currently, it is possible to tell that compression was used from the difference between the "total compressed size" and "total uncompressed size" values, but the actual algorithm used doesn't seem to be displayed.

So the idea would be to show "GZIP", "LZO", "ZSTD", "Brotli", etc. in the schema description table.

SteveLauC · Answer 1 · Thu May 04 2023 20:11:33 GMT+0800 (China Standard Time)

Hi guys, I am interested in implementing this feature. Here is some draft code that prints the compression algorithm used for every column:

use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() {
    let file = File::open("parquet/1.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();

    // The codec is stored per column chunk in each row group's metadata,
    // so no rows need to be read and the file is opened only once.
    let file_meta = reader.metadata();
    for (idx, row_group) in file_meta.row_groups().iter().enumerate() {
        println!("Row Group {}", idx);
        for column in row_group.columns() {
            // column_path() names the leaf column, which also works for nested schemas
            println!("\t{}: {}", column.column_path().string(), column.compression());
        }
    }
}

$ pqrs cat parquet/1.parquet

#######################
File: parquet/1.parquet
#######################

{age: 18, name: "steve", timestamp: 0}

$ cargo r -q
Row Group 0
        age: UNCOMPRESSED
        name: UNCOMPRESSED
        timestamp: UNCOMPRESSED

If a parquet file has a lot of rows spread over many row groups, the output of the above program would be:

Row Group 0:
    xxx: XXX
    xxx: XXX
Row Group 1:
    xxx: XXX
    xxx: XXX
Row Group 2:
    xxx: XXX
    xxx: XXX
...

which is kinda messy, so I am curious what output format would suit this subcommand. Friendly ping @manojkarthick, any idea?
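
One possible way to tame the per-row-group output, as a rough sketch only: list each column once, together with the set of codecs seen for it across all row groups. The `summarize` and `render` functions are hypothetical names, and the plain string pairs are a stand-in for the values that `column.compression()` would yield, not the parquet crate's own types.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Collapse per-row-group (column, codec) pairs into one entry per column.
/// The input is a hypothetical stand-in for parquet metadata.
fn summarize(row_groups: &[Vec<(&str, &str)>]) -> BTreeMap<String, BTreeSet<String>> {
    let mut summary: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for row_group in row_groups {
        for (column, codec) in row_group {
            summary
                .entry(column.to_string())
                .or_default()
                .insert(codec.to_string());
        }
    }
    summary
}

/// Render one line per column, listing every codec seen for it.
fn render(summary: &BTreeMap<String, BTreeSet<String>>) -> Vec<String> {
    summary
        .iter()
        .map(|(column, codecs)| {
            let codecs: Vec<&str> = codecs.iter().map(String::as_str).collect();
            format!("{}: {}", column, codecs.join(", "))
        })
        .collect()
}

fn main() {
    // Two made-up row groups; "name" switches codec in the second one.
    let row_groups = vec![
        vec![("age", "SNAPPY"), ("name", "SNAPPY")],
        vec![("age", "SNAPPY"), ("name", "GZIP")],
    ];
    for line in render(&summarize(&row_groups)) {
        println!("{}", line);
    }
}
```

With this shape the output grows with the number of columns rather than the number of row groups, at the cost of hiding which row group used which codec.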

Manoj Karthick · Answer 2 · Fri May 05 2023 02:19:37 GMT+0800 (China Standard Time)

Thanks for looking into this @SteveLauC - I think it would be best to include the compression algorithm used at a column level in the pqrs schema --detailed command's output.

Sample output of pqrs schema --detailed:

column 0:
--------------------------------------------------------------------------------
column type: INT64
column path: "epochTime"
encodings: PLAIN BIT_PACKED
file path: N/A
file offset: 4
num of values: 1
total compressed size (in bytes): 71
total uncompressed size (in bytes): 69
data page offset: 4
index page offset: N/A
dictionary page offset: N/A
statistics: {min: 1672531499, max: 1672531499, distinct_count: N/A, null_count: 0, min_max_deprecated: false}

I think it would be great to add the compression information as another field in this list alongside the "total compressed size".
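
To illustrate the placement, here is a minimal sketch of the field order with a compression line next to the size fields. `ColumnChunkInfo`, `format_column_chunk`, and the "compression:" label are all hypothetical; this is not the parquet crate's printing code, and the SNAPPY value is made up for the example.

```rust
/// Hypothetical, trimmed-down stand-in for a column chunk's metadata.
struct ColumnChunkInfo {
    column_type: String,
    column_path: String,
    compression: String,
    total_compressed_size: i64,
    total_uncompressed_size: i64,
}

/// Format the fields in the same "label: value" style as the listing above,
/// with the codec placed right before the two size fields.
fn format_column_chunk(info: &ColumnChunkInfo) -> Vec<String> {
    vec![
        format!("column type: {}", info.column_type),
        format!("column path: \"{}\"", info.column_path),
        format!("compression: {}", info.compression),
        format!("total compressed size (in bytes): {}", info.total_compressed_size),
        format!("total uncompressed size (in bytes): {}", info.total_uncompressed_size),
    ]
}

fn main() {
    // Type, path, and sizes mirror the sample output; the codec is illustrative.
    let info = ColumnChunkInfo {
        column_type: "INT64".to_string(),
        column_path: "epochTime".to_string(),
        compression: "SNAPPY".to_string(),
        total_compressed_size: 71,
        total_uncompressed_size: 69,
    };
    for line in format_column_chunk(&info) {
        println!("{}", line);
    }
}
```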

SteveLauC · Answer 3 · Fri May 05 2023 09:34:18 GMT+0800 (China Standard Time)

> I think it would be best to include the compression algorithm used at a column level in the pqrs schema --detailed command's output.

That would be great, then I will work on it:)

Just took a look at the source code of pqrs schema, and it seems that we are using print_column_chunk_metadata() from parquet to print the metadata, so I guess I need to add a patch to parquet first.

SteveLauC · Answer 4 · Fri May 05 2023 13:08:14 GMT+0800 (China Standard Time)

Hi @manojkarthick, would you like to get #41 merged first, so that we don't need to tackle dependency compatibility problems when implementing the PR for this issue? :)

Manoj Karthick · Answer 5 · Sat May 06 2023 02:07:29 GMT+0800 (China Standard Time)

> Hi @manojkarthick, would you like to get #41 merged first, so that we don't need to tackle dependency compatibility problems when implementing the PR for this issue? :)

I've merged #41, let me know if you need anything else (:

SteveLauC · Answer 6 · Tue May 09 2023 21:27:36 GMT+0800 (China Standard Time)

I originally thought parquet#4176 could be included in release 39.0.0, but I was wrong, so we have to wait for the release of 40.0.0:(

Manoj Karthick · Answer 7 · Wed May 10 2023 01:36:09 GMT+0800 (China Standard Time)

> I originally thought parquet#4176 could be included in release 39.0.0, but I was wrong, so we have to wait for the release of 40.0.0:(

@SteveLauC Maybe you want to use the git revision that includes this change in Cargo.toml and raise a PR while we wait for this to get merged? It could be updated to 40.0.0 whenever that gets released.