manojkarthick / pqrs

Command line tool for inspecting Parquet files


`merge` uses a lot of memory

FauxFaux opened this issue · comments

Feature request!

Is it possible for merge to merge files without decompressing and recompressing them?


My usecase:

My parquet generator builds 1GB row groups in memory and writes each one to its own parquet file: <40MB on disc, one row group per file. (It does this so it doesn't have to deal with schema variations, a separate problem.)

I'd like to concatenate these files: take the row group out of every file that shares the same schema, and produce one big file with multiple row groups and exactly that schema.

The current merge implementation can do this, but it needs >>200GB of memory to merge 8GB of parquet files, which is not ideal.

Hi @FauxFaux - unfortunately, the current merge command implementation is very naive (and inefficient, as you noted) and is not intended for production use.

pqrs uses Apache Arrow for reading and writing parquet files, and Arrow operates on record batches rather than parquet row groups. From my understanding of the Arrow file API, it is not possible to operate on a file without reading it entirely into memory.

When reading record batches via the Arrow reader, a chunk size of 2048 rows is currently used, which might be far too small for your use case (and which I suspect is one of the reasons for the very high memory usage). I can try to make that configurable so you can experiment with varying chunk sizes, but I am not sure how much performance gain that would yield.

The best way to merge these parquet files would be with a tool like Spark, Hadoop, or Impala.

Don't worry about it! As you noted, I will probably need a custom solution anyway.

I also can't see anything in the Arrow API for accessing row groups directly, but I wondered if you knew something more, or would consider this feature request in the future; it doesn't seem like a crazy request to me.

Thanks! I will definitely look into improving the merge command if/when there's a newer API upstream in Arrow.