manojkarthick / pqrs

Command line tool for inspecting Parquet files


`merge` uses a lot of memory

FauxFaux opened this issue · comments

Feature request!

Is it possible for merge to merge files without decompressing and recompressing them?


My usecase:

My parquet generator builds 1GB row groups in memory and writes each one to its own parquet file: <40MB on disc, one row group per file. (It does this so it doesn't have to deal with schema variations, a separate problem.)

I'd like to concatenate these files: take the row group out of every file that shares the same schema, and produce one big file with multiple row groups and exactly that schema.

The current merge implementation can do this, but it needs >>200GB of memory to merge 8GB of parquet files, which is not ideal.

Hi @FauxFaux - unfortunately, the current merge command implementation is very naive (and inefficient, as you noted) and is not intended for production use.

pqrs uses Apache Arrow for reading and writing parquet files, and Arrow operates on record batches rather than parquet row groups. From my understanding of the Arrow file API, it is not possible to operate on a file without reading it entirely into memory.

When reading record batches via the Arrow reader, a chunk size of 2048 rows is currently used, which might be far too small for your use case (and which I suspect is one of the reasons for the very high memory usage). I can try to make that configurable so you can experiment with varying chunk sizes, but I am not sure how much performance gain that would yield.

The best way to merge these parquet files would be with a tool like Spark, Hadoop, or Impala.

Don't worry about it! As you noted, I will probably need a custom solution anyway.

I also can't see anything in the Arrow API for accessing row groups directly, but I wondered if you knew something more, or would consider this feature request in the future; it doesn't seem like a crazy request to me.

Thanks! I will definitely look into improving the merge command if/when there's a newer API upstream in Arrow.