jcrobak / parquet-python

python implementation of the parquet columnar file format.

Writing parquet files

peterbe opened this issue

Hi,
We need to be able to write Python dicts to parquet, i.e. we need a writer class. What are the chances that you'll have time to work on this?

My team is totally new to parquet so we have a lot to learn. We did see #13, which claims to add writer functionality, but that PR is out of sync and tries to solve a couple of other things at the same time.

Would appreciate your thoughts on this project's near future.

cc @adngdb

If you wish to write dicts, as opposed to tabular data, you may be better off looking at avro. There are working Python libraries for it: avro (official, slow), fastavro and cyavro.

My stats team say they want it stored in parquet (in S3). I have many individual big dicts that I want to store. Most of them are 1-level dicts, so it's quite tabular. All of it needs to happen from CPython, not a JVM.
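To illustrate why 1-level dicts count as tabular: each key can be treated as a column and each dict as a row. Below is a minimal sketch using only the standard library. It is not part of parquet-python or any parquet writer; the function name `dicts_to_columns`, the `missing` parameter and the sample records are all made up for illustration. It only shows the row-to-column pivot that a columnar writer would need as input.

```python
def dicts_to_columns(records, missing=None):
    """Pivot a list of flat dicts into {column name: list of values}.

    Hypothetical helper for illustration only; keys absent from a
    record are filled with `missing`.
    """
    # Collect column names in first-seen order across all records.
    names = []
    seen = set()
    for record in records:
        for key in record:
            if key not in seen:
                seen.add(key)
                names.append(key)

    # Build one list per column, one entry per record.
    columns = {name: [] for name in names}
    for record in records:
        for name in names:
            columns[name].append(record.get(name, missing))
    return columns


# Made-up sample data: three flat dicts, one of them missing a key.
records = [
    {"id": 1, "product": "Firefox", "uptime": 12.5},
    {"id": 2, "product": "Fennec"},
    {"id": 3, "product": "Firefox", "uptime": 3.0},
]
columns = dicts_to_columns(records)
```

Nested dicts would need flattening (or a nested schema) first, which this sketch does not attempt.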

In that case, you have two options: wait for the ongoing work in the Apache Arrow project to enable converting pandas dataframes to parquet (so, presumably, any data structure you can store in a dataframe), or, of course, work on the writer in this project. I personally have no plans to work on it in the near future.

Thanks! I appreciate the update and tips. I'll try to get a handle on the state of Python support inside arrow. I see the code's there, but skimming through it I only see support for reading (and I have no idea how complete that is).