jcrobak / parquet-python

python implementation of the parquet columnar file format.

Example Code

pjakobsen opened this issue

It would be great to have some example code showing how to use this interesting library. Trying to tease it out from the test cases has proven unsuccessful so far.

Hello, @pjakobsen! I'm having the same problem here: I'm implementing the Parquet plugin for rows and had to read parquet-python's source code to work out how to use it, which is a difficult, non-pythonic way to learn an API. So I've created a little helper function that may also help you:

from collections import namedtuple
import parquet

OPTIONS = namedtuple('Options', ['col', 'format'])(col=None, format='custom')

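def import_data(filename):
    # The callback just hands back whatever parquet.dump passes to it,
    # unpacked here as the column data and the list of field names.
    data, field_names = parquet.dump(filename, OPTIONS, lambda *args: args)
    # data is indexed by field name; every column has the same length.
    length = len(data[field_names[0]])
    # Build one dict per row by picking the index-th value of each column.
    return [{field_name: data[field_name][index] for field_name in field_names}
            for index in range(length)]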

The function is pretty straightforward to use. For example, this code:

parquet_rows = import_data('test-data/nation.dict.parquet')
for row in parquet_rows:
    print(row)

Will generate the following output (each row is a Python dict):

{'region_key': 0, 'nation_key': 0, 'name': 'ALGERIA', 'comment_col': ' haggle. carefully final deposits detect slyly agai'}
{'region_key': 1, 'nation_key': 1, 'name': 'ARGENTINA', 'comment_col': 'al foxes promise slyly according to the regular accounts. bold requests alon'}
(... 24 more rows ...)
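
The core of the helper is the column-to-row transposition, which can be tried without a Parquet file. The sketch below is only an illustration: `columns_to_rows` and the sample data are hypothetical names and values, and it assumes `data` is a plain dict mapping each field name to a list of values, as the helper above treats it. It uses `zip` instead of indexing, so it stops at the shortest column and returns an empty list when there are no fields.

```python
def columns_to_rows(data, field_names):
    """Turn {field: [v0, v1, ...]} into [{field: v0, ...}, {field: v1, ...}]."""
    # Collect the columns in the order given by field_names.
    columns = [data[name] for name in field_names]
    # zip(*columns) yields one tuple of values per row.
    return [dict(zip(field_names, values)) for values in zip(*columns)]


# Hypothetical in-memory column data, shaped like `data` in the helper above.
sample_data = {
    'nation_key': [0, 1],
    'name': ['ALGERIA', 'ARGENTINA'],
    'region_key': [0, 1],
}
sample_fields = ['nation_key', 'name', 'region_key']

rows = columns_to_rows(sample_data, sample_fields)
for row in rows:
    print(row)
```

Each element of `rows` is a dict for one record, which is the same shape `import_data` returns.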

@pjakobsen, you can now use my library rows to read and convert parquet files! :) More information in this blog post.

I took a stab at a more pythonic API in #11. PTAL!