jcrobak / parquet-python

python implementation of the parquet columnar file format.

Consider major rewrite

jcrobak opened this issue

@martindurant has put together a shiny new implementation that improves performance, adds interop with dataframe libraries, and adds write support. See dask#3

The major changes are new interfaces and dependencies on several new packages (numpy, pandas, numba, dask). I'd love feedback from folks using parquet-python on how invasive those changes would be, especially given the historical problems installing some of those libraries.

Please let me know what you think. Some folks who have contributed and may have an opinion include @SergeNov @turicas @spaztic1215, but anyone is welcome to chime in!

The interoperability with dataframes and the potential efficiencies from dask's task scheduling are exciting, but unfortunately for Hue, we would have to omit the dependencies on numba and possibly dask due to licensing constraints. It would be nice to make these optional dependencies if these changes are to be pulled into parquet-python.

My guess is that Dask is optional but that Numba is not.

Out of curiosity, why is Numba a problem but not NumPy or Pandas (which have the same license)? Is there another constraint, other than the choice of license, that is active here?

BSD and MIT licenses are generally fine, but we'd have to check on some of numba's dependencies for Python 2 (Hue still has to support Python 2.6 for now) like funcsigs.

I have made no particular effort yet to make my code compatible with Python 2 while I am still developing core functionality, but I don't suppose it would be too onerous.

Interoperability with dataframes and other structures is very important, but I think it should not be mandatory, since there are many use cases where installing all those libraries would be overkill. For example: what if I just want to extract data from a parquet file and convert it to a CSV?
If the entire architecture is well documented and modular, I think we could have some extra features available if these libraries are installed, but the bare minimum to read/write parquet files should work without them.
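To make the suggestion above concrete, here is a minimal sketch of the optional-dependency pattern, assuming rows arrive as plain dicts. The names `rows_to_csv` and `rows_to_dataframe` are hypothetical and not part of parquet-python or the new implementation; the CSV path uses only the standard library, while the dataframe path needs pandas only when it is actually called.

```python
import csv
import io

try:
    import pandas as pd  # optional: only needed for DataFrame output
except ImportError:
    pd = None


def rows_to_csv(rows, out):
    """Write an iterable of dict rows to a file-like object as CSV.

    Hypothetical helper: uses only the standard library, so it works
    without numpy/pandas installed. Returns the number of rows written.
    """
    rows = iter(rows)
    first = next(rows, None)
    if first is None:
        return 0
    # Column order is taken from the first row's keys.
    writer = csv.DictWriter(out, fieldnames=list(first))
    writer.writeheader()
    writer.writerow(first)
    count = 1
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


def rows_to_dataframe(rows):
    """Build a pandas DataFrame from dict rows, if pandas is available.

    Hypothetical helper: fails with a clear message instead of breaking
    the whole package at import time when pandas is missing.
    """
    if pd is None:
        raise ImportError("pandas is required for DataFrame output")
    return pd.DataFrame(list(rows))
```

With this layout, the CSV use case needs nothing beyond the standard library, and the error for a missing extra only surfaces when the dataframe feature is requested.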

Please don't rewrite. This is the only library written with understandable code, unlike parquet-mr and parquet-cpp, which would be easier to rewrite than to read.

@aloneguid, I agree that this library is nicely written and blissfully few lines of code. I have attempted to make my version, which was forked from here, respect this style, and I believe (although this is subjective) that the result is very hackable.

To everyone: we have announced beta status here, and the GitHub repo is now here, with docs on RTD.