xhochy / fletcher

Pandas ExtensionDType/Array backed by Apache Arrow

Home Page: https://fletcher.readthedocs.io/


Test integration with dask.dataframe

xhochy opened this issue · comments

dask.dataframe should also be able to handle fletcher columns and accessors. Thus we should at least have tests that confirm:

  • dask.dataframe can have fletcher.Fletcher{Chunked,Continuous}Array columns
  • The fr_text accessor is working with dask.dataframe

@TomAugspurger is this possible today?

For context, as a side project today I'm looking at text handling in dask.dataframe. It seems to be a common concern in benchmarks, particularly due to memory blow-up.

Yes, since Thursday this is working on master: #147

@mrocklin This has already been working for a year, see the dask blog https://blog.dask.org/2019/01/22/dask-extension-arrays 😃

Not sure why cyberpandas hasn't merged it yet: ContinuumIO/cyberpandas#39

What is missing from the fletcher<->dask support is the fr_text accessor. If you want to play with it, I can quickly implement it; otherwise I would take a stab at it once I have tackled #115.

The project here isn't yet fully functional, but it shows what is available in Arrow & pandas to support native string arrays. It was dormant for ~6 months as other things had higher priority, but we're now continuing in Arrow to build string kernels and hope to ship a lot of them in 1.0 / 1.1 within the next 2-3 months, making the setup here usable. If there is specific functionality you're looking for, just give us a heads-up and we can implement it first.

If you want to create such a column from a Parquet file without going through object dtype, check out the types_mapper argument of pyarrow.Table.to_pandas. This also works for other ExtensionArrays, not only fletcher, and can save quite some overhead / GIL contention.

import fletcher as fr
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'str': ['a', 'b', 'c']})
df.to_parquet("test.parquet")
table = pq.read_table("test.parquet")

table.to_pandas().info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 1 columns):
#  #   Column  Non-Null Count  Dtype 
# ---  ------  --------------  ----- 
#  0   str     3 non-null      object
# dtypes: object(1)
# memory usage: 152.0+ bytes

table.to_pandas(
    types_mapper={pa.string(): fr.FletcherChunkedDtype(pa.string())}.get
).info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 1 columns):
#  #   Column  Non-Null Count  Dtype                   
# ---  ------  --------------  -----                   
#  0   str     3 non-null      fletcher_chunked[string]
# dtypes: fletcher_chunked[string](1)
# memory usage: 147.0 bytes