Test integration with dask.dataframe
xhochy opened this issue
`dask.dataframe` should also be able to handle `fletcher` columns and accessors. Thus we should have at least tests that confirm:
- `dask.dataframe` can have `fletcher.Fletcher{Chunked,Continuous}Array` columns
- The `fr_text` accessor is working with `dask.dataframe`
@TomAugspurger is this possible today?
For context, as a side project today I'm looking at text handling in dask dataframe. It seems to be a common concern in benchmarks, particularly due to memory-blowup.
Yes, since Thursday this is working on master: #147
@mrocklin This has already been working for a year, see the dask blog: https://blog.dask.org/2019/01/22/dask-extension-arrays
Not sure why cyberpandas hasn't merged it yet: ContinuumIO/cyberpandas#39
What is missing from the fletcher<->dask support is the `fr_text` accessor. If you want to play with it, I can quickly implement it; otherwise I will take a stab at it once I have tackled #115.
The project here isn't yet fully functional but shows what is there in Arrow & pandas to support native string arrays. It was dormant for ~6 months as other things had a higher priority, but we're now continuing in Arrow to build string kernels and will hopefully ship a lot of them in 1.0 / 1.1 in the next 2-3 months, making this setup here usable. If you have specific functionality you're looking for, just give us a heads-up and we can implement it first.
If you want to create such a column from a Parquet file without going through the `object` dtype, check out the `types_mapper` argument of `pyarrow.Table.to_pandas`. This also works for other `ExtensionArray` types, not only `fletcher`. It can also save quite some overhead / GIL contention.
```python
import fletcher as fr
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small string column to Parquet and read it back as an Arrow table.
df = pd.DataFrame({'str': ['a', 'b', 'c']})
df.to_parquet("test.parquet")
table = pq.read_table("test.parquet")

# Default conversion: the string column ends up with the object dtype.
table.to_pandas().info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 1 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   str     3 non-null      object
# dtypes: object(1)
# memory usage: 152.0+ bytes

# With types_mapper: Arrow strings are mapped directly to the fletcher dtype.
table.to_pandas(types_mapper={pa.string(): fr.FletcherChunkedDtype(pa.string())}.get).info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 1 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   str     3 non-null      fletcher_chunked[string]
# dtypes: fletcher_chunked[string](1)
# memory usage: 147.0 bytes
```