xhochy / fletcher

Pandas ExtensionDType/Array backed by Apache Arrow

Home Page: https://fletcher.readthedocs.io/


Test integration with dask.dataframe

xhochy opened this issue · comments

dask.dataframe should also be able to handle fletcher columns and accessors. Thus we should at least have tests that confirm:

  • dask.dataframe can have fletcher.Fletcher{Chunked,Continuous}Array columns
  • The fr_text accessor is working with dask.dataframe

@TomAugspurger is this possible today?

For context, as a side project today I'm looking at text handling in dask.dataframe. It seems to be a common concern in benchmarks, particularly due to memory blow-up.

Yes, since Thursday this is working on master: #147

@mrocklin This has already been working for a year, see the dask blog https://blog.dask.org/2019/01/22/dask-extension-arrays 😃

Not sure why cyberpandas hasn't merged it yet: ContinuumIO/cyberpandas#39

What is missing from the fletcher<->dask support is the fr_text accessor. If you want to play with it, I can quickly implement it; otherwise I would take a stab at it once I have tackled #115.

The project here isn't yet fully functional, but it shows what is available in Arrow & pandas to support native string arrays. It was dormant for ~6 months as other things had higher priority, but we're now continuing in Arrow to build string kernels and hope to ship a lot of them in 1.0 / 1.1 within the next 2-3 months, making the setup here usable. If there is specific functionality you're looking for, just give us a heads-up and we can implement it first.

If you want to create such a column from a Parquet file without going through object dtype, check out the types_mapper argument of pyarrow.Table.to_pandas. This also works for other ExtensionArrays, not only fletcher, and can save quite some overhead / GIL contention.

import fletcher as fr
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'str': ['a', 'b', 'c']})
df.to_parquet("test.parquet")
table = pq.read_table("test.parquet")

table.to_pandas().info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 1 columns):
#  #   Column  Non-Null Count  Dtype 
# ---  ------  --------------  ----- 
#  0   str     3 non-null      object
# dtypes: object(1)
# memory usage: 152.0+ bytes

table.to_pandas(
    types_mapper={pa.string(): fr.FletcherChunkedDtype(pa.string())}.get
).info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 1 columns):
#  #   Column  Non-Null Count  Dtype                   
# ---  ------  --------------  -----                   
#  0   str     3 non-null      fletcher_chunked[string]
# dtypes: fletcher_chunked[string](1)
# memory usage: 147.0 bytes