Using .str functions

Question

birdsarah opened this issue 5 years ago · comments

I have tried, perhaps incorrectly, to convert my column to pyarrow string type as follows:

import fletcher as fr
import pyarrow as pa

fletcher_string_dtype = fr.FletcherDtype(pa.string())
df['string_col'] = df.string_col.astype(fletcher_string_dtype)

But now I can't do string functions on it because I get the error message AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Specifically, I'm trying to do .str.contains()

I may be casting the column incorrectly. It may be that there's no value in using fletcher for this.

I saw in your talk, groupby was a nice use case. Related to this question is what are the best use cases for this dtype - just a link to some additional reading material would be great.

Tom Augspurger · Answer 1 · Mon Apr 22 2019 22:41:44 GMT+0800 (China Standard Time)

I think fletcher uses the .text accessor, instead of .str.

Uwe L. Korn · Answer 2 · Mon Apr 22 2019 23:10:57 GMT+0800 (China Standard Time)

.str is an accessor only for object-dtype columns that hold str/unicode values. Its methods are not suitable for fletcher columns. Thus we will use .text and implement all methods of .str there, but with support for fletcher / Arrow data.

Dave Hirschfeld · Answer 3 · Tue Apr 23 2019 04:48:20 GMT+0800 (China Standard Time)

It seems a bit awkward to provide the same functionality of .str under a different name.

Sure, the current implementation is for Python string objects but can't you just override that for a fletcher/arrow dtype so that the same functionality is provided at the same location/name?

I'm interested as I'm using distributed and arrow for some ETL jobs and I'd like to be able to do some basic transforms without having to convert back and forth between pandas and arrow all the time.

Obviously, it's much easier if the same transform code will work for both pandas and fletcher. If OTOH fletcher provides the functionality of .str under .text I'll need to have 2 separate implementations to compare them.

Sarah Bird · Answer 4 · Tue Apr 23 2019 08:06:37 GMT+0800 (China Standard Time)

Just to clarify though. None of these are actually implemented currently, correct? I get "Series object has no attribute 'text'"

Uwe L. Korn · Answer 5 · Tue Apr 23 2019 18:01:08 GMT+0800 (China Standard Time)

@dhirschfeld We can extend .text to use fletcher's methods on a FletcherArray and fall back to pandas' when the column is not fletcher-based. @TomAugspurger Or do you know how to extend .str in fletcher so that it can also handle FletcherArray?

@birdsarah We have only a very small set of methods implemented at the moment in https://github.com/xhochy/fletcher/blob/master/fletcher/string_array.py. After an initial import from this module, you should be able to use .text on a Series; the import is quite important though.
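For readers wondering why the import matters: pandas accessors are attached to Series when the registering code runs, which normally happens at import time. The following is a minimal sketch of that mechanism, not fletcher's actual implementation. The accessor name `text_demo` and the class `TextDemoAccessor` are hypothetical; it simply falls back to pandas' own `.str` methods, as suggested above for columns that are not fletcher-based.

```python
import pandas as pd


# Hypothetical accessor name for illustration; fletcher's real accessor is `.text`.
@pd.api.extensions.register_series_accessor("text_demo")
class TextDemoAccessor:
    def __init__(self, series):
        self._series = series

    def contains(self, pat):
        # Fall back to pandas' own implementation for ordinary columns.
        # A real extension would dispatch on the column's dtype here.
        return self._series.str.contains(pat, regex=False)


s = pd.Series(["apple", "banana", "cherry"])
result = s.text_demo.contains("an")
print(result.tolist())
```

Until the decorated class has been defined (that is, until the module containing it has been imported), `s.text_demo` raises an AttributeError, which matches the "Series object has no attribute 'text'" message reported earlier in the thread.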

Tom Augspurger · Answer 6 · Tue Apr 23 2019 20:17:50 GMT+0800 (China Standard Time)

Pandas doesn’t give fletcher any way to use .str. I don’t think we should since I’m interested in properly supporting strings in pandas sometime this year.

Dave Hirschfeld · Answer 7 · Wed Apr 24 2019 05:07:56 GMT+0800 (China Standard Time)

Tom Augspurger wrote: "Pandas doesn’t give fletcher any way to use .str. I don’t think we should since I’m interested in properly supporting strings in pandas sometime this year."

Unless you're suggesting that the pandas default implementation will work directly with arrow data (in fletcher arrays) I'd disagree with this position - I don't want to be forced to coerce my arrow data to pandas to do basic manipulations and I also don't want the maintenance burden of 2 separate implementations.

I think pandas should make both .str and .dt available to be overridden by different (extension) dtypes with implementations that work / make sense / are performant for that data type.

The concept is similar to numpy's __array_function__ protocol whereby different array implementations can override the default numpy implementation thereby allowing users to write generic code that works for numpy arrays, cupy arrays, sparse arrays, etc...

I'd like my transform functions to work seamlessly with either python/pandas strings or with arrow/fletcher strings. Of course, I don't know if this may be an unreasonable hope given technical constraints but I think it's something worth striving for with the benefits similar to that provided by numpy's NEP-18.
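To make the NEP-18 comparison concrete, here is a small sketch of how `__array_function__` lets a non-NumPy container supply its own implementation of a NumPy function. The `WrappedArray` class and the `total` helper are hypothetical names used only for illustration, and the example assumes NumPy 1.17 or later, where the protocol is enabled by default.

```python
import numpy as np


class WrappedArray:
    """Toy container (hypothetical) that overrides np.sum via NEP-18."""

    def __init__(self, values):
        self.values = list(values)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.sum:
            # Implementation chosen by the container, not by NumPy.
            return sum(self.values)
        # Any other NumPy function is reported as unsupported.
        return NotImplemented


def total(arr):
    # Generic code: the same call works for ndarrays and for WrappedArray.
    return np.sum(arr)


plain_total = total(np.array([1, 2, 3]))
wrapped_total = total(WrappedArray([1, 2, 3]))
print(plain_total, wrapped_total)
```

The request in this comment is for an analogous hook for `.str` and `.dt`, so that one transform function could run against either storage type.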

Tom Augspurger · Answer 8 · Wed Apr 24 2019 05:55:10 GMT+0800 (China Standard Time)

On Apr 23, 2019, at 16:07, Dave Hirschfeld wrote: "Unless you're suggesting that the pandas default implementation will work directly with arrow data (in fletcher arrays)"

That’s what I’m suggesting.

Uwe L. Korn · Answer 9 · Wed Feb 22 2023 23:14:36 GMT+0800 (China Standard Time)

This project has been archived as development has ceased around 2021.
With the support of Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled.
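As a rough illustration of what that looks like today, the original `.str.contains()` question can be handled with pandas' own Arrow-backed string dtype. This sketch assumes a reasonably recent pandas with pyarrow installed; if pyarrow is unavailable it falls back to the default `"string"` storage, which is an assumption made only so the example still runs.

```python
import pandas as pd

values = ["apple", "banana", "cherry"]

try:
    # Arrow-backed string column; requires pyarrow to be installed.
    s = pd.Series(values, dtype="string[pyarrow]")
except ImportError:
    # Assumption for this sketch: use the default string storage instead.
    s = pd.Series(values, dtype="string")

# The regular .str accessor works directly on the extension dtype.
mask = s.str.contains("an", regex=False)
print(mask.tolist())
```

No separate `.text` accessor is needed in this setup, since `.str` dispatches to the extension array.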