🚀 String Super-Issue

xhochy opened this issue 4 years ago

We want to support strings (UTF-8 encoded) as fast as possible inside of pandas. To get there, we need to implement a number of string methods. The work will be split across many issues that are hard to track with the issue search alone, so we list them all here.

We try to add the functionality in three stages:

1. Implement the functionality using plain Python operations. This will run at the same speed as pandas.StringDtype but already provides the API to fletcher users, so we can add faster implementations bit by bit while offering a fully usable library.
   a) Also ensure that we have benchmarks set up to compare the pandas/object implementation to ours.
2. If the algorithm isn't too complicated, we try to write an efficient implementation with numba. This gives us a fast algorithm with less implementation overhead than adding it to Apache Arrow.
3. For all methods, add an efficient implementation to Apache Arrow if there is none yet.
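The first stage and its benchmark requirement can be illustrated with a minimal sketch. It is not fletcher's actual code: `apply_str_method` and `benchmark` are hypothetical helper names, a plain Python list stands in for the column, and `None` stands in for a null value. The idea is only that each method is applied element by element with the built-in `str` type while nulls are passed through, and that a timing helper exists to compare implementations later.

```python
import timeit
from typing import Any, Callable, List, Optional


def apply_str_method(values: List[Optional[str]], method: str, *args: Any) -> List[Any]:
    """Apply a built-in ``str`` method to every element (hypothetical helper).

    ``None`` is treated as a null and propagated unchanged.
    """
    func = getattr(str, method)  # e.g. str.lower, str.zfill
    return [None if v is None else func(v, *args) for v in values]


def benchmark(func: Callable[..., Any], *args: Any, repeat: int = 3, number: int = 10) -> float:
    """Return the best observed time per call in seconds (hypothetical helper)."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    data = ["abc", None, "ÄÖ"]
    print(apply_str_method(data, "lower"))      # ['abc', None, 'äö']
    print(apply_str_method(data, "zfill", 5))   # ['00abc', None, '000ÄÖ']

    # Time the naive implementation on a larger input; a faster
    # implementation would be measured the same way for comparison.
    large = data * 10_000
    print(benchmark(apply_str_method, large, "lower"))
```

A numba or Arrow implementation added in a later stage would need to return the same results as this baseline, which is what makes it useful as both a reference and a benchmark target.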

function	meta issue	naïve implementation	`numba` implementation	`pyarrow` implementation
`capitalize`	#124	#200	…	…
`casefold`	#125	#200	…	…
`cat`	#126	#200	…	…
`center`	#122	#200	…	…
`contains` (exact match) ✅	#123	#140	#141	ARROW-9160 / #151
`contains` (other)	#123	#200	…	…
`count`	#127	#200	…	…
`decode`	…	…	…	…
`encode`	…	…	…	…
`endswith`	#130	-	#131	…
`extract`	#137	#200	…	…
`extractall`	…	#200	…	…
`find`	…	#200	…	…
`findall`	…	#200	…	…
`get`	…	#200	…	…
`index`	…	#200	…	…
`join`	…	…	…	…
`len`	…	#200	…	…
`ljust`	…	#200	…	…
`lower`	#135	#200	…	ARROW-9133
`lstrip`	…	#200	…	…
`match`	…	#200	…	…
`normalize`	…	#200	…	…
`pad`	…	#200	…	…
`partition`	…	#200	…	…
`repeat`	…	#200	…	…
`replace`	#133	#200	…	…
`rfind`	…	#200	…	…
`rindex`	…	#200	…	…
`rjust`	…	#200	…	…
`rpartition`	…	#200	…	…
`rstrip`	…	#200	…	…
`slice`	#114	#200	…	…
`slice_replace`	…	#200	…	…
`split`	…	#200	…	…
`rsplit`	…	#200	…	…
`startswith`	#132	-	#131	…
`strip`	#136	-	#160	…
`swapcase`	…	#200	…	…
`title`	…	#200	…	…
`translate`	…	#200	…	…
`upper`	…	#200	…	ARROW-9133
`wrap`	…	#200	…	…
`zfill`	#134	#139	…	…
`isalnum` ✅	…	#200	…	ARROW-9268 / #203
`isalpha` ✅	…	#200	…	ARROW-9268 / #203
`isdigit` ✅	…	#200	…	ARROW-9268 / #203
`isspace` ✅	…	#200	…	ARROW-9268 / #203
`islower` ✅	…	#200	…	ARROW-9268 (apache/arrow#7656) / #203
`isupper` ✅	…	#200	…	ARROW-9268 / #203
`istitle` ✅	…	#200	…	ARROW-9268 / #203
`isnumeric` ✅	…	#200	…	ARROW-9268 / #203
`isdecimal` ✅	…	#200	…	ARROW-9268 / #203
`get_dummies`	…	#200	…	…

Uwe L. Korn · Answer 1 · Wed Feb 22 2023 23:15:02 GMT+0800 (China Standard Time)

This project has been archived, as development ceased around 2021.
With the support for Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled.