xhochy / fletcher

Pandas ExtensionDType/Array backed by Apache Arrow

Home Page: https://fletcher.readthedocs.io/

🚀 String Super-Issue

xhochy opened this issue

We want to support strings (UTF-8 encoded) as fast as possible inside of pandas. To get there we need to implement several things. The work will be split across many issues, which are hard to track with the issue search alone, so we list them all here.

We try to add the functionality in three stages:

  1. Implement the functionality using plain Python operations. This will run at the same speed as pandas.StringDtype but already provides the API to fletcher users. It lets us add faster implementations bit by bit while already providing a fully usable library.
    a) Also ensure that we have benchmarks set up to compare the pandas/object implementation to ours.
  2. If the algorithm isn't too complicated, we try to make an efficient implementation with numba. This lets us provide a fast algorithm with less implementation overhead than adding it to Apache Arrow.
  3. For all methods, add an efficient implementation to Apache Arrow if there is none yet.
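
To make the difference between the first two stages concrete, here is a minimal sketch using only the Python standard library. It is not fletcher's code: `to_buffers`, `startswith_naive` and `startswith_bytes` are hypothetical names, and the `(validity, offsets, data)` tuple is only a simplified stand-in for how Arrow stores a string column. The second function is left uncompiled; it merely has the shape of loop that numba could be asked to compile.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical stand-in for an Arrow string array: a validity list, an
# offsets list with len(values) + 1 entries, and one contiguous UTF-8 buffer.
StringBuffers = Tuple[List[bool], List[int], bytes]


def to_buffers(values: Sequence[Optional[str]]) -> StringBuffers:
    """Pack Python strings into (validity, offsets, data), Arrow-style."""
    validity, offsets, chunks = [], [0], []
    for value in values:
        encoded = b"" if value is None else value.encode("utf-8")
        validity.append(value is not None)
        chunks.append(encoded)
        offsets.append(offsets[-1] + len(encoded))
    return validity, offsets, b"".join(chunks)


def startswith_naive(buffers: StringBuffers, prefix: str) -> List[Optional[bool]]:
    """Stage 1: decode every element to a Python str and call str.startswith."""
    validity, offsets, data = buffers
    result: List[Optional[bool]] = []
    for i, valid in enumerate(validity):
        if not valid:
            result.append(None)  # missing values stay missing
            continue
        text = data[offsets[i]:offsets[i + 1]].decode("utf-8")
        result.append(text.startswith(prefix))
    return result


def startswith_bytes(buffers: StringBuffers, prefix: str) -> List[Optional[bool]]:
    """Stage 2 shape: compare raw bytes in place, creating no per-element str."""
    validity, offsets, data = buffers
    needle = prefix.encode("utf-8")
    result: List[Optional[bool]] = []
    for i, valid in enumerate(validity):
        if not valid:
            result.append(None)
            continue
        start, end = offsets[i], offsets[i + 1]
        # An element shorter than the prefix can never match.
        matches = end - start >= len(needle)
        j = 0
        while matches and j < len(needle):
            matches = data[start + j] == needle[j]
            j += 1
        result.append(matches)
    return result


example = to_buffers(["arrow", "pandas", None, "ärger", ""])
naive = startswith_naive(example, "a")
fast = startswith_bytes(example, "a")
print(naive)
print(fast)
```

Both functions return the same answers. The byte-level version works because a string starts with a prefix exactly when its UTF-8 bytes start with the prefix's UTF-8 bytes, so no decoding is needed. Timing the two against each other, for example with `timeit`, is the kind of benchmark item 1a asks for.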

| function | meta issue | naïve implementation | numba implementation | pyarrow implementation |
| --- | --- | --- | --- | --- |
| capitalize | #124 | #200 | … | … |
| casefold | #125 | #200 | … | … |
| cat | #126 | #200 | … | … |
| center | #122 | #200 | … | … |
| contains (exact match) ✅ | #123 | #140 | #141 | ARROW-9160 / #151 |
| contains (other) | #123 | #200 | … | … |
| count | #127 | #200 | … | … |
| decode | … | … | … | |
| encode | … | … | … | |
| endswith | #130 | - | #131 | … |
| extract | #137 | #200 | … | … |
| extractall | … | #200 | … | |
| find | … | #200 | … | |
| findall | … | #200 | … | |
| get | … | #200 | … | |
| index | … | #200 | … | |
| join | … | … | … | |
| len | … | #200 | … | |
| ljust | … | #200 | … | |
| lower | #135 | #200 | … | ARROW-9133 |
| lstrip | … | #200 | … | |
| match | … | #200 | … | |
| normalize | … | #200 | … | |
| pad | … | #200 | … | |
| partition | … | #200 | … | |
| repeat | … | #200 | … | |
| replace | #133 | #200 | … | … |
| rfind | … | #200 | … | … |
| rindex | … | #200 | … | … |
| rjust | … | #200 | … | … |
| rpartition | … | #200 | … | … |
| rstrip | … | #200 | … | … |
| slice | #114 | #200 | … | … |
| slice_replace | … | #200 | … | … |
| split | … | #200 | … | … |
| rsplit | … | #200 | … | … |
| startswith | #132 | - | #131 | … |
| strip | #136 | - | #160 | … |
| swapcase | … | #200 | … | … |
| title | … | #200 | … | … |
| translate | … | #200 | … | … |
| upper | … | #200 | … | ARROW-9133 |
| wrap | … | #200 | … | … |
| zfill | #134 | #139 | … | … |
| isalnum ✅ | … | #200 | … | ARROW-9268 / #203 |
| isalpha ✅ | … | #200 | … | ARROW-9268 / #203 |
| isdigit ✅ | … | #200 | … | ARROW-9268 / #203 |
| isspace ✅ | … | #200 | … | ARROW-9268 / #203 |
| islower ✅ | … | #200 | … | ARROW-9268 (apache/arrow#7656) / #203 |
| isupper ✅ | … | #200 | … | ARROW-9268 / #203 |
| istitle ✅ | … | #200 | … | ARROW-9268 / #203 |
| isnumeric ✅ | … | #200 | … | ARROW-9268 / #203 |
| isdecimal ✅ | … | #200 | … | ARROW-9268 / #203 |
| get_dummies | … | #200 | … | … |

This project has been archived, as development ceased around 2021.
With support for Apache Arrow-backed extension arrays now in pandas itself, the major goal of this project has been fulfilled.