xhochy / fletcher

Pandas ExtensionDType/Array backed by Apache Arrow

Home Page: https://fletcher.readthedocs.io/

Add str.count

xhochy opened this issue

  • ✔️ pandas function
  • ✔️ Python function
  • ❌ C++ STL function
  • ❌ needs a regular expression library
  • ✔️ no need for a Unicode database for capitalization
  • ✔️ can pre-compute output size as return value is a numeric array

Pseudo-Code:

Inputs: pat

output = IntArray(len(rows))
for i, row in enumerate(rows):
    count = 0
    for offset in range(len(row)):
        if pat == row[offset:offset + len(pat)]:
            count += 1
    output[i] = count

The provided pseudo-code produces a different result than the pandas count function when there are overlapping matches. Consider:

>>> s = pd.Series(['aaaaaa'])
>>> s.str.count('aaa')
0    2
dtype: int64

The pseudo-code above would instead find 4 matches, because it also counts overlapping occurrences. Python's built-in str.count gives the same result as pandas:

>>> 'aaaaaa'.count('aaa')
2
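To make the discrepancy concrete, here is a minimal runnable sketch of the sliding comparison that the original pseudo-code intends, checking every offset against the pattern. The function name `count_overlapping` is made up for illustration and is not part of fletcher or pandas, and the empty-pattern guard is an assumption of this sketch.

```python
def count_overlapping(row: str, pat: str) -> int:
    """Count every offset at which pat matches, overlapping matches included."""
    if not pat:
        # Assumption of this sketch: an empty pattern is rejected outright.
        raise ValueError("empty pattern is not handled in this sketch")
    count = 0
    # Only offsets where the whole pattern still fits need to be checked.
    for offset in range(len(row) - len(pat) + 1):
        if row[offset:offset + len(pat)] == pat:
            count += 1
    return count


print(count_overlapping("aaaaaa", "aaa"))  # 4: matches at offsets 0, 1, 2, 3
print("aaaaaa".count("aaa"))               # 2: str.count skips overlapping matches
```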

A more accurate version of the pseudo-code would be something like

for i, row in enumerate(rows):
    count = 0
    offset = 0
    while offset + len(pat) <= len(row):
        if pat == row[offset : offset + len(pat)]:
            count += 1
            offset += len(pat)
        else:
            offset += 1
    output[i] = count
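The corrected loop can be checked against Python's `str.count` with a small standalone sketch. The helper name `count_non_overlapping` is hypothetical and this is not the implementation from #164. The empty-pattern branch is an addition: the loop as written would never advance for an empty `pat`, so the sketch returns `len(row) + 1`, which is what `str.count` returns in that case.

```python
def count_non_overlapping(row: str, pat: str) -> int:
    """Count non-overlapping occurrences of pat in row, scanning left to right."""
    if not pat:
        # The while-loop below would never advance for an empty pattern.
        # str.count("") returns len(row) + 1, so mirror that here.
        return len(row) + 1
    count = 0
    offset = 0
    while offset + len(pat) <= len(row):
        if row[offset:offset + len(pat)] == pat:
            count += 1
            # Jump past the whole match so the next match cannot overlap it.
            offset += len(pat)
        else:
            offset += 1
    return count


# Compare against the built-in on a few inputs, including overlapping cases.
for row in ["aaaaaa", "abcabc", "", "aXaaXaaa"]:
    for pat in ["aaa", "aa", "a", "bc", ""]:
        assert count_non_overlapping(row, pat) == row.count(pat)

print(count_non_overlapping("aaaaaa", "aaa"))  # 2
```

Advancing by `len(pat)` after a match is the only change from the first version; it is what prevents a character from being counted in two matches.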

In #164 we have an implementation of count that produces the same result as the pandas and Python functions.

Nice finding, I did ask myself this while reviewing your code. Can you comment in your implementation that this is done in such a way to match Python's behaviour?

Add an explanation as a comment in fletcher/algorithms/strings.py.