Add str.count
xhochy opened this issue
- ✔️ pandas function
- ✔️ Python function
- ❌ C++ STL function
- ❌ needs a regular expression library
- ✔️ no need for a Unicode database for capitalization
- ✔️ can pre-compute output size as return value is a numeric array
Pseudo-Code:
Inputs: pat
    output = IntArray(len(rows))
    for i, row in enumerate(rows):
        count = 0
        for offset in range(len(row)):
            # Compare the window of len(pat) characters starting at offset
            if pat == row[offset : offset + len(pat)]:
                count += 1
        output[i] = count
The provided pseudo-code produces a different result than the pandas count function when there are overlapping matches. Consider:
>>> s = pd.Series(['aaaaaa'])
>>> s.str.count('aaa')
0    2
dtype: int64
The pseudo-code above would instead find 4 matches, because it counts every offset at which the pattern occurs, including overlapping ones. Python's implementation of count produces the same result as pandas:
>>> 'aaaaaa'.count('aaa')
2
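To make the discrepancy concrete, here is a minimal Python sketch of the overlapping scan that the original pseudo-code appears to intend, assuming its slice was meant to be row[offset : offset + len(pat)]. The name count_overlapping is hypothetical and only used for illustration.

```python
def count_overlapping(row: str, pat: str) -> int:
    """Count every offset at which pat occurs, including overlapping matches."""
    count = 0
    # Only visit offsets where a full-length window still fits in the row.
    for offset in range(len(row) - len(pat) + 1):
        if row[offset : offset + len(pat)] == pat:
            count += 1
    return count


# Overlapping scan: matches at offsets 0, 1, 2 and 3.
print(count_overlapping("aaaaaa", "aaa"))  # 4
# Built-in str.count only counts non-overlapping matches.
print("aaaaaa".count("aaa"))  # 2
```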
A more accurate version of the pseudo-code would be something like:
    for i, row in enumerate(rows):
        count = 0
        offset = 0
        while offset + len(pat) <= len(row):
            if pat == row[offset : offset + len(pat)]:
                count += 1
                offset += len(pat)
            else:
                offset += 1
        output[i] = count
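As a rough, runnable sketch of the non-overlapping scan (not the implementation from the pull request), the same loop can be written in plain Python and checked against str.count. The names count_non_overlapping and count_rows are hypothetical, a plain list stands in for IntArray, and the empty-pattern branch is an added assumption so that the loop cannot spin forever when len(pat) is 0.

```python
from typing import List, Sequence


def count_non_overlapping(row: str, pat: str) -> int:
    """Count non-overlapping occurrences of pat in row, scanning left to right."""
    if not pat:
        # Assumption: mirror str.count, which returns len(row) + 1 for an
        # empty pattern. Without this guard the offset would never advance.
        return len(row) + 1
    count = 0
    offset = 0
    while offset + len(pat) <= len(row):
        if row[offset : offset + len(pat)] == pat:
            count += 1
            # Skip past the whole match so the next one cannot overlap it.
            offset += len(pat)
        else:
            offset += 1
    return count


def count_rows(rows: Sequence[str], pat: str) -> List[int]:
    """Apply the scan to every row; a list stands in for the IntArray output."""
    return [count_non_overlapping(row, pat) for row in rows]


rows = ["aaaaaa", "abcabc", "", "aaaa"]
print(count_rows(rows, "aaa"))  # [2, 0, 0, 1]
```

Note that pandas' str.count interprets the pattern as a regular expression, so this sketch only agrees with it for literal patterns without regex metacharacters.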
In #164 we have an implementation of count that produces the same result as the pandas / Python functions.
Nice finding, I did ask myself this while reviewing your code. Can you comment in your implementation that this is done in such a way to match Python's behaviour?
Add an explanation as a comment in fletcher/algorithms/strings.py.