xhochy / fletcher

Pandas ExtensionDType/Array backed by Apache Arrow

Home Page: https://fletcher.readthedocs.io/

str_concat benchmark

artemru opened this issue

Is there a benchmark for the str_concat operation?
On my local machine I tried a naive Python implementation and got better results than with NumbaStringArray:

import numpy as np
import pyarrow as pa

from fletcher._numba_compat import NumbaStringArray, buffers_as_arrays
from fletcher._algorithms import str_concat

a1 = pa.array(np.random.rand(10**6).astype(str).astype('O'))
a2 = pa.array(np.random.rand(10**6).astype(str).astype('O'))


%timeit pa.array([x + y for x, y in zip(a1.to_pandas(), a2.to_pandas())])
# 860 ms ± 6.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit str_concat(NumbaStringArray.make(a1), NumbaStringArray.make(a2))
# 1.11 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is this something you expect?

No, this is slower than expected. In #100 I removed the code cited above and replaced it with a better implementation that gives a 5x speedup, at least on my machine.
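For readers wondering what a buffer-level concatenation has to do, here is a minimal pure-Python sketch. It is not the code from #100 and not fletcher's API. It assumes the Arrow string layout of an offsets buffer plus a contiguous UTF-8 data buffer, ignores nulls (the validity bitmap), and uses hypothetical helper names (`encode`, `decode`, `concat_buffers`) chosen for illustration only.

```python
def encode(strings):
    """Pack Python strings into (offsets, data), mimicking Arrow's string layout."""
    offsets = [0]
    data = bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)


def decode(offsets, data):
    """Unpack (offsets, data) back into a list of Python strings."""
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]


def concat_buffers(off_a, data_a, off_b, data_b):
    """Element-wise concatenation of two string columns (nulls not handled)."""
    n = len(off_a) - 1
    if len(off_b) - 1 != n:
        raise ValueError("both columns must have the same length")

    # The output size is known up front, so the data buffer is allocated once.
    total = (off_a[n] - off_a[0]) + (off_b[n] - off_b[0])
    out = bytearray(total)
    out_offsets = [0] * (n + 1)

    pos = 0
    for i in range(n):
        a = data_a[off_a[i]:off_a[i + 1]]
        b = data_b[off_b[i]:off_b[i + 1]]
        out[pos:pos + len(a)] = a
        pos += len(a)
        out[pos:pos + len(b)] = b
        pos += len(b)
        out_offsets[i + 1] = pos
    return out_offsets, bytes(out)


off_a, data_a = encode(["ab", "", "é"])
off_b, data_b = encode(["c", "de", "f"])
out_offsets, out_data = concat_buffers(off_a, data_a, off_b, data_b)
result = decode(out_offsets, out_data)
```

The loop only copies bytes and never materialises intermediate Python string objects, which is the property a compiled (for example Numba) version would be expected to exploit. Whether that beats the list comprehension in practice depends on the implementation, as the timings above show.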