str_concat benchmark

Question

artemru opened this issue 5 years ago

Is there a benchmark for the str_concat operation?
On my local machine I tried a naive Python implementation and got better results than with NumbaStringArray:

import numpy as np
import pyarrow as pa

from fletcher._numba_compat import NumbaStringArray, buffers_as_arrays
from fletcher._algorithms import str_concat

a1 = pa.array(np.random.rand(10**6).astype(str).astype('O'))
a2 = pa.array(np.random.rand(10**6).astype(str).astype('O'))


%timeit pa.array([x + y for x, y in zip(a1.to_pandas(), a2.to_pandas())])
# 860 ms ± 6.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit str_concat(NumbaStringArray.make(a1), NumbaStringArray.make(a2))
# 1.11 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is this what you would expect?

Uwe L. Korn · Answer 1 · Tue Feb 04 2020 21:17:49 GMT+0800 (China Standard Time)

No, this is slower than expected. In #100 I removed the cited code and provided a better implementation that gives, at least on my machine, a 5x speedup.
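
For readers wondering what an element-wise concatenation over Arrow-style string buffers involves, here is a minimal pure-Python sketch. It is not the code from #100 and not fletcher's API. The function names (concat_string_buffers, to_buffers, from_buffers) are hypothetical. It assumes the usual Arrow string layout of an offsets buffer with n + 1 entries plus one contiguous data buffer, and it ignores nulls. The idea is to compute the output offsets first, allocate the output data once, and then copy byte slices into place.

```python
from array import array


def concat_string_buffers(offsets_a, data_a, offsets_b, data_b):
    """Element-wise concatenation of two Arrow-style string arrays.

    Each input is an (offsets, data) pair: offsets has n + 1 entries and
    string i occupies data[offsets[i]:offsets[i + 1]].
    Assumption: no nulls, so there is no validity bitmap to consult.
    """
    n = len(offsets_a) - 1
    if len(offsets_b) - 1 != n:
        raise ValueError("both arrays must have the same length")

    # First pass: output offsets follow directly from the input lengths.
    out_offsets = array("i", [0])
    for i in range(n):
        len_a = offsets_a[i + 1] - offsets_a[i]
        len_b = offsets_b[i + 1] - offsets_b[i]
        out_offsets.append(out_offsets[i] + len_a + len_b)

    # Second pass: allocate the output once, then copy byte slices into place.
    out_data = bytearray(out_offsets[n])
    for i in range(n):
        start = out_offsets[i]
        mid = start + (offsets_a[i + 1] - offsets_a[i])
        out_data[start:mid] = data_a[offsets_a[i]:offsets_a[i + 1]]
        out_data[mid:out_offsets[i + 1]] = data_b[offsets_b[i]:offsets_b[i + 1]]
    return out_offsets, bytes(out_data)


def to_buffers(strings):
    """Hypothetical helper: pack a list of str into (offsets, data)."""
    offsets = array("i", [0])
    data = bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)


def from_buffers(offsets, data):
    """Hypothetical helper: unpack (offsets, data) back into a list of str."""
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]
```

Because the two passes touch only flat integer and byte buffers, the same structure is the kind of loop that a JIT compiler such as Numba can handle well, whereas the list comprehension in the question creates a Python str object for every element.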