xhochy / fletcher

Pandas ExtensionDType/Array backed by Apache Arrow

Home Page: https://fletcher.readthedocs.io/


[FYI] Filtering Benchmark

dhirschfeld opened this issue · comments

If I convert a pa.Table to a pandas DataFrame I pay the cost of conversion up front, but then operations such as filtering seem to be ~2x faster than on a fletcher DataFrame:

In [37]: %time df = tbl.to_pandas()
Wall time: 602 ms

In [38]: df.dtypes
Out[38]: 
date_inserted_utc    datetime64[ns]
date_created_utc     datetime64[ns]
issue_date_utc       datetime64[ns]
data_provider                 int64
weather_station               int64
weather_variable              int64
value_date_utc       datetime64[ns]
value                       float64
dtype: object

In [39]: %time wv1 = df[df['weather_variable'] == 1]
Wall time: 423 ms
In [34]: %time df = fr.pandas_from_arrow(tbl)
Wall time: 2 ms

In [35]: df.dtypes
Out[35]: 
date_inserted_utc    fletcher_chunked[timestamp[us]]
date_created_utc     fletcher_chunked[timestamp[us]]
issue_date_utc       fletcher_chunked[timestamp[us]]
data_provider                fletcher_chunked[int64]
weather_station              fletcher_chunked[int64]
weather_variable             fletcher_chunked[int64]
value_date_utc       fletcher_chunked[timestamp[us]]
value                       fletcher_chunked[double]
dtype: object

In [36]: %time wv1 = df[df['weather_variable'] == 1]
Wall time: 897 ms

Just posting here in case benchmarks on real data are of interest.

More benchmarks:

In [49]: pk_cols = ['data_provider', 'weather_station', 'weather_variable','issue_date_utc', 'value_date_utc']

In [50]: pandas_df = pandas_df.set_index(pk_cols)

In [51]: fletcher_df = fletcher_df.set_index(pk_cols)

fletcher is only slightly slower for mean:

In [56]: %timeit pandas_df['value'].mean()
36.3 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [57]: %timeit fletcher_df['value'].mean()
45 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Interestingly, pandas seems to have some sort of performance bug for sum!

In [54]: %timeit pandas_df['value'].sum()
267 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: %timeit fletcher_df['value'].sum()
44.8 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
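The sum gap can be probed without fletcher at all: a sketch comparing pandas' NaN-aware reduction against summing the raw ndarray, which skips that machinery. Column name and size are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(100_000))

# pandas' Series.sum goes through its NaN-aware reduction path...
total_pandas = s.sum()

# ...while summing the underlying ndarray directly bypasses it.
total_numpy = s.to_numpy().sum()
```

If the ndarray sum is dramatically faster, the penalty is in pandas' reduction path rather than in the data itself.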

fletcher seems to have a pretty large performance penalty for joining and multiplying:

In [60]: %timeit (pandas_df['value']*pandas_df['value'])
57.8 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [61]: %timeit (fletcher_df['value']*fletcher_df['value'])
127 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

These are of interest, though it would be better to open several issues instead of a single collector issue (you can keep this one as a meta-issue). Then we can work through them one by one and gradually make fletcher faster.

Most of the behaviour is expected, but not all:

  • The filtering result is unexpected; I would have expected it to be at least as fast as pandas.
  • Arithmetic operations are currently expected to be 10-20% slower than in pandas, as we dispatch to numpy and then re-apply the validity mask afterwards. With newer Arrow versions, we should be able to use Arrow's own arithmetic kernels and be on par with pandas again.
  • The pandas performance penalty for sum could stem from its special NaN handling. Have you tried running it with bottleneck installed? That might improve pandas' performance.

The filtering result is unexpected; I would have expected it to be at least as fast as pandas.

I noticed earlier that Arrow's take is slower than numpy's take, so that might be related (and Wes is speeding up take right now).

The pandas performance penalty for sum could stem from its special NaN handling. Have you tried running it with bottleneck installed?

Or compare with a nullable-integer-dtyped column, where we got rid of this NaN-handling penalty (although this is only on master so far).
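The suggested comparison uses pandas' nullable Int64 extension dtype, which represents missing values as pd.NA instead of NaN; a small illustration with made-up values:

```python
import numpy as np
import pandas as pd

# Classic float64 column: missing values are NaN, summed via the
# NaN-aware reduction path.
float_s = pd.Series([1.0, np.nan, 3.0])

# Nullable Int64 extension dtype: missing values are pd.NA, tracked
# by a separate mask rather than a sentinel in the data.
int_s = pd.Series([1, None, 3], dtype="Int64")

float_s.sum()  # 4.0
int_s.sum()    # 4
```

Both skip missing values by default; timing the two sums on a large column isolates how much of the penalty is the NaN sentinel handling.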

it would be better to open several issues instead of a single collector issue

I might do that once I've written a script to generate some dummy data so others can reproduce it.

I do actually have bottleneck installed (and am on the current latest of essentially all packages/deps):

# Name                    Version                   Build  Channel
bottleneck                1.3.2            py37hbc2f12b_1    https://conda.anaconda.org/conda-forge

This project has been archived as development has ceased around 2021.
With the support of Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled.