xhochy / fletcher

Pandas ExtensionDType/Array backed by Apache Arrow

Home Page: https://fletcher.readthedocs.io/

Possible to factor out the null bytemap and/or use more of arrow compute API?

davesque opened this issue

I noticed that fletcher converts the null bitmap into a null bytemap as a step in many computations on arrays that contain null values. Do you have any interest in eventually factoring this step out, or in accepting PRs that do? I think that would involve a fair bit of custom Cython or Numba code that iterates manually over the null bitmap alongside the values buffer. But it might be worth doing: it could narrow the gap with Pandas, or even overtake it, on some of the benchmarks in your benchmarking suite.
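To make the distinction concrete, here is a minimal pure-Python sketch of the two approaches. It is not fletcher's code: the function names are hypothetical, and a real implementation would be written in Cython or Numba for speed. It assumes only the Arrow convention that validity bitmaps are packed least-significant-bit first, with a set bit marking a valid (non-null) slot.

```python
def bitmap_to_bytemap(bitmap, length, offset=0):
    """Expand a packed validity bitmap into one byte per element (1 = valid).

    This is the intermediate step the issue proposes to avoid.
    """
    return bytes(
        (bitmap[(offset + i) >> 3] >> ((offset + i) & 7)) & 1
        for i in range(length)
    )


def sum_via_bitmap(values, bitmap, offset=0):
    """Sum the valid entries by reading the bitmap directly, with no bytemap."""
    total = 0
    for i, value in enumerate(values):
        bit = offset + i
        # Byte index is bit // 8; position inside the byte is bit % 8 (LSB first).
        if (bitmap[bit >> 3] >> (bit & 7)) & 1:
            total += value
    return total


# Five slots; slots 1 and 3 are null, so bits 0, 2 and 4 are set.
values = [1, 0, 3, 0, 5]
bitmap = bytes([0b00010101])

bytemap = bitmap_to_bytemap(bitmap, len(values))
total = sum_via_bitmap(values, bitmap)
```

The bytemap costs one byte per element and an extra pass over the data, whereas the direct loop touches only the packed bitmap and the values buffer.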

Also, I noticed a number of other places where it might be possible to make simple calls to the Arrow compute API. As an experiment, I modified the FletcherBaseArray.sum method to call pyarrow.compute.sum directly. This means you can no longer control null handling via skipna. However, it speeds things up considerably (35-40% faster than either Pandas or Fletcher). It makes me wonder whether it would be worth implementing more of Fletcher's internals via Cython and Arrow's compute API.

What are your thoughts on these things?

This project has been archived, as development ceased around 2021.
With the support of Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled.