brandtbucher / automap

High-performance autoincremented integer-valued mappings. 🗺️

AutoMap initialization from `np.ndarray` appears to be much slower than from `list`

flexatone opened this issue:

In investigating StaticFrame performance, I observed that initializing an AutoMap or FrozenAutoMap from an np.ndarray appears to be an order of magnitude slower than using a Python list. Even with the added overhead of calling tolist() on the array first, overall performance is better. Should we only deliver lists to these constructors, or is there something we can optimize in the implementation of initialization from arrays?

>>> labels_list = list(range(10_000))
>>> labels_array = np.array(labels_list)                                                  
>>> %timeit automap.FrozenAutoMap(labels_list)                                            
107 µs ± 502 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit automap.FrozenAutoMap(labels_array)                                           
1.01 ms ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit automap.FrozenAutoMap(labels_array.tolist())                                  
307 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Notice also that the Pandas Int64Index has the inverse characteristics (lists are slower than arrays), and that its initialization from an array is nearly an order of magnitude faster than AutoMap's initialization from a list.

>>> type(pd.Index(labels_array))                                                          
pandas.core.indexes.numeric.Int64Index
>>> %timeit pd.Index(labels_list)                                                         
851 µs ± 8.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit pd.Index(labels_array)                                                        
18 µs ± 82.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)                                                                               

I wonder if AutoMap can take advantage of receiving immutable NumPy arrays from StaticFrame to avoid copying keys entirely.

Internally, automap objects store their keys in a list. Upon instantiation, the equivalent of list(keys) is called on the keys passed in.

For lists, this is a simple memcpy + Py_INCREF for each element. For other iterables, however, this requires iterating through each element using the iteration protocol, which is slower. I believe this accounts for most of the difference:

>>> keys_list = list(range(10_000))
>>> keys_array = np.arange(10_000)
>>> %timeit list(keys_list)
21.5 µs ± 14 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit list(keys_array)
639 µs ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I think the rest of the difference can be attributed to the hashing speed of a Python int vs a NumPy int64. It appears that hashing the latter is noticeably slower:

>>> %timeit set(keys_list)
107 µs ± 982 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> keys_array_list = list(keys_array)
>>> %timeit set(keys_array_list)
234 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(Note that list(keys_array) creates a list of np.int64 objects, but keys_array.tolist() creates a list of int objects!)
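A quick check in the same session makes the distinction visible (on a typical 64-bit build, where np.arange defaults to int64):

>>> type(list(keys_array)[0])
<class 'numpy.int64'>
>>> type(keys_array.tolist()[0])
<class 'int'>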

> Notice also that the Pandas Int64Index has the inverse characteristics (lists are slower than arrays), and that its initialization from an array is nearly an order of magnitude faster than AutoMap's initialization from a list.

I'm guessing that they store their keys as a NumPy array under-the-hood. Copying those is even faster than lists, since you just do the memcpy (or take a view) and skip the Py_INCREF step entirely!
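For a rough NumPy-level illustration of why (this is standard ndarray behavior, not a claim about Pandas internals):

>>> copied = keys_array.copy()  # duplicates the raw buffer; no per-element refcounting
>>> view = keys_array[:]        # a basic slice is just a view, no copy at all
>>> view.base is keys_array
True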

Thanks for these thoughts. Independent of what Int64Index is doing, it would be great if AutoMap could more efficiently read-in a 1D NumPy array. At present I am calling tolist() prior to calling AutoMap, and that delivers valuable performance benefits. However, it is creating a potentially large intermediary list that (I believe) we are just going to throw away.

Yeah, that seems like a valuable avenue for exploration!

Hi @brandtbucher : I have some ideas for how to optimize the scenario of creating FrozenAutoMap from an immutable array and would like to attempt an implementation.

My idea is to have AutoMap use the current fam_new (renamed am_new): there, we always want to get a fresh list of keys via PySequence_List(keys).

Then, we implement a new fam_new that checks whether the passed-in keys is an immutable array and, if so, simply holds on to it (with a Py_INCREF), without calling PySequence_List(keys). Some (or all) usages of self->keys will need to be updated to handle an array rather than a list, but I think those changes will be modest.
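Sketched in Python for clarity (the real change would live in the C constructor; the function name here is illustrative only):

import numpy as np

def select_keys(keys):
    # Proposed dispatch: hold an immutable ndarray as-is; copy anything
    # else into a fresh list, mirroring the current PySequence_List(keys).
    if isinstance(keys, np.ndarray):
        if keys.flags.writeable:
            raise TypeError("arrays must be immutable")
        return keys  # no copy; the C version would just Py_INCREF
    return list(keys)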

What I am not sure about is whether this new FrozenAutoMap should support both array and non-array input (making the necessary call to PySequence_List(keys) when, say, a mutable iterable is given). Or, instead, we could keep FrozenAutoMap the same and create an alternative subtype, FrozenArrayAutoMap, that only accepts immutable arrays. This might make the implementation cleaner, but it introduces a new type and new methods.

Of course, since we will be using the NumPy API this will also make the entire package dependent on NumPy, but I am familiar with what is necessary from ArrayKit.

Let me know your thoughts; happy to have a chat if easier to discuss.

I'm still open to letting people explore this (especially since StaticFrame is our biggest (and only?) client), but it still seems to me that the design would be much more complicated and less performant than you suggest.

For example, we benefit greatly from the knowledge that the keys member is always a list of strong references in critical parts of the logic for lookups, table construction/resizing, and iteration over views. With this change, much of that logic would probably be either slowed down or duplicated.

I'm also not entirely convinced of the benefit (something that an implementation and performance numbers alone can probably answer). We still have to build a hash table out of something, and without concrete Python objects, it seems to me that we either need to:

  • ...have NumPy realize them for us as native Python types whenever hashing or equality comparisons are needed. But we need them all for table construction anyways, so this is no better than array.tolist().

  • ...have NumPy realize them for us as scalar types. This has the same drawback as the above bullet, plus, as we've seen in this issue, they are more expensive to construct, hash, and compare.

  • ...or, have AutoMap maintain an intimate knowledge of how to compute hash and equality for the full spectrum of NumPy dtypes (which for me is a total non-starter).

My current view is that this is a ton of work (and likely slowdown for the non-NumPy cases) for something that probably won't perform much better than just having the caller do array.tolist() first instead.

TL;DR: Feel free to attempt an implementation, but I don't personally think it's worth our time and likely won't merge it if it adds significant complexity for negligible gain!

Many thanks, @brandtbucher , for your thoughts on this. As the overwhelming use case in StaticFrame is initialization of FrozenAutoMaps from immutable NumPy arrays, I still think this is worth exploring. Of course, I absolutely do not want significant complexity for negligible gain!

Your point about the additional cost of comparing and hashing NumPy scalars is certainly relevant. But I wonder if, instead of creating NumPy scalars, we can use the underlying byte data (for non-object dtypes). This is kind of a shortcut to what you refer to as "intimate knowledge of how to compute hash and equality for the full spectrum of NumPy dtypes", but it is actually quite simple and performant to access and use for a known (non-object) dtype. I have not thought through all the implications; this is just preliminary speculation.
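A minimal sketch of the idea in Python, assuming a contiguous non-object array (everything here, including the dict standing in for the hash table, is illustrative only):

import numpy as np

keys_array = np.arange(10_000, dtype=np.int64)
itemsize = keys_array.dtype.itemsize
raw = keys_array.tobytes()  # the array's underlying contiguous byte buffer

# Map each key's fixed-width byte chunk to its position, without ever
# materializing a Python or NumPy scalar during construction.
table = {raw[i * itemsize:(i + 1) * itemsize]: i
         for i in range(len(keys_array))}

# A lookup must first encode the probe value into the same dtype's bytes:
assert table[np.int64(42).tobytes()] == 42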

I will let you know if I can make any progress on this.

> Your point about the additional cost of comparing and hashing NumPy scalars is certainly relevant. But I wonder if, instead of creating NumPy scalars, we can use the underlying byte data (for non-object dtypes). This is kind of a shortcut to what you refer to as "intimate knowledge of how to compute hash and equality for the full spectrum of NumPy dtypes", but it is actually quite simple and performant to access and use for a known (non-object) dtype.

It does seem attractive to just say "each n-length chunk of bytes is a key" and define simple(ish?) hash and equality functions for them, but that only works when compared to other chunks of bytes. As soon as types are involved, it gets much nastier. For example, how do I know if a (Python or NumPy) integer compares equal to a chunk of bytes in my floating-point array? If the answer is "convert it to a scalar of the correct dtype first, and use those bytes", then it seems like we're just shifting work from the initialization to every element access. AutoMap already goes to great lengths to avoid performing any extra work (allocations, equality comparisons between Python objects, etc.) during element access, and we still only beat Python's dict lookup times by a tiny bit (something like 5%, last time I measured).
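To make that concrete (byte values shown for a little-endian machine): the same value 42 has entirely different byte representations across dtypes, so a raw-bytes table over a float64 array cannot directly answer a probe encoded as an integer:

>>> import numpy as np
>>> np.int64(42).tobytes()
b'*\x00\x00\x00\x00\x00\x00\x00'
>>> np.float64(42).tobytes()
b'\x00\x00\x00\x00\x00\x00E@'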

If we're okay limiting the element accesses to NumPy scalars of the exact same dtype, then sure, it could work and probably be performant. But that assumes that your users are okay with keeping any value that could potentially be looked up as a key in the mapping (strings, ints, etc.) as NumPy scalars, or else suffer a surprising performance hit when indexing.

But now I'm weighing this additional burden on the user against the cost of just having SF do array.tolist() on index creation. Is that move (plus AutoMap's subsequent copy of the list) really a bottleneck in real index creation workloads?

> AutoMap already goes to great lengths to avoid performing any extra work (allocations, equality comparisons between Python objects, etc.) during element access, and we still only beat Python's dict lookup times by a tiny bit (something like 5%, last time I measured).

As a mostly unrelated performance side note: somebody shared a talk on the super interesting design of Google's SwissTable, which has design goals similar to ours but uses SIMD instructions in really interesting ways. I've been thinking about prototyping an implementation of it for AutoMap, but haven't found the time. So this situation could certainly improve if I (or somebody else) pick up that work.

> Is that move (plus AutoMap's subsequent copy of the list) really a bottleneck in real index creation workloads?

Another note: if the list copy is shown to be expensive, a simple solution could be to provide an alternate constructor that just skips it and grabs a reference.
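Purely as an illustration of those semantics (no such constructor exists in automap today, and the names are made up):

class FrozenAutoMapSketch:
    def __init__(self, keys):
        self._keys = list(keys)  # current behavior: defensive copy

    @classmethod
    def from_owned_list(cls, keys):
        # Hypothetical alternate constructor: trust the caller and store
        # the passed-in list directly, skipping the copy entirely.
        self = cls.__new__(cls)
        self._keys = keys
        return self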

Yes, I considered your point about how to compare non-NumPy types with a byte-encoded table and agree: it would likely end up just shifting the work... but maybe that is worth it. I also thought about just holding a reference to the passed-in list, but having a lingering reference to a mutable list outside of the AutoMap seems too sketchy!

As for the cost of the tolist() call, we can see that it accounts for roughly two-thirds of the total time to create a StaticFrame Index (166 µs of 260 µs below). And in the vast majority of cases, we are creating these Index objects from immutable arrays.

>>> keys_array = np.arange(10_000)  
>>> %timeit sf.Index(keys_array)            
260 µs ± 3.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit automap.FrozenAutoMap(keys_array.tolist())                 
247 µs ± 5.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit keys_array.tolist()                           
166 µs ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> keys_list = keys_array.tolist()     
>>> %timeit automap.FrozenAutoMap(keys_list)      
85.3 µs ± 3.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Further, and relating to some of your other comments, it might even be better to sacrifice some lookup performance for instantiation performance. As StaticFrame inevitably creates many derived containers (each with its own indices), we might end up spending more time creating indices than looking up values in them.

Wow, that is expensive. I had no idea it accounted for two-thirds of the total creation time!

I'm still wary of the impact on lookup performance, but you have convinced me that array.tolist() is clearly a bottleneck to index creation. Explore away!

Updating the examples shown above:

Current AutoMap on master:

>>> from automap import FrozenAutoMap
>>> import numpy as np
>>> keys_array = np.arange(10_000)
>>> keys_array.flags.writeable = False
>>> keys_list = keys_array.tolist()
>>> %timeit FrozenAutoMap(keys_list)
272 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit FrozenAutoMap(keys_array)
658 µs ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

With the changes in branch #16:

>>> from automap import FrozenAutoMap
>>> import numpy as np
>>> keys_array = np.arange(10_000)
>>> keys_array.flags.writeable = False
>>> keys_list = keys_array.tolist()
>>> %timeit FrozenAutoMap(keys_list)
276 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit FrozenAutoMap(keys_array)
206 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)