dgilland / pydash

The kitchen sink of Python utility libraries for doing "stuff" in a functional way. Based on the Lo-Dash Javascript library.

Home Page: http://pydash.readthedocs.io


pick is quite slow.

ChristerNilsson opened this issue · comments

My small function below is 1000 times faster:

def pick(row,cols):
	result = {}
	for col in cols:
		result[col] = row[col]
	return result

rows is an array of 121400 dicts.
Each dict has 14 key-value pairs.
My timing was 279 seconds with pydash.pick, compared with 0.277 seconds for the function above.
I tried to find the bottleneck in the pydash code, but got lost.
(Windows 10, python 3.7)
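For reference (an addition, not from the original report), the same workaround can be written as a dict comprehension, which behaves identically to the loop above:

```python
def pick(row, cols):
    # Shallow pick: copy only the requested top-level keys.
    return {col: row[col] for col in cols}

# Example usage on a row shaped like the ones described above.
row = {"a": 1, "b": 2, "c": 3}
print(pick(row, ["a", "b"]))  # {'a': 1, 'b': 2}
```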

Pick is typically used against an object like a dictionary, but you say that you're passing an array of dictionaries to it.

Do you mean that you're calling pick on each dictionary in an array of 121400 dictionaries, or that you passed the whole array of dictionaries to pick?

Yes.

My structure is [{}..{}]

rows = []
rows.append({'a':1, ... 'z':14}) # 121400 of these

result = []
for row in rows:
  result.append(pick(row,['a'..'z']))

Calling pick 121400 times

The reason pick is so slow is that the underlying function that computes the result supports a list of keys (e.g. ['a', 'b', 'c']), keys that refer to nested objects (e.g. ['a.b.c', 'a.b.d', 'a.b.e']), and predicate functions that return True/False for whether a key should be picked.

The biggest performance hit comes from the function used to build the final result, which needs to support creating nested objects (for when deep paths are specified). Unfortunately for performance, both the shallow and deep path cases are handled by the same setter, which is the main cause of the issue.

pick used to not support deep key paths and had an implementation similar to the basic use case you described. To get that kind of performance back while still supporting deep paths, there would probably need to be some short circuiting so that a shallow pick falls through to the faster implementation.
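A rough sketch of what that short circuit could look like (hypothetical, not pydash's actual code; deep_pick below is a placeholder for the existing deep-path machinery, which is not shown):

```python
def deep_pick(obj, keys):
    # Placeholder for the existing deep-path implementation (not shown).
    raise NotImplementedError("deep-path handling omitted from this sketch")

def pick(obj, keys):
    # Fast path: every key is a plain string with no '.' separator, so a
    # shallow copy of the requested keys is all that's needed.
    if all(isinstance(key, str) and "." not in key for key in keys):
        return {key: obj[key] for key in keys if key in obj}
    # Slow path: deep paths like 'a.b.c' fall back to the nested setter.
    return deep_pick(obj, keys)

print(pick({"a": 1, "b": 2}, ["a"]))  # {'a': 1}
```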

Thanks for your answer!

Maybe the documentation should mention the a.b.c deep-path support, and also how to get better performance with an alternative. I haven't analyzed how lodash does it; maybe they have the same issue.

I tried to do the same thing with JavaScript and lodash: 121400 x 14 cells.
lodash.pick took 1000 ms and my explicit loop took 150 ms.
lodash.pick has the same path handling as pydash.pick, e.g. ['a.b.c'].
This indicates that the pydash code consumes too much execution time.

https://observablehq.com/@christer/test-av-pick

Yes, the pydash.pick implementation isn't optimized for performance right now. I have never been very satisfied with how I implemented the deep-object building parts of the code.

But if you want a Python library that is similar to pydash but leaner, with better performance, check out my other library fnc. It doesn't have all the same functions as pydash, but it does have most of the core functionality. However, the call signatures are different from pydash (typically the object is the last argument instead of the first), and it's generator based for most of the sequence functions. But it should yield better performance than pydash (and if not, it's a lot easier for me to optimize fnc than pydash in most cases).

I went ahead and did some timings with your base pick implementation compared to fnc with the following:

import time

import fnc

def base_pick(obj, cols):
    result = {}
    for col in cols:
        result[col] = obj[col]
    return result

def base_map_pick(items, cols):
    return [base_pick(item, cols) for item in items]

def fnc_map_pick(items, cols_set):
    # NOTE: cols must be a set for fnc.map to return picked objects.
    return list(fnc.map(cols_set, items))

def timeit(label, func, *args):
    start_time = time.time()
    func(*args)
    end_time = time.time()
    print(f"{end_time - start_time:.4f}s", label)

count = 121400
items = [{"a": i} for i in range(count)]

timeit('base', base_map_pick, items, ["a"])
timeit('fnc', fnc_map_pick, items, set(["a"]))

The timings for that looked like:

0.0705s base
0.1300s fnc

So fnc took about 2x longer than the base case which is a far cry from pydash.pick which is like a gazillion times slower 🙃