dgilland / pydash

The kitchen sink of Python utility libraries for doing "stuff" in a functional way. Based on the Lo-Dash Javascript library.

Home Page: http://pydash.readthedocs.io

pydash.get is slow

av1m opened this issue · comments

I'm very interested in being able to retrieve data from nested lists/dictionaries.
I wanted the equivalent of jsonata in Python, but I couldn't find anything.
So I wrote my own function, and then I wondered how well it performed.

That is how I discovered pydash and pydash.get.
I find the project amazing, but when I compared pydash.get with the function I had written, I was shocked by the difference. I also put a comparison in this gist.

I tested my code with Python 3.10.1 on macOS (M1).

I took a look at your gist and reproduced similar results locally.

Inspecting pydash.get, I found that your implementation does not handle the same scenarios that pydash does. Some things pydash.get does differently, which make it slower:

  • Path keys can be like "foo.bar.0" and like "foo.bar[0]".
  • Path key delimiters can be escaped with backslashes like "foo\.bar" to get keys that contain ".".
  • Path keys like "0" will work for both integers and string keys in the target object (e.g. {0: True} and {"0": True} can be accessed with "0" as the path key).
  • In addition to accessing list/dicts, pydash supports namedtuples and class objects (i.e. attribute access).
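To make the extra work concrete, here is a minimal sketch of a single lookup step that covers the last two bullets. It is not pydash's code: `lookup_key` and `_as_int` are hypothetical helpers, and the order in which the fallbacks are tried is an assumption.

```python
from collections import namedtuple


def _as_int(key):
    """Return key as an int if it is one or parses as one, else None."""
    if isinstance(key, bool):
        return None
    if isinstance(key, int):
        return key
    if isinstance(key, str):
        try:
            return int(key)
        except ValueError:
            return None
    return None


def lookup_key(obj, key, default=None):
    """Hypothetical single-step lookup: dict key, sequence index, then attribute."""
    if isinstance(obj, dict):
        # Try the key as given, then its integer form ("0" -> 0).
        if key in obj:
            return obj[key]
        index = _as_int(key)
        if index is not None and index in obj:
            return obj[index]
        return default
    if isinstance(obj, (list, tuple)):
        # Digit strings are accepted as indexes; other keys fall through
        # to attribute access so namedtuple fields still work.
        index = _as_int(key)
        if index is not None:
            try:
                return obj[index]
            except IndexError:
                return default
    if isinstance(key, str):
        return getattr(obj, key, default)
    return default


Point = namedtuple("Point", ["x", "y"])
print(lookup_key({0: True}, "0"), lookup_key({"0": True}, "0"))
print(lookup_key(Point(1, 2), "y"))
```

Every branch here is a type check or a fallback that a plain `data[key]` lookup never pays for, which is part of the cost of the extra flexibility.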

But ignoring most of the differences, I found that the biggest time sink is the regular expressions pydash uses to parse the path keys (i.e. to support deep access with "items.0"/"items[0]" and backslash-escaping):

pydash/src/pydash/utilities.py

Lines 1265 to 1284 in 24ad0e4

def to_path_tokens(value):
    """Parse `value` into :class:`PathToken` objects."""
    if pyd.is_string(value) and ("." in value or "[" in value):
        # Since we can't tell whether a bare number is supposed to be dict key or a list index, we
        # support a special syntax where any string-integer surrounded by brackets is treated as a
        # list index and converted to an integer.
        keys = [
            PathToken(int(key[1:-1]), default_factory=list)
            if RE_PATH_LIST_INDEX.match(key)
            else PathToken(unescape_path_key(key), default_factory=dict)
            for key in filter(None, RE_PATH_KEY_DELIM.split(value))
        ]
    elif pyd.is_string(value) or pyd.is_number(value):
        keys = [PathToken(value, default_factory=dict)]
    elif value is UNSET:
        keys = []
    else:
        keys = value
    return keys

That bit of code takes up a large percentage of the overall execution time, but there is a way around that to improve performance without changing anything in pydash: use a list of path keys instead of a string. So instead of pydash.get(data, "0.repo.url") it would be pydash.get(data, [0, "repo", "url"]). That bypasses the regular expression evaluations and helps speed things up significantly.
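The effect can be reproduced without pydash. The sketch below is a stdlib-only stand-in, not pydash's implementation: `RE_DELIM` is a simplified guess at a delimiter pattern, and `to_keys` and `get` are hypothetical names. It only shows that a string path pays for a regex split on every call while a list path skips it.

```python
import re
import timeit

# Simplified stand-in for a path delimiter regex (an assumption, not pydash's
# real pattern): split on unescaped "." and capture "[N]" list indexes.
RE_DELIM = re.compile(r"(?<!\\)\.|(\[\d+\])")


def to_keys(path):
    """Return a list of keys; only string paths go through the regex."""
    if isinstance(path, str):
        keys = []
        for part in filter(None, RE_DELIM.split(path)):
            if part.startswith("[") and part.endswith("]"):
                keys.append(int(part[1:-1]))  # "[0]" -> 0
            else:
                keys.append(part.replace("\\.", "."))  # unescape "\."
        return keys
    return list(path)


def get(data, path, default=None):
    """Walk data along path, returning default when a step is missing."""
    for key in to_keys(path):
        try:
            if isinstance(data, list) and isinstance(key, str):
                key = int(key)  # allow "0" to index a list
            data = data[key]
        except (KeyError, IndexError, TypeError, ValueError):
            return default
    return data


data = [{"repo": {"url": f"https://example.com/{i}"}} for i in range(100)]

t_str = timeit.timeit(lambda: get(data, "0.repo.url"), number=20_000)
t_list = timeit.timeit(lambda: get(data, [0, "repo", "url"]), number=20_000)
print(f"string path: {t_str:.4f}s  list path: {t_list:.4f}s")
```

The absolute numbers depend on the machine, so none are quoted here; the point is only the relative gap between the two calls.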

I also have another library, fnc, that is more performant and has similar features (but the argument order is different, so it would be fnc.get([0, "repo", "url"], data) instead).

If I update your gist to use the following pydash and fnc implementations:

import pydash
import fnc

def get_pydash(data):
    return [pydash.get(data, [i, "repo", "url"]) for i in range(len(data) - 1)]

def get_fnc(data):
    return [fnc.get([i, "repo", "url"], data) for i in range(len(data) - 1)]

Then I get timing profiles like this:

Execution time for getdeep: 5.847658754999999
Execution time for pydash.get: 16.512301255
Execution time for fnc.get: 8.430758072

Still not as fast as your implementation, but not nearly as bad (with fnc being twice as fast as pydash).

Thank you for this comprehensive feedback.

Unless a builtin is involved or it isn't possible, I usually compare like with like: only list paths or only str paths.

Regarding the use of a list or a str, I have the same constraint. The list is faster and the str must be parsed into a list...

I think it would be worth favoring lists, and mentioning in the documentation that they are faster.

A possible improvement for pydash.get would be to inspect the str first and minimize parsing time. A basic path such as "0.repo.url" shouldn't need a very complex algorithm.
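One way this could look, as a rough sketch rather than a proposed patch: take a fast path when the string contains no brackets or backslashes, and memoize the result so a repeated path is parsed only once. `parse_path` and `RE_DELIM` are hypothetical names, and the regex is a simplified stand-in, not pydash's pattern.

```python
import re
from functools import lru_cache

# Simplified stand-in for the delimiter regex (assumption, not pydash's own).
RE_DELIM = re.compile(r"(?<!\\)\.|(\[\d+\])")


@lru_cache(maxsize=1024)
def parse_path(path: str) -> tuple:
    """Parse a string path into a tuple of keys, caching the result."""
    # Fast path: no "[N]" indexes and no escapes, so str.split is enough.
    if "[" not in path and "\\" not in path:
        return tuple(part for part in path.split(".") if part)
    # Slow path: regex split that handles "[N]" and escaped dots.
    keys = []
    for part in filter(None, RE_DELIM.split(path)):
        if part.startswith("[") and part.endswith("]"):
            keys.append(int(part[1:-1]))
        else:
            keys.append(part.replace("\\.", "."))
    return tuple(keys)


print(parse_path("0.repo.url"))
print(parse_path("items[0].name"))
print(parse_path.cache_info())
```

A tuple is returned so the cached value cannot be mutated by callers. The cache only helps when the same path strings recur, as they do in the benchmark loop above.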

If it helps, you can use the gist implementation.

+1 for list of path keys. In JS everything is an object and a numeric key can be indexed with a string or a number. Python doesn't and shouldn't work this way. IMO you should deprecate the string behavior and switch it to List[Union[str, int]] and index clearly so pydash.get(obj, ["key", 0, "item_key"]) or pydash.get(obj, ["key", "0", "item_key"]) reflect exactly what is intended.
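A minimal sketch of what such a strict getter might look like, assuming int keys are only used as given and str keys are never coerced. `get_strict` and `PathKey` are hypothetical names, not part of pydash.

```python
from typing import Any, List, Union

PathKey = Union[str, int]


def get_strict(obj: Any, path: List[PathKey], default: Any = None) -> Any:
    """Walk obj along path with no str/int coercion between steps."""
    for key in path:
        is_index = isinstance(key, int) and not isinstance(key, bool)
        if is_index and isinstance(obj, (list, tuple)):
            # An int indexes a sequence only when it is in range.
            found = -len(obj) <= key < len(obj)
        elif isinstance(obj, dict):
            # Dict keys must match exactly: "0" and 0 are different keys.
            found = key in obj
        else:
            found = False
        if not found:
            return default
        obj = obj[key]
    return obj


as_list = {"key": [{"item_key": "from list"}]}
as_dict = {"key": {"0": {"item_key": "from dict"}}}

print(get_strict(as_list, ["key", 0, "item_key"]))
print(get_strict(as_dict, ["key", "0", "item_key"]))
```

With this shape, `["key", 0, "item_key"]` and `["key", "0", "item_key"]` can never resolve to the same value, so the path says exactly what the caller intends.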