google / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more

Home Page: http://jax.readthedocs.io/

jnp.argsort much slower than the numpy version

fbartolic opened this issue · comments

Here's a comparison of the JAX and numpy versions of argsort on a CPU:

import numpy as np
import jax.numpy as jnp
from jax import config, random

config.update('jax_platform_name', 'cpu')  # force JAX onto CPU for a fair comparison

key = random.PRNGKey(42)
key, subkey = random.split(key)

x_jnp = random.uniform(subkey, (100, 10000))
x_np = np.array(x_jnp)  # copy to a host-side NumPy array

%%timeit
np.argsort(x_np, axis=0)

%%timeit
jnp.argsort(x_jnp, axis=0).block_until_ready()

In this case jnp.argsort is ~5x slower than np.argsort. I'm seeing a >20x difference with more realistic arrays. Why is there such a large difference in performance between the two implementations?

You might find this FAQ entry helpful: Is JAX faster than NumPy?

Thanks! I read the FAQ, but I didn't expect that the difference in performance could get so large.

@jakevdp It seems to be a pure computational-efficiency problem with the sort primitive on CPU.
I find that the sort primitive's performance on GPU is satisfactory, and the primitive shares the same MLIR lowering on all platforms. Maybe XLA uses a parallelism-friendly sort algorithm that is inefficient on CPU.

import numpy as np
import jax.numpy as jnp
from jax import config, random
config.update('jax_platform_name', 'cpu')

key = random.PRNGKey(42)
key, subkey = random.split(key)

x_jnp = random.uniform(subkey, (1000000,))
x_np = np.array(x_jnp)

jnp.argsort(x_jnp, axis=0).block_until_ready() # compile
jnp.sort(x_jnp, axis=0).block_until_ready() # compile
from timeit import timeit
print(timeit('np.argsort(x_np, axis=0)', globals=globals(), number=10)) # 1.1s
print(timeit('jnp.argsort(x_jnp, axis=0).block_until_ready()', globals=globals(), number=10)) # 4.2s
print(timeit('jnp.sort(x_jnp, axis=0).block_until_ready()', globals=globals(), number=10)) # 3.7s

Yes, in general the XLA project has put much less effort into optimizing operations on CPU than on other backends.

I also note that the slowness is specific to floating-point values. Sorting int32 values is significantly faster. The only difference between the two as far as I can tell is the comparison function.
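A quick way to check that observation is to time argsort on float32 and int32 arrays of the same size (the array size and iteration count below are arbitrary choices, not the thread's original benchmark):

```python
import numpy as np
import jax.numpy as jnp
from jax import config, random
from timeit import timeit

config.update('jax_platform_name', 'cpu')  # run on CPU, where the slowdown appears

key = random.PRNGKey(0)
x_f32 = random.uniform(key, (1_000_000,))              # float32 values
x_i32 = random.randint(key, (1_000_000,), 0, 10**6)    # int32 values

# Warm up first so compilation time is excluded from the measurement.
jnp.argsort(x_f32).block_until_ready()
jnp.argsort(x_i32).block_until_ready()

t_f32 = timeit(lambda: jnp.argsort(x_f32).block_until_ready(), number=10)
t_i32 = timeit(lambda: jnp.argsort(x_i32).block_until_ready(), number=10)
print(f'float32: {t_f32:.3f}s  int32: {t_i32:.3f}s')
```

If the comparison function is indeed the culprit, the int32 timing should come out noticeably lower.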

Running into the same issue, I created a workaround where argsort runs under NumPy when only the CPU is available.

https://gist.github.com/sjdv1982/803695055c78b62e5d5dc92a004efa77

It seems to be compatible with jax.grad, but only after disabling a certain assertion in the JAX code.

I am a beginner in JAX, so criticism is welcome; use with care.

That's a nice solution! To make it as compatible as possible with JAX transformations, I'd suggest doing the call to NumPy via jax.pure_callback instead.
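For reference, a minimal sketch of that approach (the helper name `np_argsort` and the int32 cast are illustrative assumptions, not the gist's actual code):

```python
import numpy as np
import jax
import jax.numpy as jnp

def np_argsort(x, axis=-1):
    # Tell JAX the shape and dtype the callback will return, so it can be traced.
    result_shape = jax.ShapeDtypeStruct(x.shape, jnp.int32)
    # pure_callback runs host-side NumPy code even inside jit-compiled functions.
    return jax.pure_callback(
        lambda arr: np.argsort(arr, axis=axis).astype(np.int32),
        result_shape,
        x,
    )

x = jnp.array([3.0, 1.0, 2.0])
print(jax.jit(np_argsort)(x))  # [1 2 0]
```

Because pure_callback declares the output's shape and dtype up front, the wrapped function composes with jit, vmap, and grad without any changes to JAX internals.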

Thank you! I didn't know about pure_callback; I have updated the gist as you suggested. It now runs under unmodified JAX.

I am glad to see that when calling jax.value_and_grad there are three identical calls into the function, but JAX is smart enough to coalesce them into one.