NVIDIA / warp

A Python framework for high performance GPU simulation and graphics

Home Page: https://nvidia.github.io/warp/

Using Warp with python threading

BernardoCovas opened this issue · comments

Is it possible to use Warp with Python threading? I frequently use Python's ThreadPoolExecutor to parallelize bound C++ code, and I have started using Warp alongside it. It seems to work with CUDA, as long as the kernels are loaded before thread execution (presumably because kernel registration uses global variables). However, CPU Warp code does not work as smoothly and frequently crashes. Will it be possible to use Warp code from independent Python threads? I presume it would also allow for parallel CPU Warp kernels (which currently seem to run sequentially).
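
Roughly the pattern I mean, as a minimal sketch (the kernel, sizes, and names here are just placeholders, not my actual code):

import warp as wp
from concurrent.futures import ThreadPoolExecutor

wp.init()


@wp.kernel
def scale(data: wp.array(dtype=wp.float32), factor: wp.float32):
    i = wp.tid()
    data[i] = data[i] * factor


def run(device: str):
    arr = wp.zeros(1024, dtype=wp.float32, device=device)
    wp.launch(scale, dim=arr.shape[0], inputs=[arr, 2.0], device=device)
    return arr.numpy()


# warm-up launch on the main thread so the kernel module is compiled
# and registered before any worker thread uses it
run('cuda:0')

with ThreadPoolExecutor() as pool:
    fs = [pool.submit(run, 'cuda:0') for _ in range(8)]
    results = [f.result() for f in fs]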

Thank you in advance

Hi @BernardoCovas, thanks for the report - indeed our CPU execution does not currently support multi-threaded execution.

The main reason for this is the use of a static variable (https://github.com/NVIDIA/warp/blob/main/warp/codegen.py#L2401) to hold the thread index during kernel grid iteration. I think we could easily avoid this by updating the CPU path to use a simple code generation approach that mirrors the way the CUDA code works.
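
As a rough illustration of the failure mode in plain Python (this is not Warp's actual generated code, just a model of one shared index being reused across launches):

import threading

# a single shared "current thread index", standing in for the static
# variable the CPU codegen writes to while iterating the launch grid
current_tid = 0


def fake_launch(dim, seen):
    global current_tid
    for i in range(dim):
        current_tid = i           # set the index for this grid element
        # ... the kernel body would run here and read the index ...
        seen.append(current_tid)  # a concurrent launch may have overwritten it


seen_a, seen_b = [], []
ta = threading.Thread(target=fake_launch, args=(100000, seen_a))
tb = threading.Thread(target=fake_launch, args=(5, seen_b))
ta.start(); tb.start()
ta.join(); tb.join()

# if the two launches truly interleave, seen_b can contain indices
# that belong to the larger grid (i.e. values >= 5)
print(max(seen_b))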

I will add a task to look into this.

Hi @mmacklin
I see, thank you for the response!

Hey @BernardoCovas, are you able to provide a self-contained reproducible example for this issue? I'd like to make sure your original issue is fixed.

Also, is this using the new GIL-free Python?

Hello @shi-eric , @mmacklin

Somewhat embarrassingly, since posting this I have not been able to reproduce the issue. It used to happen when some Warp processing was done before the main algorithm, inside a thread, while other threads were executing.

I re-tested with an older version (1.0.0), and it also did not happen there.

As soon as I come across it again, I will reach out. In the meantime, here is an example of what should cause the issue, as I initially understood it:

import typing as T

import warp as wp
import numpy as np

wp.init()


vec3f_1d : T.TypeAlias = wp.array1d(dtype=wp.vec3f) # type: ignore


@wp.func
def fn_compute(
        v0: wp.vec3f,
        v1: wp.vec3f):

    # keep busy
    for _ in range(10):
        v2 = wp.cross(v0, v1)
    return v2


@wp.kernel
def kn_compute(
        verts0: vec3f_1d,
        verts1: vec3f_1d,
        out: vec3f_1d):

    index = wp.tid()
    v0 = verts0[index]
    v1 = verts1[index]

    out[index] = fn_compute(v0, v1)


def test_device(i: int, N: int, device: str):

    verts0  = np.random.uniform(0, 1, (N, 3)).astype(np.float32)
    verts1  = np.random.uniform(0, 1, (N, 3)).astype(np.float32)
    res     = np.cross(verts0, verts1).astype(np.float32)

    # try multiple times
    for _ in range(100):

        out = wp.empty((N,), dtype=wp.vec3f, device=device)
        verts0_ = wp.from_numpy(verts0, dtype=wp.vec3f, device=device)
        verts1_ = wp.from_numpy(verts1, dtype=wp.vec3f, device=device)

        wp.launch(
            kn_compute,
            dim=(N,),
            device=device,
            inputs=[verts0_, verts1_],
            outputs=[out])

        out = out.numpy()

    close = np.isclose(res, out, atol=1e-7)
    close_1d = np.all(close, axis=-1)
    not_close_1d = ~close_1d
    not_close_1d, = np.nonzero(not_close_1d)

    assert np.all(close_1d), f'{out[not_close_1d[0]], res[not_close_1d[0]]}'

    return i, N


if __name__ == '__main__':

    from concurrent import futures
    import random

    thread_pool = futures.ThreadPoolExecutor()

    N_MIN = 1
    N_MAX = 1000000
    M = 100
    DEVICE = 'cuda:0'
    DEVICE = 'cpu'

    # NOTE:(b.covas): Load kernels
    test_device(0, 1, DEVICE)

    # NOTE:(b.covas): running with different dimensions
    fs = {thread_pool.submit(test_device, i, random.randint(N_MIN, N_MAX), DEVICE) for i in range(M)}

    for j, f in enumerate(futures.as_completed(fs)):
        i, n = f.result()
        print(f'{j}/{M}: completed run {i} with {n} elements')

The reason for trying multiple times was to create a situation in which two different threads are iterating through kernels of different sizes, expecting a segfault or a wrong result. Not only does this work, but all CPUs are busy. I'm confused, as this was a somewhat long-standing issue that forced me to use CUDA devices only.

Also, @shi-eric, this is using standard Python 3.10.

Thanks for the helpful example @BernardoCovas! I was also having trouble turning my mental picture of the issue into an example program that would verify my changes. Even after building a GIL-free Python 3.13 to test this issue with a ThreadPoolExecutor, I could not trigger the expected wrong behavior with the existing approach that uses a static variable for the CPU thread index, so there's still some more learning to do.

Hi @shi-eric, thank you for the help! Maybe I am not understanding the current approach. If a static variable is used, does that mean Warp never releases the GIL while executing kernels? If it did release it, two simultaneous kernels would (wrongly) use and increment the same variable. Is this right? I am asking because, looking at htop, all CPUs appear to be doing some work, yet the result is still verified.

Take this with a grain of salt because, as I said in my previous comment, there are still some things that tell me I don't have the full picture yet.

My understanding of the problem is that because a static variable is used in Warp, this gives the wrong behavior when more than one CPU kernel is being executed at the same time. However, my guess is that the Python GIL prevents more than one CPU Warp kernel from running at the same time when using a single Python process and multiple threads.

When running on a special CPython build that supports disabling the GIL (an upcoming option in Python 3.13), multiple Warp CPU kernels can execute concurrently under thread-based parallelism, and we should then see a thread-indexing issue because of the static thread index.
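
If you want to check which mode you are actually running under, something along these lines should work; note that sys._is_gil_enabled() may only be present on newer 3.13 builds, hence the getattr:

import sys
import sysconfig

# Py_GIL_DISABLED reports whether the interpreter was *built* free-threaded;
# sys._is_gil_enabled() (3.13+) reports whether the GIL is active right now
free_threaded_build = bool(sysconfig.get_config_var('Py_GIL_DISABLED'))
gil_check = getattr(sys, '_is_gil_enabled', None)

print('free-threaded build:', free_threaded_build)
print('GIL currently enabled:', gil_check() if gil_check is not None else 'unknown (pre-3.13)')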

Warp does not have any code that releases/reacquires the GIL, but numpy does. Maybe you see a speedup from multithreading despite the GIL because of this?
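
One way to check whether the CPU launches themselves ever overlap (rather than inferring it from overall CPU usage) is to record wall-clock intervals around each wp.launch from the worker threads. A minimal sketch, assuming CPU launches block until the kernel has finished:

import threading
import time

import warp as wp

wp.init()


@wp.kernel
def busy(out: wp.array(dtype=wp.float32)):
    i = wp.tid()
    s = float(0.0)
    for _ in range(1000):
        s += 1.0
    out[i] = s


intervals = []
lock = threading.Lock()


def timed_launch(n: int):
    out = wp.empty((n,), dtype=wp.float32, device='cpu')
    t0 = time.perf_counter()
    wp.launch(busy, dim=(n,), outputs=[out], device='cpu')
    t1 = time.perf_counter()
    with lock:
        intervals.append((t0, t1))


# warm up / compile on the main thread first
wp.launch(busy, dim=1, outputs=[wp.empty(1, dtype=wp.float32, device='cpu')], device='cpu')

threads = [threading.Thread(target=timed_launch, args=(200000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# if the GIL serializes CPU kernels, no two (start, end) intervals overlap
intervals.sort()
overlapping = any(b[0] < a[1] for a, b in zip(intervals, intervals[1:]))
print('overlapping launches:', overlapping)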

Yes, that is likely it. I use numpy quite a bit, and I also release the GIL basically everywhere in my own codebase, so I use the thread pool precisely for that speedup.
I very much appreciate the help so far. The ideal end state would be to safely allow multiple simultaneous kernels with the GIL released, to make full use of parallelism.
Thank you so far @shi-eric and @mmacklin

After printing the thread index at the beginning and end of the kernel, I can confirm that no two kernels are executing simultaneously: one always starts and ends before any other begins. @shi-eric

import typing as T

import warp as wp
import numpy as np

wp.init()


vec3f_1d : T.TypeAlias = wp.array1d(dtype=wp.vec3f) # type: ignore


@wp.func
def fn_compute(
        v0: wp.vec3f,
        v1: wp.vec3f):

    # keep busy
    for _ in range(1000):
        v2 = wp.cross(v0, v1)
    return v2


@wp.kernel
def kn_compute(
        verts0: vec3f_1d,
        verts1: vec3f_1d,
        out: vec3f_1d):

    index = wp.tid()
    v0 = verts0[index]
    v1 = verts1[index]

    wp.printf('start thread %d\n', index)
    out[index] = fn_compute(v0, v1)
    wp.printf('end thread %d\n', index)


def test_device(i: int, N: int, device: str):

    verts0  = np.random.uniform(0, 1, (N, 3)).astype(np.float32)
    verts1  = np.random.uniform(0, 1, (N, 3)).astype(np.float32)
    res     = np.cross(verts0, verts1).astype(np.float32)

    # try multiple times
    for _ in range(100):

        out = wp.empty((N,), dtype=wp.vec3f, device=device)
        verts0_ = wp.from_numpy(verts0, dtype=wp.vec3f, device=device)
        verts1_ = wp.from_numpy(verts1, dtype=wp.vec3f, device=device)

        wp.launch(
            kn_compute,
            dim=(N,),
            device=device,
            inputs=[verts0_, verts1_],
            outputs=[out])

        out = out.numpy()

    close = np.isclose(res, out, atol=1e-7)
    close_1d = np.all(close, axis=-1)
    not_close_1d = ~close_1d
    not_close_1d, = np.nonzero(not_close_1d)

    assert np.all(close), f'{out[not_close_1d[0]], res[not_close_1d[0]]}'

    return i, N


if __name__ == '__main__':

    from concurrent import futures
    import random

    thread_pool = futures.ThreadPoolExecutor()

    N_MIN = 1
    N_MAX = 1000000
    M = 100
    DEVICE = 'cuda:0'
    DEVICE = 'cpu'

    # NOTE:(b.covas): Load kernels
    test_device(0, 1, DEVICE)

    # NOTE:(b.covas): running with different dimensions
    fs = {thread_pool.submit(test_device, i, random.randint(N_MIN, N_MAX), DEVICE) for i in range(M)}

    for j, f in enumerate(futures.as_completed(fs)):
        i, n = f.result()
        print(f'{j}/{M}: completed run {i} with {n} elements')