cupy / cupy

NumPy & SciPy for GPU

Home Page: https://cupy.dev

Multi-GPU Unified memory not working

siddmittal opened this issue · comments

commented

Description

I have a multi-GPU system. I need to allocate several large arrays, but they cannot all fit onto a single GPU. Therefore, I want to use unified memory so that the program can automatically allocate arrays on all available GPUs.

In the code below, I am creating two large arrays, a1 and a2. Since the arrays are quite big (30 GB per array), they obviously can't both fit on a single GPU. My GPU cluster consists of four GPUs, each with 32 GB of memory. I want the program to use all available GPUs to allocate a1 and a2.

Problem: after GPU-1 fills up, the program falls back to CPU memory instead of using GPU-2/3/4 (which are fully available).
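You can verify which GPUs actually hold allocations with a small check like the one below (just a diagnostic sketch; nvidia-smi shows the same information):

    import cupy as cp

    # Print used/total memory on every visible GPU to see where
    # allocations actually land.
    for dev in range(cp.cuda.runtime.getDeviceCount()):
        with cp.cuda.Device(dev):
            free, total = cp.cuda.runtime.memGetInfo()
            print(f"GPU {dev}: {(total - free) / 1024**3:.1f} GiB used "
                  f"of {total / 1024**3:.1f} GiB")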

To Reproduce

    import cupy as cp
    import numpy as np

    # allocate unified memory - TO UTILIZE ALL AVAILABLE GPUs (NOT WORKING !!!)
    pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
    cp.cuda.set_allocator(pool.malloc)

    def get_big_array():
        # Desired memory in GB
        desired_memory_gb = 30

        # Calculate the number of elements required to achieve desired memory
        element_size_bytes = np.dtype(np.float64).itemsize
        desired_memory_bytes = desired_memory_gb * (1024**3)  # Convert GB to bytes
        num_elements = desired_memory_bytes // element_size_bytes

        # Create the array with the calculated number of elements
        array = cp.full(num_elements, 1.1, dtype=np.float64)

        return array

    def test_unified_memory_arrays():
        a1 = get_big_array()
        a2 = get_big_array()
        a3 = a1 + a2
        result = cp.all(a3 == 2.2) #cp.asnumpy(a3)
        return result

    ###########-----run program--------########
    res = test_unified_memory_arrays()
    print('finished...')   

Installation

Conda-Forge (conda install ...)

Environment

# Paste the output here

Additional Information

Problem: after GPU-1 fills up, the program falls back to CPU memory instead of using GPU-2/3/4 (which are fully available).

commented

use unified memory so that the program can automatically allocate arrays on all available GPUs

This is not the semantics that unified memory provides. See https://developer.nvidia.com/blog/unified-memory-cuda-beginners/.

We are aware of such use cases and are actively working on a feature called distributed ndarray, but it is still in a very early development phase.
https://docs.cupy.dev/en/latest/reference/distributed.html#module-cupyx.distributed.array

Indeed, the semantics of managed memory is not that a chunk of data is physically distributed across GPUs, but that it can be accessed from either the host or a device. For multiple GPUs, the accessibility guarantee is architecture-dependent
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#multi-gpu
and nothing is said about where the data physically resides. It can be anywhere and is migrated as needed.

Below is a simple snippet showing what it entails to back a cp.ndarray with managed memory:

>>> import cupy as cp
>>> pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
>>> cp.cuda.set_allocator(pool.malloc)
>>> # allocate 64 GiB of memory on a GPU with only 48 GiB mem
>>> # note that we don't specify any current device, so by default this happens on GPU 0
>>> a = cp.empty(64 * 1024**3, dtype=cp.int8)
>>> a.data.mem.size
68719476736
>>> # this takes a while due to data migration
>>> a[:] = 3  # run on GPU 0
>>> a
array([3, 3, 3, ..., 3, 3, 3], dtype=int8)
>>> with cp.cuda.Device(1):
...     a[:2] = 7  # run on GPU 1
... 
<stdin>:2: PerformanceWarning: The device where the array resides (0) is different from the current device (1). Peer access has been activated automatically.
>>> a[:10]  # run on GPU 0
array([7, 7, 3, 3, 3, 3, 3, 3, 3, 3], dtype=int8)

To ensure data is both logically and physically distributed across GPUs, the ongoing work on distributed ndarray is needed. It's not as simple as just switching the memory resource; a lot of work remains to be done.
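In the meantime, the data can be sharded across GPUs manually with ordinary per-device CuPy. The sketch below only covers the elementwise case from the reproducer, and the sizes are illustrative:

import cupy as cp

# Manually shard each logical array: every GPU owns one chunk, and all
# work on a chunk runs on the device that owns it.
num_gpus = cp.cuda.runtime.getDeviceCount()
total_elements = 4 * 1024**3          # illustrative size, split across GPUs
per_gpu = total_elements // num_gpus

chunks_a, chunks_b = [], []
for dev in range(num_gpus):
    with cp.cuda.Device(dev):
        chunks_a.append(cp.full(per_gpu, 1.1, dtype=cp.float64))
        chunks_b.append(cp.full(per_gpu, 1.1, dtype=cp.float64))

# Elementwise work stays local to each GPU; only small boolean results
# are brought back to the host and combined.
ok = []
for dev, (ca, cb) in enumerate(zip(chunks_a, chunks_b)):
    with cp.cuda.Device(dev):
        ok.append(bool(cp.all(ca + cb == 2.2)))
print(all(ok))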

commented

@leofang @kmaehashi Thanks for the reply. It seems like distributed arrays are more suitable for my needs. I am trying to create a distributed array based on the example provided here.

import cupy as cp
import cupyx

A = cupyx.distributed.array.distributed_array(
    cp.arange(6).reshape(2, 3),
    cupyx.make_2d_index_map([0, 2], [0, 1, 3],
                    [[{0}, {1, 2}]]))


B = cupyx.distributed.array.distributed_array(
    cp.arange(12).reshape(3, 4),
    cupyx.make_2d_index_map([0, 1, 3], [0, 2, 4],
                    [[{0}, {0}],
                    [{1}, {2}]]))

print('done')

but with the above code, I am getting a runtime exception: AttributeError: module 'cupyx' has no attribute 'distributed'.

In which package is distributed_array defined?

You need to install CuPy v13.0.0rc1 for now; see the instructions here: https://github.com/cupy/cupy/releases/tag/v13.0.0rc1
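Once a v13 pre-release is installed, note that the submodule also has to be imported explicitly. A sketch based on the documentation example above (the API is experimental and may still change):

import cupy as cp
# "import cupyx" alone does not expose cupyx.distributed.array;
# import the submodule explicitly (requires CuPy v13.0.0rc1 or later).
from cupyx.distributed.array import distributed_array, make_2d_index_map

A = distributed_array(
    cp.arange(6).reshape(2, 3),
    make_2d_index_map([0, 2], [0, 1, 3],
                      [[{0}, {1, 2}]]))
print(A.shape)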