karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from GitHub: https://github.com/karpathy/llm.c

Error while running "make train_gpt2fp32cu" on Ubuntu

arjungupta72 opened this issue

commented

I am trying to train using CUDA, but while running make train_gpt2fp32cu
I get this error:

OpenMP found, compiling with OpenMP support
nvcc found, including CUDA builds
/usr/bin/nvcc -O3 --use_fast_math train_gpt2_fp32.cu -lcublas -lcublasLt -o train_gpt2fp32cu
nvcc fatal : Path to libdevice library not specified
make: *** [Makefile:94: train_gpt2fp32cu] Error 1

Clearly the error is not in OpenMP, and I have CUDA installed as well; the installed CUDA toolkit is v10.1.243.
Anyone know how to resolve this?
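(For what it's worth, the "Path to libdevice library not specified" failure usually means nvcc cannot locate its own toolkit files, e.g. when /usr/bin/nvcc comes from a distro package that splits the toolkit across directories; pointing the build at a full toolkit install such as /usr/local/cuda/bin/nvcc is a common fix, as far as I know.)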

It might well be a case of the CUDA toolkit being too old; most people are using CUDA 12.2 or 12.3... and we're aiming for 12.4 here.

NB: It seems your CUDA toolchain version needs to match the kernel driver for this. Otherwise you get:

[CUDA ERROR] at file train_gpt2.cu:977:
the provided PTX was compiled with an unsupported toolchain.
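A quick way to spot this mismatch (standard tools, nothing llm.c-specific):

nvcc --version    # the toolkit version that compiles the PTX
nvidia-smi        # the driver version, and the highest CUDA version it supports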

Also, this doesn't even work with 12.1:

train_gpt2.cu(104): error: identifier "__ushort_as_bfloat16" is undefined
      __nv_bfloat162 add_val = (ptr_val & 0x3) ? __halves2bfloat162(__ushort_as_bfloat16(0), val)
                                                                    ^

train_gpt2.cu(104): error: identifier "__halves2bfloat162" is undefined
      __nv_bfloat162 add_val = (ptr_val & 0x3) ? __halves2bfloat162(__ushort_as_bfloat16(0), val)
commented

Does anyone know how to modify the code to make it work with older versions of CUDA? My GPU is quite old and doesn't support CUDA 12; it supports up to CUDA 10.

I also still can't make train_gpt2fp32cu:

*** No rule to make target 'train_gpt2fp32cu'. Stop.
-bash: ./train_gpt2fp32cu: No such file or directory

I tried a lot of things but still no luck. Any suggestions appreciated. Here is some of the output of nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:01:00.0 Off |                  Off |
| 61%   83C    P2             290W / 300W |   9311MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

As you can see from what I sent, I have CUDA Version: 12.3, but what you can't see is that CUDA Version 12.4 is also installed. I did not mean to install both at once, but the CUDA installation instructions are scattered all over the place and hence not clear. Maybe I should remove 12.3? Advice appreciated.

Yes, just run make.

train_gpt2fp32cu isn't a target. Read the Makefile.
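As for having both 12.3 and 12.4 installed: rather than removing one, you can point the build at a single toolkit by putting it first on the PATH (a sketch assuming the standard install locations; adjust to your paths):

export PATH=/usr/local/cuda-12.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH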

commented

I encountered a similar issue while compiling with CUDA 11.

/usr/local/cuda/bin/nvcc -O3 --use_fast_math train_gpt2.cu -lcublas -lcublasLt -o train_gpt2cu
train_gpt2.cu(105): error: identifier "__ushort_as_bfloat16" is undefined

train_gpt2.cu(105): error: identifier "__halves2bfloat162" is undefined

train_gpt2.cu(107): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (__nv_bfloat162 *, __nv_bfloat162)

train_gpt2.cu(407): error: no instance of overloaded function "__ldcs" matches the argument list
argument types are: (const floatX *)

train_gpt2.cu(408): error: no instance of overloaded function "__ldcs" matches the argument list

Next step:

Look at train_gpt2.cu at around line 105...

See if you can rewrite the function to use a different type, since these bf16 intrinsics don't seem to be defined on CUDA 10... so maybe you can make it work if you can create an atomicAdd that works without __ushort_as_bfloat16 and __halves2bfloat162... seems doable.

Not sure what __ldcs does... have a look? See if you can figure it out?

Try this at line 97:

#define cublasCheck(status) { cublasCheck((status), __FILE__, __LINE__); }
// GPU helper function for atomicAdd on smaller-than-32-bit types:
// widen the bf16 element to its enclosing 4-byte-aligned __nv_bfloat162
// and add zero to the other half, so a 32-bit hardware atomic can be used.
__device__ void atomicAddX(__nv_bfloat16* addr, __nv_bfloat16 val) {
    uintptr_t ptr_val = reinterpret_cast<uintptr_t>(addr);
    // round the address down to the enclosing aligned __nv_bfloat162
    __nv_bfloat162* ptr_bf16 = reinterpret_cast<__nv_bfloat162*>(ptr_val & ~uintptr_t(0x3));

    // prepare the value to add, setting the other half to zero; this replaces
    // the original one-liner built from __halves2bfloat162/__ushort_as_bfloat16:
    //   __nv_bfloat162 add_val = (ptr_val & 0x3) ? __halves2bfloat162(__ushort_as_bfloat16(0), val)
    //                                             : __halves2bfloat162(val, __ushort_as_bfloat16(0));
    __nv_bfloat162 add_val;
    if (ptr_val & 0x3) {
        // addr was the high (odd) element of the pair
        add_val.x = __float2bfloat16(0.0f);
        add_val.y = val;
    } else {
        // addr was the low (even) element of the pair
        add_val.x = val;
        add_val.y = __float2bfloat16(0.0f);
    }
    atomicAdd(ptr_bf16, add_val);
}
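(One caveat, as far as I can tell: this only removes the missing intrinsics. cuda_bf16.h and __nv_bfloat16 themselves only ship with CUDA 11.0+, and atomicAdd on __nv_bfloat162 needs compute capability 8.x, so on CUDA 10 the bf16 path would still have to be disabled entirely, e.g. by building the fp32 version instead.)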

As for __ldcs...

From https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html:

7.10. Read-Only Data Cache Load Function

The read-only data cache load function is only supported by devices of compute capability 5.0 and higher.

T __ldg(const T* address);

returns the data of type T located at address address, where T is char, signed char, short, int, long, long long, unsigned char, unsigned short, unsigned int, unsigned long, unsigned long long, char2, char4, short2, short4, int2, int4, longlong2, uchar2, uchar4, ushort2, ushort4, uint2, uint4, ulonglong2, float, float2, float4, double, or double2. With the cuda_fp16.h header included, T can be __half or __half2. Similarly, with the cuda_bf16.h header included, T can also be __nv_bfloat16 or __nv_bfloat162. The operation is cached in the read-only data cache (see [Global Memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-5-x)).

7.11. Load Functions Using Cache Hints

These load functions are only supported by devices of compute capability 5.0 and higher.

T __ldcg(const T* address);
T __ldca(const T* address);
T __ldcs(const T* address);
T __ldlu(const T* address);
T __ldcv(const T* address);

returns the data of type T located at address address, where T is char, signed char, short, int, long, long long, unsigned char, unsigned short, unsigned int, unsigned long, unsigned long long, char2, char4, short2, short4, int2, int4, longlong2, uchar2, uchar4, ushort2, ushort4, uint2, uint4, ulonglong2, float, float2, float4, double, or double2. With the cuda_fp16.h header included, T can be __half or __half2. Similarly, with the cuda_bf16.h header included, T can also be __nv_bfloat16 or __nv_bfloat162. The operation is using the corresponding cache operator (see [PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators)).

Looks like you could just replace all __ldcs with __ldg?
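If you'd rather keep both paths switchable, here is a minimal sketch; load_cs and the OLD_CUDA guard are hypothetical names of mine, not from llm.c, and the fallback is a plain load, which only loses the cache hint and stays valid on any toolkit:

template <typename T>
__device__ T load_cs(const T* address) {
#ifdef OLD_CUDA
    return *address;        // plain global load: no cache-operator hint, works on any toolkit
#else
    return __ldcs(address); // streaming load: hints that the data will not be reused
#endif
}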

EDIT: Rename the issue to "Error while running "make" for train_gpt2fp32cu on CUDA 10".

commented

@dagelf thanks! I updated CUDA to version 12.3.107, compiled successfully, and it runs normally.

Still awaiting confirmation that the above changes make it run on CUDA 10... which might be useful if there are any older Nvidia cards that don't work with CUDA 12... are there?
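(There are, as far as I know: CUDA 12 dropped Kepler support, so anything below compute capability 5.0, e.g. most GTX 600/700-series cards, cannot use CUDA 12 at all.)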

This issue should be named "Error while running "make" for train_gpt2fp32cu on CUDA 10".

commented

FYI, on Ubuntu 20.04 / CUDA 11.2 / cuDNN 8 / RTX 4090D 24G, I got this error when trying to run:

USE_CUDNN=1 make train_gpt2cu

The error prints as:

---------------------------------------------
✓ cuDNN found, will run with flash-attention
✓ OpenMP found
✓ OpenMPI found, OK to train with multiple GPUs
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc -O3 -t=0 --use_fast_math -DENABLE_CUDNN -DMULTI_GPU -DENABLE_BF16 train_gpt2.cu -lcublas -lcublasLt -lcudnn -L/usr/lib/x86_64-linux-gnu/openmpi/lib/  -I/root/cudnn-frontend/include -I/usr/lib/x86_64-linux-gnu/openmpi/include  -lmpi -lnccl -lcublas -lcublasLt -lcudnn -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -o train_gpt2cu
/root/cudnn-frontend/include/cudnn_frontend_utils.h(96): error: namespace "std" has no member "variant"
...

/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1470): error: identifier "is_inference" is undefined
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1482): error: identifier "attn_scale_value" is undefined
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1526): error: identifier "dropout_probability" is undefined
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1552): error: qualified name is not allowed
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1552): error: this declaration has no storage class or type specifier
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1552): error: expected a ";"
...
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1584): error: identifier "is_inference" is undefined
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1596): error: identifier "attn_scale_value" is undefined
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1616): error: qualified name is not allowed
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1616): error: this declaration has no storage class or type specifier

Error limit reached.
100 errors detected in the compilation of "train_gpt2.cu".
Compilation terminated.

From the above info, do I have to update CUDA to at least 12.3.107 and update my cuDNN version accordingly? Or is it OK to simply update CUDA and make train_gpt2cu without cuDNN support?
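(For what it's worth, std::variant is a C++17 feature, so this wall of cudnn-frontend errors usually means nvcc is building against an older C++ standard; nvcc has accepted -std=c++17 since CUDA 11.0. Adding it to the compile line, along the lines of the sketch below, might be worth a try before reinstalling anything; this is an observation about the error, not a tested fix.)

/usr/local/cuda/bin/nvcc -std=c++17 -O3 -t=0 --use_fast_math -DENABLE_CUDNN ... train_gpt2.cu ... -o train_gpt2cu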

Using the same versions of CUDA and cuDNN, but trying make without cuDNN support, I still got errors:

make train_gpt2cu
---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ OpenMPI found, OK to train with multiple GPUs
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc -O3 -t=0 --use_fast_math -DMULTI_GPU -DENABLE_BF16 train_gpt2.cu -lcublas -lcublasLt -L/usr/lib/x86_64-linux-gnu/openmpi/lib/  -I/usr/lib/x86_64-linux-gnu/openmpi/include  -lmpi -lnccl -lcublas -lcublasLt -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -o train_gpt2cu
train_gpt2.cu(212): error: identifier "__ushort_as_bfloat16" is undefined

train_gpt2.cu(212): error: identifier "__halves2bfloat162" is undefined

train_gpt2.cu(214): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (__nv_bfloat162 *, __nv_bfloat162)

train_gpt2.cu(253): error: no operator "+=" matches these operands
            operand types are: floatX += float

train_gpt2.cu(267): warning #20012-D: __device__ annotation is ignored on a function("Packed128") that is explicitly defaulted on its first declaration
...
train_gpt2.cu(1348): error: no operator "+=" matches these operands
            operand types are: floatX += floatX

train_gpt2.cu(82): warning #177-D: variable "ncclFloatN" was declared but never referenced

20 errors detected in the compilation of "train_gpt2.cu".

I'll try updating CUDA and will update here if its version is to blame.

FYI, on Ubuntu 20.04 / CUDA 11.8 / RTX 4090D 24G, I got the same error when trying to run:

USE_CUDNN=1 make train_gpt2cu

The build banner and the cudnn-frontend error output are identical to the CUDA 11.2 attempt above, and the same question stands: do I have to update CUDA to at least 12.3.107 and update cuDNN accordingly, or is it OK to simply update CUDA and build train_gpt2cu without cuDNN support?

Same error.

Does it work without USE_CUDNN?