karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from GitHub: https://github.com/karpathy/llm.c

Makefile incorrectly detects `nccl` as installed on Linux systems that have `libvncclient`

leiDnedyA opened this issue

OS: Ubuntu 22.04.5 LTS

Hi all, I was running the Makefile for the first time, but found that it was failing with this message:

---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✗ MPI not found
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/bin/nvcc --threads=0 -t=0 --use_fast_math -std=c++17 -O3 -DMULTI_GPU train_gpt2_fp32.cu -lcublas -lcublasLt -lnvidia-ml  -lnccl -o train_gpt2fp32cu
train_gpt2_fp32.cu(62): warning #550-D: variable "cublas_compute_type" was set but never used

/usr/bin/ld: cannot find -lnccl: No such file or directory
collect2: error: ld returned 1 exit status
make: *** [Makefile:277: train_gpt2fp32cu] Error 255

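It turns out the Makefile checks whether NCCL is installed by grepping the output of `dpkg -l` for the string `nccl`. This gives a false positive whenever `dpkg -l` prints any package line containing the substring `nccl`, such as `libvncclient1` in my case. The Makefile then adds `-lnccl` to the link flags even though the library is not installed, and the linker fails. Here's the actual code causing the issue: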

# Check if NCCL is available, include if so, for multi-GPU training
ifeq ($(NO_MULTI_GPU), 1)
  $(info → Multi-GPU (NCCL) is manually disabled)
else
  ifneq ($(OS), Windows_NT)
    # Detect if running on macOS or Linux
    ifeq ($(SHELL_UNAME), Darwin)
      $(info ✗ Multi-GPU on CUDA on Darwin is not supported, skipping NCCL support)
+     else ifeq ($(shell dpkg -l | grep -q nccl && echo "exists"), exists)
      $(info ✓ NCCL found, OK to train with multiple GPUs)
      NVCC_FLAGS += -DMULTI_GPU
      NVCC_LDLIBS += -lnccl
    else
      $(info ✗ NCCL is not found, disabling multi-GPU support)
      $(info ---> On Linux you can try install NCCL with `sudo apt install libnccl2 libnccl-dev`)
    endif
  endif
endif
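To make the false positive concrete, here is a minimal shell sketch that runs the same `grep -q nccl` test against a couple of made-up `dpkg -l` rows, next to one possible stricter test that only looks at the package-name column. The sample rows, their version numbers, and the function names `loose_check` and `strict_check` are illustrative assumptions, not output from my machine and not necessarily the fix that should go into the Makefile.

```shell
#!/bin/sh
# Made-up rows in the column layout of `dpkg -l` (status, name, version, arch, description).
# Package versions here are placeholders for illustration only.
sample_without_nccl='ii  libvncclient1:amd64  1.0  amd64  VNC client library
ii  libgomp1:amd64  1.0  amd64  GCC OpenMP support library'

sample_with_nccl='ii  libvncclient1:amd64  1.0  amd64  VNC client library
ii  libnccl2  1.0  amd64  NCCL runtime'

# Same test the Makefile uses: "nccl" anywhere in any line counts as a match.
loose_check() {
  printf '%s\n' "$1" | grep -q nccl && echo "exists" || echo "missing"
}

# Hypothetical stricter test: only installed packages ("ii") whose
# name column starts with "libnccl" count as a match.
strict_check() {
  printf '%s\n' "$1" \
    | awk '$1 == "ii" && $2 ~ /^libnccl/ { found = 1 } END { exit !found }' \
    && echo "exists" || echo "missing"
}

loose_result=$(loose_check "$sample_without_nccl")
strict_result=$(strict_check "$sample_without_nccl")
strict_with_nccl=$(strict_check "$sample_with_nccl")

echo "loose check, no NCCL installed:  $loose_result"      # prints "exists" (false positive)
echo "strict check, no NCCL installed: $strict_result"     # prints "missing"
echo "strict check, NCCL installed:    $strict_with_nccl"  # prints "exists"
```

On a real system the input would come from `dpkg -l` itself instead of the sample strings. Matching on the name column rather than the whole line also avoids hits on package descriptions that happen to contain `nccl`.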

If I have some free time, I think this would be a fun first issue and I'd be glad to contribute, but if anyone knows the fix off the top of their head, that would be nice as well!

I have seen the same issue on a fresh Ubuntu install (22.04.5 LTS). Glad that someone has already fixed it.