Aluminum compilation error
mmathys opened this issue
Describe the bug
I get a compilation error related to NCCL in the Aluminum third-party module. What is a possible fix?
I installed the dependencies according to the tutorial (see the exact commands below). I also initialized the Bagua core module (this step is not in the documentation; Yafen advised it).
Environment
- Your operating system and version: Ubuntu 20.04
- Your python version: 3.8.10
- Your PyTorch version: not applicable
- How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?: Preinstalled in AWS AMI
- Have you tried using latest bagua master (python3 -m pip install --pre bagua)?: not applicable
Reproducing
Starting point: Ubuntu 20.04 with CUDA 11.3 (nvcc --version reports 11.3)
Run setup script:
#!/bin/bash
cd /home/ubuntu
source .bashrc
set -eux
exit_and_error() {
    echo "Auto installation is supported only on Ubuntu (18.04, 20.04) or CentOS (7, 8), abort."
    exit 1
}
check_os_version() {
    OS_NAME=$(grep ^NAME /etc/os-release | awk -F'"' '{print $2}')
    VERSION_ID=$(grep ^VERSION_ID /etc/os-release | awk -F'"' '{print $2}')
    echo "Current OS is ${OS_NAME}, Version is ${VERSION_ID}"
    if [ "$OS_NAME" == "Ubuntu" ]; then
        if [[ $VERSION_ID != @("18.04"|"20.04") ]]; then
            exit_and_error
        fi
    elif [ "$OS_NAME" == "CentOS Linux" ]; then
        if [[ $VERSION_ID != @("7"|"8") ]]; then
            exit_and_error
        fi
    else
        exit_and_error
    fi
}
# upgrade to python3.8
confirm() {
    # call with a prompt string or use a default;
    # returns 0 on "y"/"yes" and 1 on anything else
    echo "Your Python version is $(python3 -V), but Bagua requires Python version >= 3.7."
    read -r -p "${1:-Do you want to upgrade Python? [y/N]} " response
    case "$response" in
    [yY][eE][sS] | [yY])
        return 0
        ;;
    *)
        return 1
        ;;
    esac
}
upgrade_python() {
    if [ "$OS_NAME" == "Ubuntu" ]; then
        sudo apt-get install -y python3.8 python3.8-distutils python3.8-dev
    elif [ "$OS_NAME" == "CentOS Linux" ]; then
        # download and unpack into /var/tmp, then build from there
        mkdir -p /var/tmp && wget -q -nc --no-check-certificate -P /var/tmp https://www.python.org/ftp/python/3.8.12/Python-3.8.12.tgz &&
            tar xvf /var/tmp/Python-3.8.12.tgz -C /var/tmp &&
            cd /var/tmp/Python-3.8.12 && ./configure --enable-optimizations --prefix=/usr && make altinstall &&
            rm -rf /var/tmp/Python-3.8.12.tgz /var/tmp/Python-3.8.12 && cd -
    fi
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
}
check_python_version() {
    PYTHON_VERSION_OK=$(python3 -c 'import sys; print(int(sys.version_info >= (3, 7)))')
    if [[ $PYTHON_VERSION_OK -eq 0 ]]; then
        # an explicit if keeps a declined prompt from aborting the script under set -e
        if confirm; then
            upgrade_python
        fi
    fi
}
check_os_version
# install necessary packages
if [ "$OS_NAME" == "Ubuntu" ]; then
    # remove cmake
    sudo apt remove --purge --auto-remove -y cmake
    # install python3-pip
    sudo apt-get update && sudo DEBIAN_FRONTEND=noninteractive apt-get install -y curl software-properties-common wget
    sudo apt-get update && sudo DEBIAN_FRONTEND=noninteractive apt-get install -y python3-pip zlib1g-dev libssl-dev
elif [ "$OS_NAME" == "CentOS Linux" ]; then
    if [ "$VERSION_ID" == "7" ]; then
        yum remove cmake3 -y
    elif [ "$VERSION_ID" == "8" ]; then
        yum remove cmake -y
    fi
    yum install -y wget curl bzip2 perl zlib-devel openssl-devel
fi
check_python_version
# install some utils
python3 -m pip install --upgrade pip -i https://pypi.org/simple
python3 -m pip install setuptools-rust colorama tqdm wheel -i https://pypi.org/simple
# install cmake 3.22.1
mkdir -p /var/tmp && wget -q -nc --no-check-certificate -P /var/tmp https://github.com/Kitware/CMake/releases/download/v3.22.1/cmake-3.22.1-linux-x86_64.sh &&
cd /var/tmp && chmod +x cmake-3.22.1-linux-x86_64.sh &&
sudo sh cmake-3.22.1-linux-x86_64.sh --prefix=/usr --skip-license &&
rm -rf /var/tmp/cmake-3.22.1-linux-x86_64.sh && cd -
# install hwloc 2.7.0
mkdir -p /var/tmp && wget -q -nc --no-check-certificate -P /var/tmp https://download.open-mpi.org/release/hwloc/v2.7/hwloc-2.7.0.tar.bz2 &&
tar -x -f /var/tmp/hwloc-2.7.0.tar.bz2 -C /var/tmp -j &&
cd /var/tmp/hwloc-2.7.0 && ./configure &&
make -j$(nproc) &&
sudo make -j$(nproc) install &&
rm -rf /var/tmp/hwloc* && cd -
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH:-}
export LIBRARY_PATH=/usr/local/lib
# install openmpi 4.1.2
mkdir -p /var/tmp && wget -q -nc --no-check-certificate -P /var/tmp https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.bz2 &&
tar -x -f /var/tmp/openmpi-4.1.2.tar.bz2 -C /var/tmp -j &&
cd /var/tmp/openmpi-4.1.2 && ./configure --disable-getpwuid --disable-oshmem --enable-fortran --enable-mca-no-build=btl-uct --enable-orterun-prefix-by-default --with-cuda --without-verbs &&
make -j$(nproc) &&
sudo make -j$(nproc) install &&
rm -rf /var/tmp/openmpi-4.1.2 /var/tmp/openmpi-4.1.2.tar.bz2 && cd -
# install rust
if ! command -v cargo &>/dev/null; then
    curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain stable -y
    export PATH="$HOME/.cargo/bin:$PATH"
fi
# clone repository
git clone --recurse-submodules https://github.com/BaguaSys/bagua.git
# install bagua core dependencies
cd bagua/bagua_core
python3 bagua_install_deps.py
# set required flags
export BAGUA_NO_INSTALL_DEPS=1
export LIBRARY_PATH="$HOME/.local/share/bagua/nccl/lib:$LIBRARY_PATH"
cd ..
# execute setup
python3 setup.py install --user
This gives me the following error:
Running `/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-565ed2c99881c59f/build-script-build`
The following warnings were emitted during compilation:
warning: nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
warning: nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
error: failed to run custom build command for `bagua-core-internal v0.1.2 (/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal)`
Caused by:
process didn't exit successfully: `/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-565ed2c99881c59f/build-script-build` (exit status: 101)
--- stdout
TARGET = Some("x86_64-unknown-linux-gnu")
OPT_LEVEL = Some("3")
HOST = Some("x86_64-unknown-linux-gnu")
CXX_x86_64-unknown-linux-gnu = None
CXX_x86_64_unknown_linux_gnu = None
HOST_CXX = None
CXX = None
NVCC_x86_64-unknown-linux-gnu = None
NVCC_x86_64_unknown_linux_gnu = None
HOST_NVCC = None
NVCC = None
CXXFLAGS_x86_64-unknown-linux-gnu = None
CXXFLAGS_x86_64_unknown_linux_gnu = None
HOST_CXXFLAGS = None
CXXFLAGS = None
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("false")
CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
running: "nvcc" "-ccbin=c++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-m64" "-I" "cpp/include" "-I" "third_party/cub-1.8.0" "-I" "/home/ubuntu/.local/share/bagua/nccl/include" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-std=c++14" "-cudart=shared" "-gencode" "arch=compute_35,code=sm_35" "-gencode" "arch=compute_37,code=sm_37" "-gencode" "arch=compute_50,code=sm_50" "-gencode" "arch=compute_52,code=sm_52" "-gencode" "arch=compute_53,code=sm_53" "-gencode" "arch=compute_60,code=sm_60" "-gencode" "arch=compute_61,code=sm_61" "-gencode" "arch=compute_62,code=sm_62" "-gencode" "arch=compute_70,code=sm_70" "-gencode" "arch=compute_72,code=sm_72" "-gencode" "arch=compute_75,code=sm_75" "-gencode" "arch=compute_80,code=sm_80" "-gencode" "arch=compute_86,code=sm_86" "-o" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/kernels/bagua_kernels.o" "-c" "kernels/bagua_kernels.cu"
cargo:warning=nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
exit status: 0
AR_x86_64-unknown-linux-gnu = None
AR_x86_64_unknown_linux_gnu = None
HOST_AR = None
AR = None
running: "ar" "cq" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/libbagua_kernels.a" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/kernels/bagua_kernels.o"
exit status: 0
running: "nvcc" "-ccbin=c++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-m64" "-I" "cpp/include" "-I" "third_party/cub-1.8.0" "-I" "/home/ubuntu/.local/share/bagua/nccl/include" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-std=c++14" "-cudart=shared" "-gencode" "arch=compute_35,code=sm_35" "-gencode" "arch=compute_37,code=sm_37" "-gencode" "arch=compute_50,code=sm_50" "-gencode" "arch=compute_52,code=sm_52" "-gencode" "arch=compute_53,code=sm_53" "-gencode" "arch=compute_60,code=sm_60" "-gencode" "arch=compute_61,code=sm_61" "-gencode" "arch=compute_62,code=sm_62" "-gencode" "arch=compute_70,code=sm_70" "-gencode" "arch=compute_72,code=sm_72" "-gencode" "arch=compute_75,code=sm_75" "-gencode" "arch=compute_80,code=sm_80" "-gencode" "arch=compute_86,code=sm_86" "--device-link" "-o" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/bagua_kernels_dlink.o" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/libbagua_kernels.a"
cargo:warning=nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
exit status: 0
running: "ar" "cq" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/libbagua_kernels.a" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/bagua_kernels_dlink.o"
exit status: 0
running: "ar" "s" "/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out/libbagua_kernels.a"
exit status: 0
cargo:rustc-link-lib=static=bagua_kernels
cargo:rustc-link-search=native=/home/ubuntu/bagua/rust/bagua-core/target/release/build/bagua-core-internal-1d332a94b72bf288/out
CXXSTDLIB_x86_64-unknown-linux-gnu = None
CXXSTDLIB_x86_64_unknown_linux_gnu = None
HOST_CXXSTDLIB = None
CXXSTDLIB = None
cargo:rustc-link-lib=stdc++
cargo:rustc-link-search=native=/usr/local/cuda/bin/../targets/x86_64-linux/lib
cargo:rustc-link-lib=cudart_static
CMAKE_TOOLCHAIN_FILE_x86_64-unknown-linux-gnu = None
CMAKE_TOOLCHAIN_FILE_x86_64_unknown_linux_gnu = None
HOST_CMAKE_TOOLCHAIN_FILE = None
CMAKE_TOOLCHAIN_FILE = None
CMAKE_GENERATOR_x86_64-unknown-linux-gnu = None
CMAKE_GENERATOR_x86_64_unknown_linux_gnu = None
HOST_CMAKE_GENERATOR = None
CMAKE_GENERATOR = None
CMAKE_PREFIX_PATH_x86_64-unknown-linux-gnu = None
CMAKE_PREFIX_PATH_x86_64_unknown_linux_gnu = None
HOST_CMAKE_PREFIX_PATH = None
CMAKE_PREFIX_PATH = None
CMAKE_x86_64-unknown-linux-gnu = None
CMAKE_x86_64_unknown_linux_gnu = None
HOST_CMAKE = None
CMAKE = None
running: "cmake" "/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/third_party/Aluminum" "-DCMAKE_CXX_STANDARD=17" "-DALUMINUM_ENABLE_NCCL=YES" "-DCUB_INCLUDE_PATH=/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/third_party/cub-1.8.0" "-DNCCL_LIBRARY=/home/ubuntu/.local/share/bagua/nccl/lib/libnccl.so" "-DNCCL_INCLUDE_PATH=/home/ubuntu/.local/share/bagua/nccl/include" "-DBUILD_SHARED_LIBS=off" "-DCMAKE_INSTALL_PREFIX=/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/../../../bagua_core/.data" "-DCMAKE_C_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_C_COMPILER=/usr/bin/cc" "-DCMAKE_CXX_FLAGS= -std=c++17 -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_CXX_COMPILER=/usr/bin/c++" "-DCMAKE_ASM_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_ASM_COMPILER=/usr/bin/cc" "-DCMAKE_BUILD_TYPE=Release"
--- stderr
thread 'main' panicked at '
failed to execute command: No such file or directory (os error 2)
is `cmake` not installed?
build script failed, must exit now', /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/cmake-0.1.48/src/lib.rs:975:5
stack backtrace:
0: rust_begin_unwind
at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
1: core::panicking::panic_fmt
at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
2: cmake::fail
3: cmake::run
4: cmake::Config::build
5: build_script_build::main
6: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
error: cargo failed with code: 101
CMake is actually installed (cmake --version works). Have a look at the command:
"cmake" "/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/third_party/Aluminum" "-DCMAKE_CXX_STANDARD=17" "-DALUMINUM_ENABLE_NCCL=YES" "-DCUB_INCLUDE_PATH=/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/third_party/cub-1.8.0" "-DNCCL_LIBRARY=/home/ubuntu/.local/share/bagua/nccl/lib/libnccl.so" "-DNCCL_INCLUDE_PATH=/home/ubuntu/.local/share/bagua/nccl/include" "-DBUILD_SHARED_LIBS=off" "-DCMAKE_INSTALL_PREFIX=/home/ubuntu/bagua/rust/bagua-core/bagua-core-internal/../../../bagua_core/.data" "-DCMAKE_C_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_C_COMPILER=/usr/bin/cc" "-DCMAKE_CXX_FLAGS= -std=c++17 -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_CXX_COMPILER=/usr/bin/c++" "-DCMAKE_ASM_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_ASM_COMPILER=/usr/bin/cc" "-DCMAKE_BUILD_TYPE=Release"
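One way to narrow this down is to compare where `cmake` resolves in the interactive shell with what a process started from Python sees, since `setup.py` drives the cargo build. The sketch below is only a diagnostic under that assumption; `report_tool` is a hypothetical helper name and not part of Bagua.

```shell
# Hypothetical helper: print where a tool resolves on PATH, or say that it does not.
report_tool() {
    tool="$1"
    if path=$(command -v "$tool" 2>/dev/null); then
        echo "$tool -> $path"
    else
        echo "$tool not found on PATH"
        return 1
    fi
}

# What the current shell resolves.
report_tool cmake || true

# What a child process of Python resolves (skipped if python3 is missing).
if command -v python3 >/dev/null 2>&1; then
    python3 -c 'import shutil; print("python sees cmake at:", shutil.which("cmake"))'
fi
```

If the two results differ, or the second one prints `None`, the build is probably running with a different PATH than the shell in which `cmake --version` succeeded.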
Please let me know if this error is reproducible. I'd appreciate any tips on how to fix this.
Thanks,
Max
Can you find out where your cmake is installed? You may need to add its directory to the PATH environment variable. For example, if the cmake binary is in /usr/local/bin/, run the following command to add that directory to PATH:
export PATH=/usr/local/bin:$PATH
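The suggestion above can be sketched as a small helper that avoids adding the same directory twice when the command is re-run. `prepend_to_path` is a hypothetical name, and `/usr/local/bin` is only the example directory from the reply; substitute the directory that actually contains your cmake binary.

```shell
# Hypothetical helper: prepend a directory to PATH unless it is already listed.
prepend_to_path() {
    dir="$1"
    case ":$PATH:" in
        *":$dir:"*) ;;              # already present, leave PATH unchanged
        *) PATH="$dir:$PATH" ;;
    esac
    export PATH
}

# Example directory taken from the reply above.
prepend_to_path /usr/local/bin

# Verify that cmake now resolves before re-running the build.
command -v cmake || echo "cmake still not found on PATH"
```

To make the change survive new shells, the same `export` line can be appended to `~/.bashrc`; whether that is needed depends on how the build is launched.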