kyamagu / faiss-wheels

Unofficial faiss wheel builder

Reduce wheel package size for faiss-gpu CUDA 11.0 build

kyamagu opened this issue · comments

The CUDA 11.0 build in #56 bloats the wheel package size from 85.5 MB to 216.5 MB. We need to investigate how to reduce the file size.

One approach seems to be dropping architecture-specific binaries from the CUDA static libraries via nvprune, like this:

nvprune \
  -gencode arch=compute_60,code=sm_60 \
  -gencode arch=compute_70,code=sm_70 \
  -gencode arch=compute_75,code=sm_75 \
  -gencode arch=compute_80,code=sm_80 \
  -gencode arch=compute_80,code=compute_80 \
  -o /usr/local/cuda/lib64/libcublas_static_slim.a \
  /usr/local/cuda/lib64/libcublas_static.a

There are currently four static-library dependencies, and applying nvprune slightly reduces the binary size (see the sketch after this list).

  • libcublas_static.a
  • libcublasLt_static.a
  • libcudart_static.a
  • libculibos.a
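
As a sketch, the pruning step could be scripted over the archives; this is illustrative, assuming the same -gencode set as above. Note that nvprune only helps for archives that actually contain device code, so the two cuBLAS archives are the real targets.

import subprocess

# Architectures to keep; everything else is stripped from the fat binaries.
GENCODES = [
    "arch=compute_60,code=sm_60",
    "arch=compute_70,code=sm_70",
    "arch=compute_75,code=sm_75",
    "arch=compute_80,code=sm_80",
    "arch=compute_80,code=compute_80",
]
CUDA_LIB_DIR = "/usr/local/cuda/lib64"

# libcudart_static.a and libculibos.a are mostly host code, so pruning
# them saves little; the cuBLAS archives are where the size goes.
for lib in ["libcublas_static", "libcublasLt_static"]:
    cmd = ["nvprune"]
    for gencode in GENCODES:
        cmd += ["-gencode", gencode]
    cmd += ["-o", f"{CUDA_LIB_DIR}/{lib}_slim.a", f"{CUDA_LIB_DIR}/{lib}.a"]
    subprocess.run(cmd, check=True)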

For Python 3.9, the original size of _swigfaiss.cpython-39-x86_64-linux-gnu.so was 341 MB; applying nvprune to all the static libs brings it down to 310 MB. This is still huge.

The major problem is that CUDA 11.0 splits the cublasLt API into a separate static lib, which seems to significantly increase the final binary size. In CUDA 10.x, the cublasLt API lived in the single cublas static lib.

libcublasLt_static.a 224M
libcublas_static.a 82M
libcudart_static.a 910K
libculibos.a 31K

Strangely, faiss does not use the cublasLt API. But when -lcublasLt_static is omitted from the linker flags in setup.py, the following error appears on import. Why does that happen?

ImportError: /workspace/faiss-wheels/build/lib.linux-x86_64-3.9/faiss/_swigfaiss.cpython-39-x86_64-linux-gnu.so: undefined symbol: cublasLtMatrixTransformDescDestroy

OK, changing the order of the linker flags in setup.py seems to reduce the binary size.
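
Presumably libcublas_static.a itself references cublasLt symbols internally, which would explain the undefined symbol above: GNU ld scans static archives left to right and only searches an archive for symbols that are still undefined at that point. Placing -lcublasLt_static after -lcublas_static then both resolves the reference and extracts only the members cuBLAS actually needs. A minimal sketch of the ordering (illustrative, not the actual setup.py flags):

# Illustrative linker flags. With GNU ld, archive order matters:
# libcublas_static.a references cublasLt symbols, so -lcublasLt_static
# must come after -lcublas_static, and only the cublasLt members that
# cuBLAS actually needs are then pulled into the extension module.
EXTRA_LINK_ARGS = [
    "-L/usr/local/cuda/lib64",
    "-lcublas_static",
    "-lcublasLt_static",  # must follow -lcublas_static
    "-lcudart_static",
    "-lculibos",
]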

With CUDA 11.6, the resulting wheel grows further, to 345 MB on Linux. After nvprune, we get 276 MB. This is still not good, since the default PyPI size limit is 60 MB.

An alternative is to give up static linking and rely on dynamic linking. This would significantly reduce the wheel size, but it requires users to install the CUDA runtime libraries separately.

With the avx2 extension, the package is ~430 MB.

It seems there are CUDA runtime packages on PyPI.
https://pypi.org/project/nvidia-cuda-runtime-cu11/
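
If faiss linked these dynamically, the extension would need to find the libraries at import time, since pip installs them under site-packages rather than on the system library path. Below is a rough sketch of the usual preload trick; the module paths and sonames are assumptions based on the cu11 wheels, not verified here.

import ctypes
import importlib.util
import os

def _preload(package: str, soname: str) -> None:
    # Locate the nvidia-* package under site-packages and dlopen its shared
    # library with RTLD_GLOBAL, so that the faiss extension imported later
    # can resolve its undefined CUDA symbols against it.
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        raise ImportError(f"{package} is not installed")
    libdir = os.path.join(list(spec.submodule_search_locations)[0], "lib")
    ctypes.CDLL(os.path.join(libdir, soname), mode=ctypes.RTLD_GLOBAL)

# Assumed sonames; check the actual wheel contents before relying on them.
_preload("nvidia.cuda_runtime", "libcudart.so.11.0")
_preload("nvidia.cublas", "libcublas.so.11")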

Hi!

Did you consider placing the package on the GitLab PyPI index or publishing it to Docker Hub as an image?

Ping me if you need help.

@theLastOfCats You can manually download packages from the release page.

Hi @kyamagu!

For your reference: by switching from static to dynamic linking of CUDA, the wheel size has been reduced to 63 MB.
The extension is dynamically linked against the shared libraries of the nvidia-cublas-cu12 and nvidia-cuda-runtime-cu12 packages, which are published on PyPI.

It seems possible to reduce the wheel size to less than 60 MB by either narrowing down the target architectures or also switching from static to dynamic linking of OpenBLAS.
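
For completeness, here is a hypothetical setup.py fragment showing how those runtime wheels could be declared as ordinary dependencies, so that pip installs the shared libraries alongside faiss (the package names are real, the version pins are illustrative):

# Hypothetical fragment: make `pip install faiss-gpu` also pull in the CUDA
# shared libraries that the extension is dynamically linked against.
GPU_REQUIREMENTS = [
    "nvidia-cuda-runtime-cu12>=12.1",
    "nvidia-cublas-cu12>=12.1",
]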

Fork Repository: https://github.com/Di-Is/faiss-wheels/tree/pypi-cuda

Build Script
# Test CMD
CPU_TEST_CMD="pytest {project}/faiss/tests && pytest -s {project}/faiss/tests/torch_test_contrib.py"
GPU_TEST_CMD="cp {project}/faiss/tests/common_faiss_tests.py {project}/faiss/faiss/gpu/test/ && pytest {project}/faiss/faiss/gpu/test/test_*.py && pytest {project}/faiss/faiss/gpu/test/torch_*.py"

# Common Setup
export CIBW_BEFORE_ALL="bash scripts/build_Linux.sh"
export CIBW_TEST_COMMAND="${CPU_TEST_CMD}"
export CIBW_BEFORE_TEST_LINUX="pip install torch --index-url https://download.pytorch.org/whl/cpu"
export CIBW_ENVIRONMENT_LINUX="FAISS_OPT_LEVEL=${FAISS_OPT_LEVEL:-generic} BUILD_PARALLELISM=${BUILD_PARALLELISM:-3} CUDA_VERSION=12.1"
export CIBW_DEBUG_KEEP_CONTAINER=TRUE

if [ "$FAISS_ENABLE_GPU" = "ON" ]; then
    if [ "$CONTAINER_GPU_ACCESS" = "ON" ]; then
        export CIBW_TEST_COMMAND="${CIBW_TEST_COMMAND} && ${GPU_TEST_CMD}"
        export CIBW_CONTAINER_ENGINE="docker; create_args: --gpus all"
        export -n CIBW_BEFORE_TEST_LINUX
    fi
    export CIBW_ENVIRONMENT_LINUX="${CIBW_ENVIRONMENT_LINUX} FAISS_ENABLE_GPU=ON"
    # --exclude keeps the CUDA shared libs out of the wheel; at runtime they
    # are provided by the nvidia-*-cu12 packages from PyPI instead.
    export CIBW_REPAIR_WHEEL_COMMAND="auditwheel repair -w {dest_dir} {wheel} --exclude libcublas.so.12 --exclude libcublasLt.so.12 --exclude libcudart.so.12"
else
    export CIBW_ENVIRONMENT_LINUX="${CIBW_ENVIRONMENT_LINUX} FAISS_ENABLE_GPU=OFF"
    export CIBW_REPAIR_WHEEL_COMMAND="auditwheel repair -w {dest_dir} {wheel}"
fi

python3 -m cibuildwheel --output-dir wheelhouse --platform linux

@Di-Is CUDA backward compatibility is complicated, and a PyPI release should not expect any external dependencies other than the few the CPython binary itself links against. https://github.com/pypa/manylinux

You can build a source package for your environment, but that wheel will not be compatible with other environments.

CUDA backward compatibility is complicated,

I believe that installing the appropriate Nvidia driver is not a matter of package management but rather part of system setup, and the responsibility for it lies with the user.
(This is also true for other package managers, e.g., Conda.)
Fortunately, installing the latest driver will work with any version of CUDA and the binaries linked against it.

the PyPI release should not expect any external dependency other than a few linked to CPython binary.

It is correct that wheel files should be self-contained.
However, this matter has been discussed in auditwheel issue #368, and a feature that relaxes the restriction (the --exclude option used above) has been merged into auditwheel.

You can build a source package for your environment, but that wheel will not be compatible with other environments.

If the following conditions are met, Faiss installed from the created wheel should work properly.

  1. Run Faiss in an environment with an Nvidia driver installed that is compatible with the CUDA version in use.
  2. Do not load multiple versions of the CUDA shared libraries in a single process (to avoid troublesome issues like symbol conflicts).

As for 1, it is the user's responsibility, as mentioned earlier. As for 2, I believe the system/package configuration should be reviewed.

@Di-Is

However, regarding this matter, it has been discussed in an auditwheel issue pypa/auditwheel#368 (comment), and a feature to relax the restrictions has been merged into auditwheel.

This is not a matter of auditwheel but a more fundamental issue in Python dependency management. Under the current PyPI policy, managing GPU dependencies is hard unless there is a standardized toolchain to build and test wheels for combinations of compiler / CUDA / driver / CPU arch / OS / Python versions and, more recently, for compatibility with other packages like PyTorch. At the least, the current PyPI distribution is not designed for different CUDA runtimes. If we ignore that and ship wheels for a very specific runtime configuration, we end up seeing a flood of error reports both here and upstream, which is obviously not a good thing. Conda is different from PyPI in that conda does manage runtime environments (e.g., CUDA).

My current approach is to at least keep the source distribution working in any custom environment. Right now, I can't spend time on a GPU binary distribution, but you could try designing a build and test matrix to resolve the issues with the above configurations.