kyamagu / faiss-wheels

Unofficial faiss wheel builder

Reduce wheel package size for faiss-gpu CUDA 11.0 build

kyamagu opened this issue · comments

The CUDA 11.0 build in #56 bloats the wheel package size from 85.5 MB to 216.5 MB. We need to investigate how to reduce the file size.

One approach seems to be dropping architecture-specific binaries from the CUDA static libraries via nvprune, like this:

nvprune \
  -gencode arch=compute_60,code=sm_60 \
  -gencode arch=compute_70,code=sm_70 \
  -gencode arch=compute_75,code=sm_75 \
  -gencode arch=compute_80,code=sm_80 \
  -gencode arch=compute_80,code=compute_80 \
  -o /usr/local/cuda/lib64/libcublas_static_slim.a \
  /usr/local/cuda/lib64/libcublas_static.a

There are currently four static-library dependencies, and applying nvprune slightly reduces the binary size (see the sketch after this list).

  • libcublas_static.a
  • libcublasLt_static.a
  • libcudart_static.a
  • libculibos.a
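
As a sketch, the pruning step could be scripted over the archives; this is illustrative, assuming the same -gencode set as above. Note that nvprune only helps for archives that actually contain device code, so the two cuBLAS archives are the real targets.

import subprocess

# Architectures to keep; everything else is stripped from the fat binaries.
GENCODES = [
    "arch=compute_60,code=sm_60",
    "arch=compute_70,code=sm_70",
    "arch=compute_75,code=sm_75",
    "arch=compute_80,code=sm_80",
    "arch=compute_80,code=compute_80",
]
CUDA_LIB_DIR = "/usr/local/cuda/lib64"

# libcudart_static.a and libculibos.a are mostly host code, so pruning
# them saves little; the cuBLAS archives are where the size goes.
for lib in ["libcublas_static", "libcublasLt_static"]:
    cmd = ["nvprune"]
    for gencode in GENCODES:
        cmd += ["-gencode", gencode]
    cmd += ["-o", f"{CUDA_LIB_DIR}/{lib}_slim.a", f"{CUDA_LIB_DIR}/{lib}.a"]
    subprocess.run(cmd, check=True)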

For Python 3.9, the original size of _swigfaiss.cpython-39-x86_64-linux-gnu.so was 341 MB; applying nvprune to all the static libs brings it down to 310 MB. This is still huge.

The major problem is that CUDA 11.0 splits the cublasLt API into a separate static lib, which seems to significantly increase the final binary size. In CUDA 10.x, the cublasLt API lived in the single cublas static lib.

libcublasLt_static.a 224M
libcublas_static.a 82M
libcudart_static.a 910K
libculibos.a 31K

Strangely, faiss does not use the cublasLt API. But when -lcublasLt_static is omitted from the linker flags in setup.py, the following error appears on import. Why does that happen?

ImportError: /workspace/faiss-wheels/build/lib.linux-x86_64-3.9/faiss/_swigfaiss.cpython-39-x86_64-linux-gnu.so: undefined symbol: cublasLtMatrixTransformDescDestroy

OK, changing the order of the linker flags in setup.py seems to reduce the binary size.
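
Presumably libcublas_static.a itself references cublasLt symbols internally, which would explain the undefined symbol above: GNU ld scans static archives left to right and only searches an archive for symbols that are still undefined at that point. Placing -lcublasLt_static after -lcublas_static then both resolves the reference and extracts only the members cuBLAS actually needs. A minimal sketch of the ordering (illustrative, not the actual setup.py flags):

# Illustrative linker flags. With GNU ld, archive order matters:
# libcublas_static.a references cublasLt symbols, so -lcublasLt_static
# must come after -lcublas_static, and only the cublasLt members that
# cuBLAS actually needs are then pulled into the extension module.
EXTRA_LINK_ARGS = [
    "-L/usr/local/cuda/lib64",
    "-lcublas_static",
    "-lcublasLt_static",  # must follow -lcublas_static
    "-lcudart_static",
    "-lculibos",
]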

With CUDA 11.6, the resulting wheel grows further, to 345 MB on Linux. After nvprune, we get 276 MB. This is still not good, since the default PyPI size limit is 60 MB.

An alternative is to give up static linking and rely on dynamic linking. This would significantly reduce the wheel size, but it requires users to install the CUDA runtime libraries separately.

With the avx2 extension, the package is ~430 MB.

It seems there are CUDA runtime packages on PyPI.
https://pypi.org/project/nvidia-cuda-runtime-cu11/
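
If faiss linked these dynamically, the extension would need to find the libraries at import time, since pip installs them under site-packages rather than on the system library path. Below is a rough sketch of the usual preload trick; the module paths and sonames are assumptions based on the cu11 wheels, not verified here.

import ctypes
import importlib.util
import os

def _preload(package: str, soname: str) -> None:
    # Locate the nvidia-* package under site-packages and dlopen its shared
    # library with RTLD_GLOBAL, so that the faiss extension imported later
    # can resolve its undefined CUDA symbols against it.
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        raise ImportError(f"{package} is not installed")
    libdir = os.path.join(list(spec.submodule_search_locations)[0], "lib")
    ctypes.CDLL(os.path.join(libdir, soname), mode=ctypes.RTLD_GLOBAL)

# Assumed sonames; check the actual wheel contents before relying on them.
_preload("nvidia.cuda_runtime", "libcudart.so.11.0")
_preload("nvidia.cublas", "libcublas.so.11")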

Hi!

Did you consider placing the package on the GitLab PyPI index or publishing it to Docker Hub as an image?

Ping me if you need help.

@theLastOfCats You can manually download packages from the release page.

Hi @kyamagu!

For your reference: by switching from static to dynamic linking of CUDA, the wheel size has been reduced to 63 MB.
The extension is dynamically linked against the shared libraries of the nvidia-cublas-cu12 and nvidia-cuda-runtime-cu12 packages, which are published on PyPI.

It seems possible to reduce the wheel size to less than 60 MB by either narrowing down the target architectures or also switching from static to dynamic linking of OpenBLAS.
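
For completeness, here is a hypothetical setup.py fragment showing how those runtime wheels could be declared as ordinary dependencies, so that pip installs the shared libraries alongside faiss (the package names are real, the version pins are illustrative):

# Hypothetical fragment: make `pip install faiss-gpu` also pull in the CUDA
# shared libraries that the extension is dynamically linked against.
GPU_REQUIREMENTS = [
    "nvidia-cuda-runtime-cu12>=12.1",
    "nvidia-cublas-cu12>=12.1",
]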

Fork Repository: https://github.com/Di-Is/faiss-wheels/tree/pypi-cuda

Build Script
# Test CMD
CPU_TEST_CMD="pytest {project}/faiss/tests && pytest -s {project}/faiss/tests/torch_test_contrib.py"
GPU_TEST_CMD="cp {project}/faiss/tests/common_faiss_tests.py {project}/faiss/faiss/gpu/test/ && pytest {project}/faiss/faiss/gpu/test/test_*.py && pytest {project}/faiss/faiss/gpu/test/torch_*.py"

# Common Setup
export CIBW_BEFORE_ALL="bash scripts/build_Linux.sh"
export CIBW_TEST_COMMAND="${CPU_TEST_CMD}"
export CIBW_BEFORE_TEST_LINUX="pip install torch --index-url https://download.pytorch.org/whl/cpu"
export CIBW_ENVIRONMENT_LINUX="FAISS_OPT_LEVEL=${FAISS_OPT_LEVEL:-generic} BUILD_PARALLELISM=${BUILD_PARALLELISM:-3} CUDA_VERSION=12.1"
export CIBW_DEBUG_KEEP_CONTAINER=TRUE

if [ "$FAISS_ENABLE_GPU" = "ON" ]; then
    if [ "$CONTAINER_GPU_ACCESS" = "ON" ]; then
        export CIBW_TEST_COMMAND="${CIBW_TEST_COMMAND} && ${GPU_TEST_CMD}"
        export CIBW_CONTAINER_ENGINE="docker; create_args: --gpus all"
        export -n CIBW_BEFORE_TEST_LINUX
    fi
    export CIBW_ENVIRONMENT_LINUX="${CIBW_ENVIRONMENT_LINUX} FAISS_ENABLE_GPU=ON"
    # --exclude keeps the CUDA shared libs out of the wheel; at runtime they
    # are provided by the nvidia-*-cu12 packages from PyPI instead.
    export CIBW_REPAIR_WHEEL_COMMAND="auditwheel repair -w {dest_dir} {wheel} --exclude libcublas.so.12 --exclude libcublasLt.so.12 --exclude libcudart.so.12"
else
    export CIBW_ENVIRONMENT_LINUX="${CIBW_ENVIRONMENT_LINUX} FAISS_ENABLE_GPU=OFF"
    export CIBW_REPAIR_WHEEL_COMMAND="auditwheel repair -w {dest_dir} {wheel}"
fi

python3 -m cibuildwheel --output-dir wheelhouse --platform linux

@Di-Is CUDA backward compatibility is complicated, and a PyPI release should not expect any external dependencies other than the few the CPython binary itself links against. https://github.com/pypa/manylinux

You can build a source package for your environment, but that wheel will not be compatible with other environments.

CUDA backward compatibility is complicated,

I believe that installing the appropriate Nvidia driver is not a matter of package management but rather part of system setup, and the responsibility for it lies with the user.
(This is also true for other package managers, e.g., Conda.)
Fortunately, installing the latest driver will work with any version of CUDA and the binaries linked against it.

the PyPI release should not expect any external dependency other than a few linked to CPython binary.

It is correct that wheel files should be self-contained.
However, this matter has been discussed in auditwheel issue #368, and a feature that relaxes the restriction (the --exclude option used above) has been merged into auditwheel.

You can build a source package for your environment, but that wheel will not be compatible with other environments.

If the following conditions are met, Faiss installed from the created wheel should work properly.

  1. Run Faiss in an environment with an Nvidia driver installed that is compatible with the CUDA version in use.
  2. Do not load multiple versions of the CUDA shared libraries in a single process (to avoid troublesome issues like symbol conflicts).

As for 1, it is the user's responsibility, as mentioned earlier. As for 2, I believe the system/package configuration should be reviewed.

@Di-Is

However, regarding this matter, it has been discussed in an auditwheel issue pypa/auditwheel#368 (comment), and a feature to relax the restrictions has been merged into auditwheel.

This is not a matter of auditwheel but a more fundamental issue in Python dependency management. Under the current PyPI policy, managing GPU dependencies is hard unless there is a standardized toolchain to build and test wheels for combinations of compiler / CUDA / driver / CPU arch / OS / Python versions and, more recently, for compatibility with other packages like PyTorch. At the least, the current PyPI distribution is not designed for different CUDA runtimes. If we ignore that and ship wheels for a very specific runtime configuration, we end up seeing a flood of error reports both here and upstream, which is obviously not a good thing. Conda is different from PyPI in that conda does manage runtime environments (e.g., CUDA).

My current approach is to at least keep the source distribution working in any custom environment. Right now, I can't spend time on a GPU binary distribution, but you could try designing a build and test matrix to resolve the issues with the above configurations.