Faiss training on GPU crashes because the number of IVF centroids changes in the middle of training
JeanBaptiste-dlb opened this issue
Summary
Hello and thanks in advance for the help.
I encountered a bug in faiss-gpu 1.7.2 while training an index on the GPU.
At some point during training, the number of centroids of the inverted file index changes, which leads to a matrix multiplication error and terminates the program.
Platform
OS: Ubuntu 20.04.5 LTS x86_64
GPU: NVIDIA 01:00.0 NVIDIA Corporation Device 2203 (RTX 4090)
Driver Version: 515.65.01 CUDA Version: 11.7
Faiss version: 1.7.2
Installed from: poetry (PyPI)
Faiss compilation options: unknown
Running on:
- [ ] CPU
- [x] GPU
Interface:
- [ ] C++
- [x] Python
Reproduction instructions
https://gist.github.com/JeanBaptiste-dlb/a3aa1f93e2b247f61a9a83e5dfc0fb55
Logs:
WARNING clustering 600 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 600 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 600 points to 512 centroids: please provide at least 19968 training points
Faiss assertion 'err == CUBLAS_STATUS_SUCCESS' failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<float, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<IndexType, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with AT = float; BT = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at /project/faiss/faiss/gpu/utils/MatrixMult-inl.cuh:265; details: cublas failed (13): (512, 64) x (512, 64)' = (512, 512) gemm params m 512 n 512 k 64 trA T trB N lda 64 ldb 64 ldc 512
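For reference, the thresholds in the warnings above match a minimum of 39 training points per centroid (9984 = 39 × 256 and 19968 = 39 × 512). The sketch below only reproduces that arithmetic. It assumes 39 is the default of the clustering parameter `min_points_per_centroid`; the helper names are hypothetical and not part of Faiss. It does not explain the cuBLAS failure itself.

```python
# Assumption: 39 is the default value of Faiss's clustering parameter
# min_points_per_centroid; it is consistent with the warnings in the logs.
MIN_POINTS_PER_CENTROID = 39


def min_training_points(n_centroids, min_points_per_centroid=MIN_POINTS_PER_CENTROID):
    """Smallest training set size that avoids the clustering warning (hypothetical helper)."""
    return n_centroids * min_points_per_centroid


def max_centroids(n_train, min_points_per_centroid=MIN_POINTS_PER_CENTROID):
    """Largest centroid count that n_train points can support without the warning (hypothetical helper)."""
    return n_train // min_points_per_centroid


if __name__ == "__main__":
    # Centroid counts taken from the log lines above.
    for k in (256, 512):
        print(f"{k} centroids need at least {min_training_points(k)} training points")
    # The run in the logs used 600 training points.
    print(f"600 training points support at most {max_centroids(600)} centroids")
```

Under this assumption, the 600 training points used here are far below what either 256 or 512 centroids would need to train without the warning.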
Please install via anaconda.