YaleDHLab / pix-plot

A WebGL viewer for UMAP or TSNE-clustered images

HDBSCAN not available

lakonis opened this issue

Hello, I have the following packages installed under Python 3.7.16:

tensorflow                     2.5.0
numpy                          1.19.5
hdbscan                        0.8.24
pixplot                        0.0.113

Yet pixplot gives me the following errors when reading my dataset and metadata CSV:

2023-03-22 17:53:45.147839: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-03-22 17:53:45.147862: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-03-22 17:53:46.145373: HDBSCAN not available; using sklearn KMeans
2023-03-22 17:53:49.159517: CUML not available; using umap-learn UMAP
2023-03-22 17:53:49.159901: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-22 17:53:49.161109: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-03-22 17:53:49.161125: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2023-03-22 17:53:49.161142: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (nicolas-hpeb830g8): /proc/driver/nvidia/version does not exist
2023-03-22 17:53:49.161469: I tensorflow/core/common_runtime/direct_session.cc:361] Device mapping: no known devices.

I don't understand these errors, nor why HDBSCAN is not available.

Thanks for your help!

Interesting -- if you start Python and try:

import hdbscan

...do you get no response (which is good!) or an error?

An error indeed:

> python                                                                                                
Python 3.7.16 (default, Mar 22 2023, 16:00:53) 
[GCC 12.2.1 20230201] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hdbscan
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages/hdbscan/__init__.py", line 1, in <module>
    from .hdbscan_ import HDBSCAN, hdbscan
  File "/home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 21, in <module>
    from ._hdbscan_linkage import (single_linkage,
  File "hdbscan/_hdbscan_linkage.pyx", line 1, in init hdbscan._hdbscan_linkage
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

So it has something to do with numpy.
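
The traceback above ends in a ValueError rather than an ImportError, so a check that only catches ImportError would miss it. Below is a minimal diagnostic sketch using only the standard library. The helper name check_import is hypothetical and not part of pixplot or hdbscan; it simply reports which exception, if any, an import raises.

```python
import importlib


def check_import(module_name):
    """Try to import a module and report what happened.

    Returns (ok, detail). Any exception counts as a failure, because a
    compiled extension built against a different numpy can raise
    ValueError instead of ImportError.
    """
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # deliberately broad, see docstring
        return False, f"{type(exc).__name__}: {exc}"
    # Not every module defines __version__, so fall back to a placeholder.
    version = getattr(module, "__version__", "unknown version")
    return True, str(version)


if __name__ == "__main__":
    for name in ("numpy", "hdbscan"):
        ok, detail = check_import(name)
        status = "OK" if ok else "FAILED"
        print(f"{name}: {status} ({detail})")
```

Running this in the same environment as pixplot should show the installed numpy version next to the exact exception that hdbscan raises on import.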

I tried installing different versions of numpy and hdbscan matching pixplot's last release (2020). During those tests I noticed this error:

> pip install hdbscan==0.8.29                                                                                                       
Collecting hdbscan==0.8.29
  Using cached hdbscan-0.8.29-cp37-cp37m-linux_x86_64.whl
Collecting numpy>=1.20
  Using cached numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Requirement already satisfied: scikit-learn>=0.20 in /home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages (from hdbscan==0.8.29) (0.24.2)
Requirement already satisfied: scipy>=1.0 in /home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages (from hdbscan==0.8.29) (1.4.0)
Requirement already satisfied: cython>=0.27 in /home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages (from hdbscan==0.8.29) (0.29.33)
Requirement already satisfied: joblib>=1.0 in /home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages (from hdbscan==0.8.29) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/nicolas/.pyenv/versions/3.7.16/lib/python3.7/site-packages (from scikit-learn>=0.20->hdbscan==0.8.29) (3.1.0)
Installing collected packages: numpy, hdbscan
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.5
    Uninstalling numpy-1.19.5:
      Successfully uninstalled numpy-1.19.5
  Attempting uninstall: hdbscan
    Found existing installation: hdbscan 0.8.26
    Uninstalling hdbscan-0.8.26:
      Successfully uninstalled hdbscan-0.8.26
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.5.0 requires numpy~=1.19.2, but you have numpy 1.21.6 which is incompatible.
pixplot 0.0.113 requires numpy==1.19.5, but you have numpy 1.21.6 which is incompatible.
Successfully installed hdbscan-0.8.29 numpy-1.21.6
WARNING: You are using pip version 22.0.4; however, version 23.0.1 is available.
You should consider upgrading via the '/home/nicolas/.pyenv/versions/3.7.16/bin/python3.7 -m pip install --upgrade pip' command.
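
The pip output above lists three numpy requirements: hdbscan 0.8.29 wants numpy>=1.20, tensorflow 2.5.0 wants numpy~=1.19.2, and pixplot 0.0.113 wants numpy==1.19.5. The following sketch checks a numpy version against those three constraints to show why no single version satisfies all of them. It is a simplified illustration, not pip's resolver: the functions parse, satisfies, and conflicts are hypothetical helpers, and only the "==", ">=", and "~=" operators are handled.

```python
# Requirements copied from the pip log above (package -> numpy specifier).
REQUIREMENTS = {
    "tensorflow 2.5.0": "~=1.19.2",
    "pixplot 0.0.113": "==1.19.5",
    "hdbscan 0.8.29": ">=1.20",
}


def parse(version):
    """Turn '1.19.5' into (1, 19, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, spec):
    """Check a version against one simplified pip-style specifier."""
    for op in ("==", ">=", "~="):
        if spec.startswith(op):
            target = parse(spec[len(op):])
            break
    else:
        raise ValueError(f"unsupported specifier: {spec}")
    v = parse(version)
    if op == "==":
        return v == target
    if op == ">=":
        return v >= target
    # '~=1.19.2' means '>=1.19.2' while keeping the same '1.19' prefix.
    return v >= target and v[: len(target) - 1] == target[:-1]


def conflicts(numpy_version):
    """Return the packages whose numpy requirement is not met."""
    return [
        package
        for package, spec in REQUIREMENTS.items()
        if not satisfies(numpy_version, spec)
    ]


if __name__ == "__main__":
    for candidate in ("1.19.5", "1.21.6"):
        print(candidate, "conflicts with:", conflicts(candidate))
```

With numpy 1.19.5 only the hdbscan 0.8.29 requirement fails; with numpy 1.21.6 the tensorflow and pixplot requirements fail, which matches the two conflicts pip reported.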

pixplot has worked (with "HDBSCAN not available") with numpy==1.19.5 and hdbscan versions 0.8.24 through 0.8.29.

I believe it has something to do with tensorflow, CUDA, libcudart.so.11.0, etc. I am not sure I want to go that deep, since I am using pixplot on a dataset of ~1000 images with an Intel GPU, which would involve heavier installations.

However, it seems that hdbscan takes the label/category column into account when clustering, which is particularly interesting in my case. I believe sklearn's KMeans does not; is that correct?

Am I missing anything else without CUML?

HDBSCAN not available; using sklearn KMeans
CUML not available; using umap-learn UMAP

Thank you!

CUML is just a library that contains an accelerated implementation of UMAP; no worries there. You're correct that there are some real annoyances around numba and numpy. I'm not sure whether you're on Linux, but there are some notes at the very end of this wiki page that might help:

https://github.com/YaleDHLab/pix-plot/wiki/Ubuntu-20-&-22-with-GPU

I am on Manjaro Linux, but I have an Intel GPU. I therefore tried installing intel-extension-for-tensorflow 1.1.0, but it upgrades everything and breaks pixplot's requirements.

Again, GPU support and speed are not crucial to me. Rather, it is hdbscan that could improve my clustering, from what I understand. But maybe I am mistaken?