visual-layer / fastdup

fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset images and labels and reduce your data operations costs at an unparalleled scale.

[Bug]: FastDup Error "Largest cluster has 20,318 (171.55%) images."

domini8888 opened this issue

What happened?

Hi! I followed the guide for extracting dataset feature vectors with DINOv2 locally on my computer, and it ran successfully, but the output is strange: it says the largest cluster has 20,318 images, when I only have about 11,000 images. Why is that?

What did you expect to see?

The largest cluster should contain fewer images than the total number of images in the dataset.

What version of fastdup were you running on?

1.73

What version of Python were you running on?

Python 3.9

Operating System

Ubuntu 20.04

Reproduction steps

No response

Relevant log output

>>> fd.run(model_path='dinov2s', cc_threshold=0.8)
/home/domini8888/my_env_project/lib/python3.8/site-packages/fastdup/fastdup_controller.py:540: UserWarning: Fastdup was already applied, use overwrite=True to re-run
  warnings.warn('Fastdup was already applied, use overwrite=True to re-run')
>>> fd.run(model_path='dinov2s', cc_threshold=0.8, overwrite=True)
FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-12-13 18:28:55 [INFO] Found resent/efficientnet/dinov2 model, setting up normalization
2023-12-13 18:28:55 [INFO] Going to loop over dir images
2023-12-13 18:28:55 [INFO] Found total 11844 images to run on, 11844 train, 0 test, name list 11844, counter 11844
2023-12-13 18:28:55 [ERROR] Image images/-kmnz2AVyxW451cfZclkCA_150.jpg_20_segment.jpg is too small, image size is 6 x 39, min_input_image_width=10
2023-12-13 18:28:56 [ERROR] Image images/-kmnz2AVyxW451cfZclkCA_240.jpg_15_segment.jpg is too small, image size is 20 x 5, min_input_image_width=10
2023-12-13 18:28:57 [ERROR] Image images/-kmnz2AVyxW451cfZclkCA_330.jpg_26_segment.jpg is too small, image size is 8 x 16, min_input_image_width=10
2023-12-13 18:28:57 [ERROR] Image images/-kmnz2AVyxW451cfZclkCA_330.jpg_28_segment.jpg is too small, image size is 6 x 19, min_input_image_width=10
2023-12-13 18:28:57 [ERROR] Image images/-kmnz2AVyxW451cfZclkCA_330.jpg_41_segment.jpg is too small, image size is 9 x 80, min_input_image_width=10
2023-12-13 18:28:57 [ERROR] Image images/-kmnz2AVyxW451cfZclkCA_330.jpg_45_segment.jpg is too small, image size is 5 x 34, min_input_image_width=10
2023-12-13 18:28:58 [ERROR] Image images/0VqYcY_jRAdTBTpnFEGCoA_150.jpg_18_segment.jpg is too small, image size is 11 x 9, min_input_image_width=10
2023-12-13 18:28:58 [ERROR] Image images/0VqYcY_jRAdTBTpnFEGCoA_240.jpg_39_segment.jpg is too small, image size is 9 x 15, min_input_image_width=10
2023-12-13 18:28:58 [ERROR] Image images/0VqYcY_jRAdTBTpnFEGCoA_240.jpg_9_segment.jpg is too small, image size is 9 x 18, min_input_image_width=10
2023-12-13 18:28:59 [ERROR] Image images/0VqYcY_jRAdTBTpnFEGCoA_330.jpg_19_segment.jpg is too small, image size is 14 x 7, min_input_image_width=10
2023-12-13 18:28:59 [ERROR] Image images/0VqYcY_jRAdTBTpnFEGCoA_330.jpg_27_segment.jpg is too small, image size is 13 x 6, min_input_image_width=10
2023-12-13 18:32:37 [INFO] Found total 11844 images to run onmated: 0 Minutes
Finished histogram 4.559
Finished bucket sort 4.644
2023-12-13 18:32:38 [INFO] 1326) Finished write_index() NN model
2023-12-13 18:32:38 [INFO] Stored nn model index file work_dir/nnf.index
2023-12-13 18:32:38 [INFO] Total time took 222515 ms
2023-12-13 18:32:38 [INFO] Found a total of 10 fully identical images (d>0.990), which are 0.04 % of total graph edges
2023-12-13 18:32:38 [INFO] Found a total of 504 nearly identical images(d>0.980), which are 2.13 % of total graph edges
2023-12-13 18:32:38 [INFO] Found a total of 4219 above threshold images (d>0.900), which are 17.81 % of total graph edges
2023-12-13 18:32:38 [INFO] Found a total of 1027 outlier images         (d<0.050), which are 4.34 % of total graph edges
2023-12-13 18:32:38 [INFO] Min distance found 0.394 max distance 0.992
2023-12-13 18:32:38 [INFO] Running connected components for ccthreshold 0.800000
.0
 ########################################################################################

Dataset Analysis Summary:

    Dataset contains 11844 images
    Valid images are 86.76% (10,276) of the data, invalid are 13.24% (1,568) of the data
    For a detailed analysis, use `.invalid_instances()`.

    Similarity:  53.89% (6,383) belong to 23 similarity clusters (components).
    46.11% (5,461) images do not belong to any similarity cluster.
    Largest cluster has 20,318 (171.55%) images.
    For a detailed analysis, use `.connected_components()`
(similarity threshold used is 0.9, connected component threshold used is 0.8).

    Outliers: 5.52% (654) of images are possible outliers, and fall in the bottom 5.00% of similarity values.
    For a detailed list of outliers, use `.outliers()`.

########################################################################################
Would you like to see awesome visualizations for some of the most popular academic datasets?
Click here to see and learn more: https://app.visual-layer.com/vl-datasets?utm_source=fastdup
########################################################################################

Attach a screenshot [Optional]

No response

Contact Details [Optional]

dominicc@mit.edu

Hi @domini8888, the galleries are created as follows:

fd.vis.duplicates_gallery()    # create a visual gallery of duplicates
fd.vis.outliers_gallery()      # create a visual gallery of anomalies
fd.vis.component_gallery()     # create a visualization of connected components
fd.vis.stats_gallery()         # create a visualization of images statistics (e.g. blur)
fd.vis.similarity_gallery()    # create a gallery of similar images

If you are working inside a Jupyter notebook, you will see a gallery view. If you are working in a Python terminal, an HTML file will be created instead, which you can open in any browser.
Let us know if this works.
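
For anyone wanting to sanity-check the summary numbers by hand, here is a minimal sketch in plain Python. It only reproduces the percentage arithmetic from the log above (20,318 out of 11,844 images) and shows one way to count unique images per component. The helper names `percent_of` and `largest_component_size` and the `(image_id, component_id)` pair layout are hypothetical, made up for illustration. They are not part of the fastdup API, and this does not claim to explain why fastdup reported that number.

```python
from collections import Counter


def percent_of(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


def largest_component_size(assignments):
    """Size of the largest component, counting each image once per component.

    `assignments` is a hypothetical iterable of (image_id, component_id)
    pairs; repeated pairs are collapsed before counting.
    """
    unique_pairs = set(assignments)
    sizes = Counter(component_id for _, component_id in unique_pairs)
    return max(sizes.values()) if sizes else 0


# Figures taken from the summary in the log output above.
total_images = 11844
reported_largest_cluster = 20318

# The reported percentage is just the ratio of the two figures.
print(percent_of(reported_largest_cluster, total_images))  # 171.55

# Toy data: image "a" is listed twice in component 1 but counted once.
toy = [("a", 1), ("b", 1), ("a", 1), ("c", 2)]
print(largest_component_size(toy))  # 2
```

A cluster size counted over unique images can never exceed the dataset size, so comparing a count like this against the reported figure is one way to check whether the summary is counting something other than unique images.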