visual-layer / fastdup

fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset images & labels and reduce your data operations costs at an unparalleled scale.

[Feature Request]: mean_distance in image cluster relative to centroid + distance between different clusters (using centroids)

jeanmarie-dormoy opened this issue

Feature Name

Distances relative to centroid

Feature Description

Hello, many thanks for open-sourcing your awesome library, which is very useful for our data visualization needs.

Context :

I have already run fastdup on our image dataset, carefully read the documentation, and looked at the fastdup Python code. If I summarize correctly:

1°) the first data computed is the similarity_score between pairs (imgX, imgY) in the image dataset
2°) from the data computed in 1°), fd.connected_components() and fd.connected_components_grouped() compute visual clusters of images

I apologize if my explanations are sometimes inaccurate (I am not an ML specialist). As far as I understand, the fastdup shared C++ library (called from the Python code through the do_run function) is not publicly accessible.

First question

This concerns the dataframe df returned by: df, _ = fd.connected_components()
This dataframe has the columns: index, component_id, mean_distance, min_distance, max_distance, filename
Referring to the image below (representing a visual cluster of images), is the mean_distance in df:

  • the mean of all green distances?
  • or the mean of all red distances (distances of images relative to the centroid)?

The same question applies to min_distance and max_distance.
I ask because we ideally need the distances of images relative to the centroid, i.e., in each cluster we need:

  • d(imgX, centroid) for each imgX in the cluster
  • the mean_distance of the set { d(imgX, centroid) for each imgX in the cluster }

[Image: mean_distance]

Second question

This question arises from a need to visualize the distance between image clusters.
We thought a good approximation would be to compute the distance between cluster centroids, as shown in the following image.
The idea is to get a list of tuples/rows (component_id_from, component_id_to, distance_between_centroids).

[Image: cluster_distance]

If the requested features do not exist yet, do you think it would be possible to add them easily on your side?
Alternatively, do you have a method in mind that we could apply to compute these centroid-relative distances ourselves, based on data that can already be generated?

Looking forward to hearing from you.
Best regards

Contact Information [Optional]

jean-marie.dormoy@univ-cotedazur.fr

Hello @jeanmarie-dormoy, thanks for using our fastdup package!
Connected components clustering is a simple clustering algorithm, as explained here: https://en.wikipedia.org/wiki/Component_(graph_theory)

The mean, max and min distances refer to the edge distances inside the component, not to the distance from each image to the component mean. fastdup also supports the kmeans algorithm, as shown here: https://www.kaggle.com/code/graphlab/fastdup-kmeans.
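
Since the reported statistics are edge-based, the centroid-relative distances asked about in the first question would have to be computed separately. The following is a minimal sketch of one way to do that with NumPy only. It assumes a per-image feature matrix and an array of integer component ids in the same row order; how those two are obtained and aligned is not shown here. The function name centroid_distances and the choice of cosine distance (1 - cosine similarity) are illustrative assumptions, not part of fastdup.

```python
import numpy as np

def centroid_distances(features, component_ids):
    """Cosine distance of every image to the centroid of its own component.

    features      : (n_images, d) array, one feature vector per image (assumed input)
    component_ids : (n_images,) integer array, aligned row by row with features
    Returns (per_image_distance, mean_distance_by_component).
    """
    features = np.asarray(features, dtype=np.float64)
    component_ids = np.asarray(component_ids)

    # L2-normalize each feature vector so dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    per_image = np.empty(len(unit))
    mean_by_component = {}
    for cid in np.unique(component_ids):
        mask = component_ids == cid
        # Centroid = normalized average of the member vectors.
        centroid = unit[mask].mean(axis=0)
        centroid = centroid / max(np.linalg.norm(centroid), 1e-12)
        # d(imgX, centroid) for each imgX in this component.
        dist = 1.0 - unit[mask] @ centroid
        per_image[mask] = dist
        mean_by_component[int(cid)] = float(dist.mean())
    return per_image, mean_by_component
```

The per-image array can be attached as a new column to the components dataframe, provided the rows are in the same order as the feature matrix.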

Regarding your second question, it is possible to load the binary features using https://visual-layer.github.io/fastdup/#fastdup.load_binary_feature, then compute the average feature vector for each cluster, and then compute the cosine similarity between the clusters. This is an approximation.
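
The approach described above (average the feature vectors per cluster, then compare the averages with cosine similarity) could be sketched roughly as follows. This is an illustration under assumptions: the feature matrix is taken as already loaded and aligned row by row with integer component ids, and the loading step itself is not shown. The function name centroid_similarities is hypothetical, and the output row layout follows the (component_id_from, component_id_to, value) form requested in the question.

```python
import numpy as np

def centroid_similarities(features, component_ids):
    """Cosine similarity between the average feature vectors of all cluster pairs.

    features      : (n_images, d) array, one feature vector per image (assumed input)
    component_ids : (n_images,) integer array, aligned row by row with features
    Returns a list of (component_id_from, component_id_to, cosine_similarity).
    """
    features = np.asarray(features, dtype=np.float64)
    component_ids = np.asarray(component_ids)
    ids = np.unique(component_ids)

    # Average feature vector of each cluster.
    centroids = np.stack(
        [features[component_ids == cid].mean(axis=0) for cid in ids]
    )
    # Normalize so the matrix product below yields cosine similarities.
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    centroids = centroids / np.clip(norms, 1e-12, None)
    sim = centroids @ centroids.T

    # Keep each unordered pair once (upper triangle, no self-pairs).
    return [
        (int(ids[i]), int(ids[j]), float(sim[i, j]))
        for i in range(len(ids))
        for j in range(i + 1, len(ids))
    ]
```

If a distance rather than a similarity is needed for visualization, 1 - cosine_similarity is a common choice. Note that the number of pairs grows quadratically with the number of clusters, so for many clusters it may be preferable to keep only the nearest few per cluster.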

Let us know if we can help with anything else!