packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection


Add parameter to drop labels for clustering metrics

smarbal opened this issue · comments

Improvement suggestion

Because clustering metrics may or may not rely on ground-truth labels, it would be useful to add a parameter to the train method that keeps only the metrics that don't use labels.
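To illustrate the idea, here is a minimal sketch of such a filter using scikit-learn's clustering metrics. The registry and the `compute_metrics`/`ignore_labels` names are hypothetical (not packing-box's actual API); only the distinction between label-free and label-based metrics is taken from the suggestion above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Hypothetical registry: metric name -> (function, needs ground-truth labels?)
CLUSTERING_METRICS = {
    "silhouette":        (silhouette_score,         False),
    "calinski_harabasz": (calinski_harabasz_score,  False),
    "davies_bouldin":    (davies_bouldin_score,     False),
    "adjusted_rand":     (adjusted_rand_score,      True),  # requires true labels
}

def compute_metrics(X, y_pred, y_true=None, ignore_labels=False):
    """Compute clustering metrics, optionally dropping label-based ones."""
    results = {}
    for name, (fn, needs_labels) in CLUSTERING_METRICS.items():
        if needs_labels:
            # Skip label-based metrics when labels are unavailable or ignored.
            if ignore_labels or y_true is None:
                continue
            results[name] = fn(y_true, y_pred)
        else:
            # Label-free metrics only look at the data and the cluster assignments.
            results[name] = fn(X, y_pred)
    return results

X, y = make_blobs(n_samples=100, centers=3, random_state=42)
pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(compute_metrics(X, pred, y_true=y, ignore_labels=True))
```

With `ignore_labels=True`, only the three label-free scores are computed, regardless of whether ground-truth labels are present.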

commented

@smarbal Please test...

Hello,
Since commit 95b94c4, the metrics aren't printed at the end of the training anymore.

Steps to reproduce

  • Make a simple dataset: dataset make upx-PE -p upx -f PE
  • Train KMeans on it: model train upx-PE -a kmeans --ignore-labels

Example

model train upx-PE -a kmeans --ignore-labels
00:00:03.692 [INFO] Selected algorithm: K-Means clustering
00:00:03.693 [INFO] Reference dataset:  upx-PE(PE32,PE64)
00:00:03.694 [INFO] Computing features...
00:00:37.711 [INFO] Making pipeline...
00:00:37.714 [INFO] Training model...
00:00:37.714 [INFO] (step 1/2) Processing standardize (StandardScaler)
00:00:37.718 [INFO] (step 2/2) Processing kmeans

Name: upx-PE_pe32-pe64_100_kmeans_f109


00:00:38.224 [INFO] Parameters:
- n_clusters = 8
- n_init = 10
- max_iter = 300
- tol = 0.0001
- algorithm = lloyd

The clustering metrics are present in the algorithms' configuration file, and before the commit the command output the following:

$ model train upx-PE -a kmeans --ignore-labels
00:00:03.502 [INFO] Selected algorithm: K-Means clustering
00:00:03.503 [INFO] Reference dataset:  upx-PE(PE32,PE64)
00:00:03.505 [INFO] Computing features...
00:00:39.249 [INFO] Making pipeline...
00:00:39.252 [INFO] Training model...
00:00:39.252 [INFO] (step 1/2) Processing standardize (StandardScaler)
00:00:39.256 [INFO] (step 2/2) Processing kmeans

Name: upx-PE_pe32-pe64_100_kmeans_f109


  ─────  ────────────────  ───────────────────────  ────────────────────
  .      Silhouette Score  Calinski Harabasz Score  Davies Bouldin Score
  Train  -0.216            8.474                    10.921
  ─────  ────────────────  ───────────────────────  ────────────────────
...
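For context, the three scores in the table above are standard label-free clustering metrics from scikit-learn, which packing-box builds on. A minimal sketch of computing them, mirroring the StandardScaler → KMeans pipeline from the log (the synthetic dataset is illustrative; the KMeans parameters match those printed by the training run):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the feature matrix computed from the dataset.
X, _ = make_blobs(n_samples=100, centers=2, random_state=0)

# Same two steps as in the training log: standardize, then kmeans.
pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("kmeans", KMeans(n_clusters=8, n_init=10, max_iter=300, tol=1e-4,
                      random_state=0)),
])
labels = pipe.fit_predict(X)

# All three scores need only the data and the predicted cluster assignments.
print(f"Silhouette Score:        {silhouette_score(X, labels):.3f}")
print(f"Calinski Harabasz Score: {calinski_harabasz_score(X, labels):.3f}")
print(f"Davies Bouldin Score:    {davies_bouldin_score(X, labels):.3f}")
```

None of these functions take ground-truth labels, which is exactly why they should survive an `--ignore-labels` run.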
commented

@smarbal You can see why by using model -v train .... I guess the predictions are all -1, meaning no label. I think something is failing in the pipeline but I haven't figured out what yet.

@dhondta Verbosity didn't give any more information.
I tested changing parts of the code, and by reverting _convert_output to its previous state, I got the metrics back.
My guess is that the problem comes from this condition: if all(x == LABELS_BACK_CONV[NOT_LABELLED] for x in yp).
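To make the suspicion concrete, here is a minimal reproduction of how such a guard could wrongly suppress the metrics. The names NOT_LABELLED and LABELS_BACK_CONV come from the quoted condition; their values and the surrounding logic are a hypothetical simplification, not packing-box's actual code.

```python
# Hypothetical values: a "not labelled" marker that converts back to -1.
NOT_LABELLED = "?"
LABELS_BACK_CONV = {NOT_LABELLED: -1}

def should_skip_metrics(yp):
    """Skip metric computation when every prediction is the -1 sentinel."""
    return all(x == LABELS_BACK_CONV[NOT_LABELLED] for x in yp)

# KMeans predictions are cluster IDs (0..n_clusters-1), not -1 sentinels,
# so the guard should evaluate to False and the metrics should be printed.
print(should_skip_metrics([0, 3, 1]))   # False: real cluster IDs, metrics kept
print(should_skip_metrics([-1, -1]))    # True: all unlabelled, metrics skipped
```

If the output conversion ever maps cluster IDs to the sentinel before this check runs, the guard fires on every run and the metrics table silently disappears, matching the behaviour reported above.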

commented

@smarbal Please test.

Works as intended. Thank you.