Add parameter to drop labels for clustering metrics
smarbal opened this issue · comments
Improvement suggestion
Because clustering algorithms can have metrics that use labels or not, it could be interesting to add a parameter to the train
method that would allow keeping only the metrics that don't use labels.
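A possible shape for such a parameter, sketched with scikit-learn's clustering metrics (the `METRICS` registry, the `evaluate` function and the `ignore_labels` flag here are hypothetical illustrations, not the tool's actual code):

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical registry: each metric is tagged with whether it needs
# ground-truth labels (True) or only features + predictions (False).
METRICS = {
    "silhouette":     (metrics.silhouette_score, False),
    "davies_bouldin": (metrics.davies_bouldin_score, False),
    "adjusted_rand":  (metrics.adjusted_rand_score, True),
}

def evaluate(X, y_pred, y_true=None, ignore_labels=False):
    """Compute only the metrics applicable given the available labels."""
    results = {}
    for name, (fn, needs_labels) in METRICS.items():
        if needs_labels and (ignore_labels or y_true is None):
            continue  # skip label-based metrics when labels are ignored/absent
        args = (y_true, y_pred) if needs_labels else (X, y_pred)
        results[name] = fn(*args)
    return results

X, y = make_blobs(n_samples=100, centers=3, random_state=0)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(sorted(evaluate(X, pred, y, ignore_labels=True)))  # label-free metrics only
```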
Hello,
Since commit 95b94c4, the metrics aren't printed at the end of the training anymore.
Steps to reproduce
- Make a simple dataset:
dataset make upx-PE -p upx -f PE
- Train KMeans on it:
model train upx-PE -a kmeans --ignore-labels
Example
model train upx-PE -a kmeans --ignore-labels
00:00:03.692 [INFO] Selected algorithm: K-Means clustering
00:00:03.693 [INFO] Reference dataset: upx-PE(PE32,PE64)
00:00:03.694 [INFO] Computing features...
00:00:37.711 [INFO] Making pipeline...
00:00:37.714 [INFO] Training model...
00:00:37.714 [INFO] (step 1/2) Processing standardize (StandardScaler)
00:00:37.718 [INFO] (step 2/2) Processing kmeans
Name: upx-PE_pe32-pe64_100_kmeans_f109
00:00:38.224 [INFO] Parameters:
- n_clusters = 8
- n_init = 10
- max_iter = 300
- tol = 0.0001
- algorithm = lloyd
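For reference, the parameter values printed above are scikit-learn's `KMeans` defaults; assuming the tool's pipeline wraps scikit-learn (which the `StandardScaler` step in the log suggests), the two training steps roughly correspond to:

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two-step pipeline matching the log: standardize, then kmeans,
# with the parameters listed above (scikit-learn's defaults).
pipeline = Pipeline([
    ("standardize", StandardScaler()),
    ("kmeans", KMeans(n_clusters=8, n_init=10, max_iter=300,
                      tol=0.0001, algorithm="lloyd")),
])
print(pipeline.named_steps["kmeans"].n_clusters)  # prints 8
```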
The clustering metrics are present in the algorithms' configuration file, and before the commit the command output this:
$ model train upx-PE -a kmeans --ignore-labels
00:00:03.502 [INFO] Selected algorithm: K-Means clustering
00:00:03.503 [INFO] Reference dataset: upx-PE(PE32,PE64)
00:00:03.505 [INFO] Computing features...
00:00:39.249 [INFO] Making pipeline...
00:00:39.252 [INFO] Training model...
00:00:39.252 [INFO] (step 1/2) Processing standardize (StandardScaler)
00:00:39.256 [INFO] (step 2/2) Processing kmeans
Name: upx-PE_pe32-pe64_100_kmeans_f109
───── ──────────────── ─────────────────────── ────────────────────
. Silhouette Score Calinski Harabasz Score Davies Bouldin Score
Train -0.216 8.474 10.921
───── ──────────────── ─────────────────────── ────────────────────
...
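The three scores in the table are scikit-learn's label-free clustering metrics; a minimal standalone reproduction on toy blobs (the printed values are for the toy data, not the dataset above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# All three scores need only the features and the predicted cluster
# assignments -- no ground-truth labels are involved.
print(round(silhouette_score(X, pred), 3))
print(round(calinski_harabasz_score(X, pred), 3))
print(round(davies_bouldin_score(X, pred), 3))
```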
@smarbal You can see why by using model -v train .... I guess the predictions are all -1, meaning no label. I think there is something failing in the pipeline but I did not figure out what yet.
@dhondta Verbosity didn't give any more information. I tested changing parts of the code and, by reverting _convert_output to its previous state, I managed to get the metrics back.
My guess is that the problem comes from this condition: if all(x == LABELS_BACK_CONV[NOT_LABELLED] for x in yp).
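The suspected guard can be exercised in isolation. A minimal sketch using the names from the condition above (the -1 marker matches the "predictions are all -1" observation earlier in the thread, but the NOT_LABELLED value and the mapping itself are assumptions for illustration):

```python
# Names taken from the quoted condition; the concrete values are assumed.
NOT_LABELLED = "?"
LABELS_BACK_CONV = {NOT_LABELLED: -1}

def metrics_skipped(yp):
    """Mirror of the suspected guard: True means metrics are not printed."""
    return all(x == LABELS_BACK_CONV[NOT_LABELLED] for x in yp)

print(metrics_skipped([-1, -1, -1]))  # True: every prediction is 'not labelled'
print(metrics_skipped([0, 1, -1]))    # False: real cluster ids are present
```

If the pipeline ever yields only -1 predictions (as suspected above), this guard silently suppresses the metrics table, which matches the reported behaviour.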
Works as intended. Thank you.