packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection


Add parameter to drop labels for clustering metrics

smarbal opened this issue · comments

Improvement suggestion

Because clustering metrics may or may not rely on ground-truth labels, it would be useful to add a parameter to the train method that keeps only the metrics that don't use labels.
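To illustrate the idea, here is a minimal sketch of such a filter using scikit-learn's clustering metrics. The registry and the `compute_metrics`/`ignore_labels` names are hypothetical (not packing-box's actual API); only the distinction between label-free and label-based metrics is taken from the suggestion above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Hypothetical registry: metric name -> (function, needs ground-truth labels?)
CLUSTERING_METRICS = {
    "silhouette":        (silhouette_score,         False),
    "calinski_harabasz": (calinski_harabasz_score,  False),
    "davies_bouldin":    (davies_bouldin_score,     False),
    "adjusted_rand":     (adjusted_rand_score,      True),  # requires true labels
}

def compute_metrics(X, y_pred, y_true=None, ignore_labels=False):
    """Compute clustering metrics, optionally dropping label-based ones."""
    results = {}
    for name, (fn, needs_labels) in CLUSTERING_METRICS.items():
        if needs_labels:
            # Skip label-based metrics when labels are unavailable or ignored.
            if ignore_labels or y_true is None:
                continue
            results[name] = fn(y_true, y_pred)
        else:
            # Label-free metrics only look at the data and the cluster assignments.
            results[name] = fn(X, y_pred)
    return results

X, y = make_blobs(n_samples=100, centers=3, random_state=42)
pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(compute_metrics(X, pred, y_true=y, ignore_labels=True))
```

With `ignore_labels=True`, only the three label-free scores are computed, regardless of whether ground-truth labels are present.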

commented

@smarbal Please test...

Hello,
Since commit 95b94c4, the metrics aren't printed at the end of the training anymore.

Steps to reproduce

  • Make a simple dataset: dataset make upx-PE -p upx -f PE
  • Train KMeans on it: model train upx-PE -a kmeans --ignore-labels

Example

model train upx-PE -a kmeans --ignore-labels
00:00:03.692 [INFO] Selected algorithm: K-Means clustering
00:00:03.693 [INFO] Reference dataset:  upx-PE(PE32,PE64)
00:00:03.694 [INFO] Computing features...
00:00:37.711 [INFO] Making pipeline...
00:00:37.714 [INFO] Training model...
00:00:37.714 [INFO] (step 1/2) Processing standardize (StandardScaler)
00:00:37.718 [INFO] (step 2/2) Processing kmeans

Name: upx-PE_pe32-pe64_100_kmeans_f109


00:00:38.224 [INFO] Parameters:
- n_clusters = 8
- n_init = 10
- max_iter = 300
- tol = 0.0001
- algorithm = lloyd

The clustering metrics are present in the algorithms' configuration file, and before the commit the command output the following:

$ model train upx-PE -a kmeans --ignore-labels
00:00:03.502 [INFO] Selected algorithm: K-Means clustering
00:00:03.503 [INFO] Reference dataset:  upx-PE(PE32,PE64)
00:00:03.505 [INFO] Computing features...
00:00:39.249 [INFO] Making pipeline...
00:00:39.252 [INFO] Training model...
00:00:39.252 [INFO] (step 1/2) Processing standardize (StandardScaler)
00:00:39.256 [INFO] (step 2/2) Processing kmeans

Name: upx-PE_pe32-pe64_100_kmeans_f109


  ─────  ────────────────  ───────────────────────  ────────────────────
  .      Silhouette Score  Calinski Harabasz Score  Davies Bouldin Score
  Train  -0.216            8.474                    10.921
  ─────  ────────────────  ───────────────────────  ────────────────────
...
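For context, the three scores in the table above are standard label-free clustering metrics from scikit-learn, which packing-box builds on. A minimal sketch of computing them, mirroring the StandardScaler → KMeans pipeline from the log (the synthetic dataset is illustrative; the KMeans parameters match those printed by the training run):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the feature matrix computed from the dataset.
X, _ = make_blobs(n_samples=100, centers=2, random_state=0)

# Same two steps as in the training log: standardize, then kmeans.
pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("kmeans", KMeans(n_clusters=8, n_init=10, max_iter=300, tol=1e-4,
                      random_state=0)),
])
labels = pipe.fit_predict(X)

# All three scores need only the data and the predicted cluster assignments.
print(f"Silhouette Score:        {silhouette_score(X, labels):.3f}")
print(f"Calinski Harabasz Score: {calinski_harabasz_score(X, labels):.3f}")
print(f"Davies Bouldin Score:    {davies_bouldin_score(X, labels):.3f}")
```

None of these functions take ground-truth labels, which is exactly why they should survive an `--ignore-labels` run.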
commented

@smarbal You can see why by using model -v train .... I guess the predictions are all -1, meaning no label. I think something is failing in the pipeline but I haven't figured out what yet.

@dhondta Verbosity didn't give any more information.
I tested changing parts of the code, and by reverting _convert_output to its previous state, I got the metrics back.
My guess is that the problem comes from this condition: if all(x == LABELS_BACK_CONV[NOT_LABELLED] for x in yp).
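To make the suspicion concrete, here is a minimal reproduction of how such a guard could wrongly suppress the metrics. The names NOT_LABELLED and LABELS_BACK_CONV come from the quoted condition; their values and the surrounding logic are a hypothetical simplification, not packing-box's actual code.

```python
# Hypothetical values: a "not labelled" marker that converts back to -1.
NOT_LABELLED = "?"
LABELS_BACK_CONV = {NOT_LABELLED: -1}

def should_skip_metrics(yp):
    """Skip metric computation when every prediction is the -1 sentinel."""
    return all(x == LABELS_BACK_CONV[NOT_LABELLED] for x in yp)

# KMeans predictions are cluster IDs (0..n_clusters-1), not -1 sentinels,
# so the guard should evaluate to False and the metrics should be printed.
print(should_skip_metrics([0, 3, 1]))   # False: real cluster IDs, metrics kept
print(should_skip_metrics([-1, -1]))    # True: all unlabelled, metrics skipped
```

If the output conversion ever maps cluster IDs to the sentinel before this check runs, the guard fires on every run and the metrics table silently disappears, matching the behaviour reported above.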

commented

@smarbal Please test.

Works as intended. Thank you.