packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection


Unrealistic model performance

AlexVanMechelen opened this issue

Issue

I kept getting unrealistic model performance of 100% on every metric in every experiment, so I pushed it to the extreme as a proof of concept (POC):

Demo experiment

Using just one randomly selected feature, byte_17_after_ep, which I believe has little predictive power for datasets with a high variation of packer families, an RF model was trained on a dataset containing many different compressor families (it is very unlikely that the 17th byte after the EP follows a common trend across all of them while never occurring in any of the not-packed samples). A sketch of what this feature presumably measures is given after the commands below.

for P in ASPack BeRoEXEPacker MEW MPRESS NSPack Packman PECompact UPX; do dataset update tmp -n 50 -s dataset-packed-pe/packed/$P -l dataset-packed-pe/labels/labels-compressor.json; done
dataset update tmp -s dataset-packed-pe/not-packed -n 400
dataset select -n 200 -s tmp tmp2
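
For reference, my assumption is that byte_17_after_ep is simply the raw value of the 17th byte following the entry point. A minimal sketch of how such a value could be read with pefile (illustration only, not packing-box's actual feature-extraction code; the sample path is hypothetical):

import pefile

def byte_after_ep(path, offset=17):
    # Parse the PE headers and read a single byte at entry point + offset
    pe = pefile.PE(path, fast_load=True)
    ep_rva = pe.OPTIONAL_HEADER.AddressOfEntryPoint
    return pe.get_data(ep_rva + offset, 1)[0]   # integer value 0-255

print(byte_after_ep("sample.exe"))              # hypothetical sample path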

Listing the datasets:

dataset list

Datasets (10)
                                                                             
  Name    #Executables   Size    Files       Formats            Packers      
 ───────────────────────────────────────────────────────────────────────────
  tmp     600            164MB   yes     PE                 compressor{307}  
  tmp2    200            32MB    yes     PE                 compressor{93}

Training the model gives perfect metrics:

model train tmp -A rf
<<snipped>>
Classification metrics                                              
                                                                    
    .     Accuracy   Precision   Recall    F-Measure    MCC    AUC  
 ────────────────────────────────────────────────────────────────── 
  Train   100.00%    100.00%     100.00%   100.00%     0.00%   -    
  Test    100.00%    100.00%     100.00%   100.00%     0.00%   -   

Testing the model on a dataset with no overlap also gives perfect metrics:

model test tmp_pe_600_rf_f1 tmp2
<<snipped>>
Classification metrics                                                        
                                                                              
  Accuracy   Precision   Recall    F-Measure    MCC    AUC   Processing Time  
 ──────────────────────────────────────────────────────────────────────────── 
  100.00%    100.00%     100.00%   100.00%     0.00%   -     10.816ms        

Question

Am I maybe doing something wrong?

commented

@dhondta For the above demo, yes, to emphasise that 100% on all metrics is unrealistic in that scenario. Besides the above experiment, I've tried many other configurations, always resulting in perfect metrics.

Conclusion

The binary classifier maps samples labeled "not-packed" to class False, while any other label is mapped to class True. Unlabeled samples are rejected and never reach model training. Since the not-packed samples above were added without a labels file, they are dropped as unlabeled, so only one class reaches the model training, yielding trivially perfect metrics.
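
A minimal sketch with scikit-learn (illustration only, not the tool's internal code) reproducing the metric pattern seen above: with a single class, accuracy/precision/recall/F1 are trivially perfect, MCC degenerates to 0 and AUC is undefined.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_true = [1] * 200          # every surviving sample is labeled "packed" (True)
y_pred = [1] * 200          # the model can only ever predict that same class

print(accuracy_score(y_true, y_pred))     # 1.0
print(precision_score(y_true, y_pred))    # 1.0
print(recall_score(y_true, y_pred))       # 1.0
print(f1_score(y_true, y_pred))           # 1.0
print(matthews_corrcoef(y_true, y_pred))  # 0.0 (degenerate: zero denominator)
try:
    roc_auc_score(y_true, y_pred)
except ValueError:
    print("AUC undefined with a single class")  # shown as '-' in the tables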

Feature

It would be useful to be able to specify, for example with a flag "-L" in the dataset convert command, that certain labels should be mapped to "not-packed". This would allow experiments where class 1 = "cryptors" and class 2 comprises everything else (samples packed with packers not belonging to the cryptor category, as well as not-packed samples), all labeled "not-packed" so the tool interprets them correctly. A sketch of the intended remapping is given below.
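
A hypothetical workaround sketch of that remapping (not an existing packing-box feature), assuming the labels file is a flat JSON mapping of sample identifiers to packer names; the category names and file paths are placeholders:

import json

# Placeholder: fill in the packer family names of the target class (e.g. cryptors)
TARGET_CATEGORY = {"SomeCryptor", "AnotherCryptor"}

with open("labels.json") as f:           # hypothetical input labels file
    labels = json.load(f)

# Everything outside the target category is relabeled as "not-packed"
remapped = {sample: (label if label in TARGET_CATEGORY else "not-packed")
            for sample, label in labels.items()}

with open("labels-binary.json", "w") as f:
    json.dump(remapped, f, indent=2)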

commented

@AlexVanMechelen see commit 8112fc5; you can now use -T with model train to solve this issue. Please test and report.

Tested & functional.
Encountered one issue, fixed in #114