packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection


Unrealistic model performance

AlexVanMechelen opened this issue

Issue

I kept getting unrealistic model performance of 100% on every metric in every experiment, so I pushed it to the extreme as a proof of concept (POC):

Demo experiment

Using just one randomly selected feature, byte_17_after_ep, which I believe has little predictive power for datasets with a high variation of packer families, an RF model was trained on a dataset containing many different compressor families (it is very unlikely that the 17th byte after the EP follows a common trend across all of them while never occurring in any of the not-packed samples). A sketch of what this feature presumably measures is given after the commands below.

for P in ASPack BeRoEXEPacker MEW MPRESS NSPack Packman PECompact UPX; do dataset update tmp -n 50 -s dataset-packed-pe/packed/$P -l dataset-packed-pe/labels/labels-compressor.json; done
dataset update tmp -s dataset-packed-pe/not-packed -n 400
dataset select -n 200 -s tmp tmp2
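
For reference, my assumption is that byte_17_after_ep is simply the raw value of the 17th byte following the entry point. A minimal sketch of how such a value could be read with pefile (illustration only, not packing-box's actual feature-extraction code; the sample path is hypothetical):

import pefile

def byte_after_ep(path, offset=17):
    # Parse the PE headers and read a single byte at entry point + offset
    pe = pefile.PE(path, fast_load=True)
    ep_rva = pe.OPTIONAL_HEADER.AddressOfEntryPoint
    return pe.get_data(ep_rva + offset, 1)[0]   # integer value 0-255

print(byte_after_ep("sample.exe"))              # hypothetical sample path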

Listing the datasets:

dataset list

Datasets (10)
                                                                             
  Name    #Executables   Size    Files       Formats            Packers      
 ───────────────────────────────────────────────────────────────────────────
  tmp     600            164MB   yes     PE                 compressor{307}  
  tmp2    200            32MB    yes     PE                 compressor{93}

Training the model gives perfect metrics:

model train tmp -A rf
<<snipped>>
Classification metrics                                              
                                                                    
    .     Accuracy   Precision   Recall    F-Measure    MCC    AUC  
 ────────────────────────────────────────────────────────────────── 
  Train   100.00%    100.00%     100.00%   100.00%     0.00%   -    
  Test    100.00%    100.00%     100.00%   100.00%     0.00%   -   

Testing the model on a dataset with no overlap also gives perfect metrics:

model test tmp_pe_600_rf_f1 tmp2
<<snipped>>
Classification metrics                                                        
                                                                              
  Accuracy   Precision   Recall    F-Measure    MCC    AUC   Processing Time  
 ──────────────────────────────────────────────────────────────────────────── 
  100.00%    100.00%     100.00%   100.00%     0.00%   -     10.816ms        

Question

Am I maybe doing something wrong?

commented

@dhondta For the above demo, yes, to emphasise that 100% on all metrics is unrealistic in that scenario. Besides the above experiment, I've tried many other configurations, always resulting in perfect metrics.

Conclusion

The binary classifier maps samples labeled "not-packed" to class False, while any other label is mapped to class True. Unlabeled samples are rejected and never reach model training. Since the not-packed samples above were added without a labels file, they are dropped as unlabeled, so only one class reaches the model training, yielding trivially perfect metrics.
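
A minimal sketch with scikit-learn (illustration only, not the tool's internal code) reproducing the metric pattern seen above: with a single class, accuracy/precision/recall/F1 are trivially perfect, MCC degenerates to 0 and AUC is undefined.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_true = [1] * 200          # every surviving sample is labeled "packed" (True)
y_pred = [1] * 200          # the model can only ever predict that same class

print(accuracy_score(y_true, y_pred))     # 1.0
print(precision_score(y_true, y_pred))    # 1.0
print(recall_score(y_true, y_pred))       # 1.0
print(f1_score(y_true, y_pred))           # 1.0
print(matthews_corrcoef(y_true, y_pred))  # 0.0 (degenerate: zero denominator)
try:
    roc_auc_score(y_true, y_pred)
except ValueError:
    print("AUC undefined with a single class")  # shown as '-' in the tables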

Feature

It would be useful to be able to specify, for example with a flag "-L" in the dataset convert command, that certain labels should be mapped to "not-packed". This would allow experiments where class 1 = "cryptors" and class 2 comprises everything else (samples packed with packers not belonging to the cryptor category, as well as not-packed samples), all labeled "not-packed" so the tool interprets them correctly. A sketch of the intended remapping is given below.
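
A hypothetical workaround sketch of that remapping (not an existing packing-box feature), assuming the labels file is a flat JSON mapping of sample identifiers to packer names; the category names and file paths are placeholders:

import json

# Placeholder: fill in the packer family names of the target class (e.g. cryptors)
TARGET_CATEGORY = {"SomeCryptor", "AnotherCryptor"}

with open("labels.json") as f:           # hypothetical input labels file
    labels = json.load(f)

# Everything outside the target category is relabeled as "not-packed"
remapped = {sample: (label if label in TARGET_CATEGORY else "not-packed")
            for sample, label in labels.items()}

with open("labels-binary.json", "w") as f:
    json.dump(remapped, f, indent=2)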

commented

@AlexVanMechelen see commit 8112fc5; you can now use -T with model train to solve this issue. Please test and report.

Tested & functional.
Encountered one issue, fixed in #114