packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection

`KeyError` using modified features set

smarbal opened this issue

Description

The issue occurs when training a model on a fileless dataset that was made with a reduced features set.
The box is up to date, and the features configuration file is in the same state as when the dataset was made.

Traceback

┌──[user@packing-box]──[/mnt/share]──[main|+2]────────────[172.17.0.4]──[13:41:20]────
$ model train fs-ds-PE -a kmeans
00:00:03.157 [INFO] Selected algorithm: K-Means clustering
00:00:03.165 [INFO] Reference dataset:  fs-ds-PE(PE32,PE64)
00:00:03.167 [INFO] Loading features...
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 527, in train
    if not self._prepare(**kw):
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 213, in _prepare
    self._data = ds._data[list(exe.features.keys())]
  File "/home/user/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3811, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/home/user/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6113, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/home/user/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['byte_0_after_ep', 'byte_1_after_ep', 'byte_2_after_ep', 'byte_3_after_ep', 'byte_4_after_ep', 'byte_5_after_ep', 'byte_6_after_ep', 'byte_7_after_ep', 'byte_8_after_ep', 'byte_9_after_ep', 'byte_10_after_ep', 'byte_11_after_ep', 'byte_12_after_ep', 'byte_13_after_ep', 'byte_14_after_ep', 'byte_15_after_ep', 'byte_16_after_ep', 'byte_17_after_ep', 'byte_18_after_ep', 'byte_19_after_ep', 'byte_20_after_ep', 'byte_21_after_ep', 'byte_22_after_ep', 'byte_23_after_ep', 'byte_24_after_ep', 'byte_25_after_ep', 'byte_26_after_ep', 'byte_27_after_ep', 'byte_28_after_ep', 'byte_29_after_ep', 'byte_30_after_ep', 'byte_31_after_ep', 'byte_32_after_ep', 'byte_33_after_ep', 'byte_34_after_ep', 'byte_35_after_ep', 'byte_36_after_ep', 'byte_37_after_ep', 'byte_38_after_ep', 'byte_39_after_ep', 'byte_40_after_ep', 'byte_41_after_ep', 'byte_42_after_ep', 'byte_43_after_ep', 'byte_44_after_ep', 'byte_45_after_ep', 'byte_46_after_ep', 'byte_47_after_ep', 'byte_48_after_ep', 'byte_49_after_ep', 'byte_50_after_ep', 'byte_51_after_ep', 'byte_52_after_ep', 'byte_53_after_ep', 'byte_54_after_ep', 'byte_55_after_ep', 'byte_56_after_ep', 'byte_57_after_ep', 'byte_58_after_ep', 'byte_59_after_ep', 'byte_60_after_ep', 'byte_61_after_ep', 'byte_62_after_ep', 'byte_63_after_ep'] not in index"
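
The last frames of the traceback are a plain pandas column selection with a list of labels, where some labels are absent from the frame. A minimal sketch that reproduces the same kind of error outside the box, assuming a made-up frame and made-up feature names (`entropy`, `label` and the contents of `expected` are illustrative, not taken from the actual dataset):

```python
import pandas as pd

# Hypothetical stand-in for the dataset's data: the byte_N_after_ep columns
# are absent because the dataset was built with a reduced features set.
data = pd.DataFrame({"entropy": [7.2, 4.1], "label": ["upx", None]})

# Hypothetical list of feature names still expected at training time.
expected = ["entropy", "byte_0_after_ep", "byte_1_after_ep"]

try:
    data[expected]          # list-based selection requires every label to exist
    error = None
except KeyError as exc:
    error = str(exc)        # the message names the labels that were not found

print(error)
```

If this is what happens in `_prepare`, the feature names computed at training time include columns that were never written to the dataset.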
commented

@smarbal Please share your data.csv. I suspect the data is corrupted in some way.

The dataset and its features.yml file are here: https://github.com/packing-box/experiments-unsupervised-learning/tree/main/datasets/reduced-bytes-after-EP-features

Today I was able to create a dataset with a reduced features set and train a model on it by completely removing the feature from the configuration file. So this might be a problem with how `keep: False` is processed.
I've left the keyword in the linked configuration file so that the error is reproducible.
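
To check this hypothesis against a given dataset, one could compare the columns actually present in the data with the feature names expected at training time. A rough diagnostic sketch, assuming the data has been loaded into a pandas frame; `split_features`, the sample columns and the `expected` list are hypothetical names for illustration, not part of pbox:

```python
import pandas as pd

def split_features(data, expected):
    """Return (present, missing) feature names, keeping the order of `expected`."""
    columns = set(data.columns)
    present = [name for name in expected if name in columns]
    missing = [name for name in expected if name not in columns]
    return present, missing

# Hypothetical dataset without the byte_N_after_ep columns.
data = pd.DataFrame({"entropy": [7.2, 4.1], "number_sections": [3, 5]})

# Hypothetical expected features, including the 64 names from the traceback.
expected = ["entropy", "number_sections"] + [f"byte_{i}_after_ep" for i in range(64)]

present, missing = split_features(data, expected)
subset = data[present]      # selecting only existing columns cannot raise KeyError

print(f"{len(missing)} expected features are missing from the dataset")
```

A non-empty `missing` list made up of exactly the features marked `keep: False` would support the idea that the keyword is honored when the dataset is made but ignored when the feature names are computed for training.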

commented

@smarbal Do you still experience the same issue?

commented

I can no longer reproduce the issue. It may have been fixed in a previous commit.