packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection

`ValueError` for `train_test_split()` using unsupervised model

smarbal opened this issue

When training an unsupervised model, the following error occurs:

$ model train upx-merged -a kmeans 
00:00:03.540 [INFO] Selected algorithm: K-Means clustering
00:00:03.542 [INFO] Reference dataset:  upx-non(PE32,PE64)
00:00:03.543 [INFO] Computing features...
00:17:51.300 [INFO] Making pipeline...
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 116, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 519, in train
    if not self._prepare(**kw):
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 296, in _prepare
    train_test_split(self._data, self._target, test_size=tsize, random_state=42, stratify=self._target)
  File "/home/user/.local/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 2448, in train_test_split
    n_train, n_test = _validate_shuffle_split(
  File "/home/user/.local/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 2071, in _validate_shuffle_split
    raise ValueError(
ValueError: test_size=0 should be either positive and smaller than the number of samples 1909 or a float in the (0, 1) range
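The traceback shows `train_test_split()` being called with `test_size=0`, which scikit-learn rejects. A minimal sketch of the failure and of one possible guard, outside packing-box: `split_or_keep` is a hypothetical helper and the toy arrays are made up, so this is not necessarily what the project's fix does.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_or_keep(data, target, test_size):
    """Hypothetical helper: skip the split when no test set is requested."""
    if not test_size:
        # Same return order as train_test_split: X_train, X_test, y_train, y_test
        return data, None, target, None
    return train_test_split(data, target, test_size=test_size,
                            random_state=42, stratify=target)


# Toy data standing in for the computed features and labels (assumption)
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Calling train_test_split directly with test_size=0 raises ValueError
raised = False
try:
    train_test_split(X, y, test_size=0, random_state=42, stratify=y)
except ValueError:
    raised = True
```

With such a guard, an unsupervised run that asks for no test set keeps all samples for training, while a positive `test_size` still goes through the usual stratified split.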
commented

@smarbal This should be solved by f7dabd7. Please test.

The following error now occurs:

┌──[user@packing-box]──[/mnt/share]──[main|✓]──[✘ INT]────[172.17.0.4]──[19:38:41]────
$ model train upx-PE -a kmeans 
00:00:03.400 [INFO] Selected algorithm: K-Means clustering
00:00:03.401 [INFO] Reference dataset:  upx-PE(PE32,PE64)
00:00:03.403 [INFO] Computing features...
00:00:59.784 [INFO] Making pipeline...
00:00:59.787 [INFO] Training model...
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 588, in train
    self.pipeline.fit(self._train.data, self._train.target.values.ravel())
AttributeError: 'numpy.ndarray' object has no attribute 'values'

Removing both `to_numpy()` calls on line 295 of `model.py` seems to fix the issue.
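The mismatch can be reproduced in isolation: a pandas `DataFrame` has a `.values` attribute, but the `ndarray` returned by `to_numpy()` does not. A small sketch, assuming the target is a single-column `DataFrame`; `flat_target` is a hypothetical helper, not code from `model.py`.

```python
import numpy as np
import pandas as pd


def flat_target(target):
    """Hypothetical helper: 1-D NumPy array from a DataFrame, Series, or ndarray."""
    return np.asarray(target).ravel()


# Toy single-column target (assumption about the shape of the real data)
frame_target = pd.DataFrame({"label": [0, 1, 1, 0]})
array_target = frame_target.to_numpy()

# Only the DataFrame has .values; accessing it on the ndarray raises AttributeError
has_values_frame = hasattr(frame_target, "values")
has_values_array = hasattr(array_target, "values")

# np.asarray accepts both types, so the same call works before and after to_numpy()
flat_from_frame = flat_target(frame_target)
flat_from_array = flat_target(array_target)
```

Converting with `np.asarray(...)` instead of reading `.values` is one way to make the `fit()` call independent of whether the conversion has already happened.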

commented

@smarbal My bad, I thought the variables were of type numpy.array. I will fix this ASAP.

commented

@smarbal 304dd5d should fix this. Please test.

@dhondta Works as intended, thanks.