airbus-seclab / cpu_rec

Recognize cpu instructions in an arbitrary binary file

Allow usage of pretrained models

KOLANICH opened this issue

What do you mean? What do you want to train these models on?

The docs are sparse, but if I understand correctly, the tool does the following on every launch:

  1. trains a model on the corpus
  2. applies the trained model to the binaries whose paths are given in argv

The model is 1-NN with bigram and trigram frequencies as features.
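To make that summary concrete, here is a minimal sketch of nearest-neighbour classification over byte bigram and trigram frequencies. It illustrates the description above only and is not cpu_rec's actual decision process; the function names and the L1 distance are assumptions chosen for illustration.

```python
from collections import Counter


def ngram_frequencies(data: bytes, sizes=(2, 3)) -> dict:
    """Relative frequency of each byte n-gram, for every n in sizes."""
    freqs = {}
    for n in sizes:
        counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
        total = sum(counts.values())
        for gram, count in counts.items():
            freqs[gram] = count / total
    return freqs


def distance(p: dict, q: dict) -> float:
    """L1 distance between two sparse frequency tables (an assumed metric)."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())


def nearest_label(sample: bytes, models: dict) -> str:
    """Return the label whose reference frequencies are closest to the sample."""
    freqs = ngram_frequencies(sample)
    return min(models, key=lambda label: distance(freqs, models[label]))
```

Here `models` would map an architecture name to the frequency table computed from that architecture's corpus. Note that this sketch always returns some label; it has no way of answering "unknown".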

So the frequencies could simply be serialized to a binary file once and loaded from that file on later runs.
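A minimal sketch of that idea, assuming the trained frequencies are an ordinary picklable Python object. The `load_or_build` helper and its `build` callback are hypothetical names, not part of cpu_rec.

```python
import os
import pickle


def load_or_build(cache_path, build):
    """Return the cached model if the file exists, else build it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    model = build()  # hypothetical callback that trains on the corpus
    with open(cache_path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)
    return model
```

One caveat with this design: unpickling can execute arbitrary code, so a cache file should only be loaded if it comes from a trusted source.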

Also, I wonder whether XGBoost could give better results. That would need to be measured with cross-validation: usually 10-fold; more folds make the CV more expensive but give more accurate estimates, and in my previous research I used 24-fold. Have you run any cross-validation on your classifier? XGBoost might also give more compact models, which can be transpiled into an if-else C AST and then into machine code. There is a tool from the XGBoost authors that transpiles into a C AST (and some other languages), and a tool of mine that transpiles into a Python AST.
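For reference, k-fold cross-validation can be sketched in a few lines of plain Python. This is a generic illustration, not tied to cpu_rec or XGBoost; `fit` and `predict` are hypothetical callbacks standing in for whichever classifier is being measured, and the interleaved fold assignment is an arbitrary choice.

```python
def k_fold_accuracy(samples, labels, k, fit, predict):
    """Mean accuracy over k folds; fold i holds out items i, i+k, i+2k, ..."""
    pairs = list(zip(samples, labels))
    scores = []
    for fold in range(k):
        held_out_idx = set(range(fold, len(pairs), k))
        if not held_out_idx:
            continue  # more folds than samples: nothing to evaluate here
        train = [p for i, p in enumerate(pairs) if i not in held_out_idx]
        held_out = [p for i, p in enumerate(pairs) if i in held_out_idx]
        model = fit(train)
        correct = sum(predict(model, s) == label for s, label in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)
```

With a very small corpus per class, some folds may contain no training example of the class being tested, which is worth keeping in mind when reading the resulting number.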

OK. I understand your proposal of using pretrained models.
Please implement it and make some tests.
In my experience the result is slower, because reading the corpus from disk and computing the frequencies takes less time than reading precomputed frequencies. I don't remember whether I tried reading an xz-compressed frequency file.
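Whether a precomputed file is faster can only be settled by timing it. Below is a small sketch for measuring an xz-compressed pickle using the standard `lzma` module; the helper names are hypothetical and the object being stored is assumed to be picklable.

```python
import lzma
import pickle
import time


def save_xz(obj, path):
    """Pickle obj into an xz-compressed file."""
    with lzma.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_xz(path):
    """Load an object from an xz-compressed pickle file."""
    with lzma.open(path, "rb") as f:
        return pickle.load(f)


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

Calling `timed(load_xz, path)` and timing the from-scratch training step the same way, on the same machine, gives a direct comparison.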

I don't know about XGBoost. Note that in the case of cpu_rec, there are many architectures for which the corpus is very small. The consequence is that many results from machine learning don't apply to this case. My opinion is that only by experimenting can one see whether the tool is improved.

PS: your summary of how the tool works is good, but I don't exactly use "bigram and trigram frequencies as features". The decision process is slightly different from standard feature-based classification, especially the way I decide to output None.

In #18 I implemented saving the frequencies in a pickle file, which is ~100 MB and makes startup very fast.

The use of pickle as proposed by trou is now included in the tool.