Allow usage of pretrained models

KOLANICH opened this issue 3 years ago

Louis Granboulan · Answer 1 · Tue Jan 19 2021 01:52:21 GMT+0800 (China Standard Time)

What do you mean? What do you want to train these models on?

KOLANICH · Answer 2 · Tue Jan 19 2021 03:13:12 GMT+0800 (China Standard Time)

The docs are sparse, but if I understand correctly, the tool does the following on every launch:

1. trains a model on the corpus;
2. applies the trained model to the binaries whose paths are in argv.

The model is 1-NN on bigram and trigram frequencies as features.

So the frequencies could simply be serialized into a binary file and then loaded from that file instead of being recomputed.
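
A minimal sketch of that idea, assuming the model is just a mapping from architecture label to n-gram counts. The function names (`ngram_counts`, `build_model`, `save_model`, `load_model`) and the data layout are hypothetical illustrations, not cpu_rec's actual code:

```python
import pickle
from collections import Counter


def ngram_counts(data: bytes, n: int) -> Counter:
    """Count every overlapping n-byte sequence in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def build_model(corpus: dict) -> dict:
    """Map each architecture label to its (bigram, trigram) counts.

    `corpus` is assumed to map a label to the raw bytes of its samples.
    """
    return {
        arch: (ngram_counts(blob, 2), ngram_counts(blob, 3))
        for arch, blob in corpus.items()
    }


def save_model(model: dict, path: str) -> None:
    """Serialize the precomputed counts into a binary file."""
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path: str) -> dict:
    """Load counts saved by save_model; only use on trusted files."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

On startup, a tool built this way would call `load_model` when the file exists and fall back to `build_model` followed by `save_model` otherwise. Whether this is actually faster than recomputing has to be measured.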

Also, I wonder whether XGBoost could give better results. That needs measuring with cross-validation: usually 10-fold; more folds make the CV more expensive but give more accurate results, and in my previous research I used 24-fold. Have you done any cross-validation on your classifier? XGBoost might also give more compact models, which can be transpiled into an if-else C AST and then into machine code. There is a tool from the XGBoost authors that transpiles into C (and some other languages) AST, and a tool by me that transpiles into Python AST.
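
To make the cross-validation suggestion concrete, here is a rough sketch of k-fold evaluation for a plain 1-NN classifier. Everything in it is an assumption made for illustration: it uses bigram frequencies only, Euclidean distance, and hypothetical function names, and it does not reproduce cpu_rec's actual decision process:

```python
import math
from collections import Counter


def bigram_freqs(data: bytes) -> dict:
    """Relative frequency of each overlapping byte bigram in data."""
    counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
    total = sum(counts.values()) or 1  # avoid dividing by zero on tiny inputs
    return {gram: c / total for gram, c in counts.items()}


def distance(p: dict, q: dict) -> float:
    """Euclidean distance between two sparse frequency vectors."""
    keys = p.keys() | q.keys()
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def predict_1nn(train: list, sample: dict):
    """Return the label of the training item closest to sample."""
    return min(train, key=lambda item: distance(item[0], sample))[1]


def k_fold_accuracy(samples: list, k: int) -> float:
    """Accuracy over k folds; fold i holds out samples[i::k].

    `samples` is a list of (features, label) pairs; k must be at least 2
    so that every fold leaves a non-empty training set.
    """
    correct = 0
    for fold in range(k):
        held_out = samples[fold::k]
        train = [s for i, s in enumerate(samples) if i % k != fold]
        correct += sum(
            predict_1nn(train, feats) == label for feats, label in held_out
        )
    return correct / len(samples)
```

The same `k_fold_accuracy` loop could wrap any other classifier, which is what a fair comparison against XGBoost would require.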

Louis Granboulan · Answer 3 · Tue Jan 19 2021 06:13:02 GMT+0800 (China Standard Time)

OK. I understand your proposal of using pretrained models.
Please implement it and make some tests.
In my experience, the result is slower, because reading the corpus from disk and computing the frequencies takes less time than reading precomputed frequencies. I don't remember whether I tried reading an xz-compressed frequency file.

I don't know about XGBoost. Note that in the case of cpu_rec, there are many architectures for which the corpus is very small. As a consequence, many results from machine learning don't apply to this case. In my opinion, only experimentation can show whether the tool is improved.

PS: your summary of how the tool works is good, but I don't exactly use "bigram and trigram frequencies as features". The decision process is slightly different from standard feature-based classification, especially the way I decide to output None.

Raphaël Rigo · Answer 4 · Sat Jun 10 2023 21:47:30 GMT+0800 (China Standard Time)

In #18 I implemented saving the frequencies in a pickle file, which is ~100 MB and makes startup very fast.

Louis Granboulan · Answer 5 · Mon Mar 11 2024 01:02:09 GMT+0800 (China Standard Time)

The use of pickle as proposed by trou is now included in the tool.