center-for-threat-informed-defense / tram

TRAM is an open-source platform designed to advance research into automating the mapping of cyber threat intelligence reports to MITRE ATT&CK®.

Home Page: https://ctid.mitre-engenuity.org/our-work/tram/


Please Help: Regarding fine tuning

abhishekdhiman25 opened this issue · comments

Hi Reader,

I hope you are well. I was trying to understand the fine-tuning workflow in the "fine_tune_multi_label.ipynb" notebook.
A few questions:
Q1. What is the order of the 50 ATT&CK labels defined in the CLASSES variable?
Q2. Why does the notebook recommend not changing the code of a particular cell?
Q3. If somebody wants to fine-tune the model on a different set of ATT&CK labels, what is the correct way to do
so, and in what order should the labels be placed?
Q4. If somebody wants to increase the number of classes, what is the correct approach?

Thanks in advance for your support.

For reference, CLASSES:

```python
CLASSES = [
    'T1003.001', 'T1005', 'T1012', 'T1016', 'T1021.001', 'T1027',
    'T1033', 'T1036.005', 'T1041', 'T1047', 'T1053.005', 'T1055',
    'T1056.001', 'T1057', 'T1059.003', 'T1068', 'T1070.004',
    'T1071.001', 'T1072', 'T1074.001', 'T1078', 'T1082', 'T1083',
    'T1090', 'T1095', 'T1105', 'T1106', 'T1110', 'T1112', 'T1113',
    'T1140', 'T1190', 'T1204.002', 'T1210', 'T1218.011', 'T1219',
    'T1484.001', 'T1518.001', 'T1543.003', 'T1547.001', 'T1548.002',
    'T1552.001', 'T1557.001', 'T1562.001', 'T1564.001', 'T1566.001',
    'T1569.002', 'T1570', 'T1573.001', 'T1574.002'
]
```

Hi @abhishekdhiman25,

Q1 - They are in lexical (sorted string) order, but the order itself is somewhat arbitrary. What matters is that the order of the classes determines how the labels are vectorized, i.e. turned from strings like "T1003.001" into dense vectors. E.g. a vector beginning [1, 0, 0, 0, 0, ...] means that the associated technique is the first item in CLASSES: T1003.001.
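To make the point concrete, the vectorization described above can be sketched as a simple multi-hot encoding. This is an illustrative sketch, not the notebook's actual code, and CLASSES is truncated here for brevity:

```python
# Illustrative sketch of multi-hot label vectorization; the notebook's
# real implementation may differ. CLASSES is truncated for brevity.
CLASSES = ['T1003.001', 'T1005', 'T1012']

def vectorize(labels):
    """Turn a set of ATT&CK technique IDs into a multi-hot vector.

    The position of each 1 is fixed by the index of the technique in
    CLASSES, which is why reordering CLASSES changes the meaning of
    every label vector the model was trained on.
    """
    return [1 if c in labels else 0 for c in CLASSES]

print(vectorize({'T1003.001'}))        # -> [1, 0, 0]
print(vectorize({'T1005', 'T1012'}))   # -> [0, 1, 1]
```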
Q2 - The notebook says not to modify that cell because we have already fine-tuned SciBERT using that vectorization scheme. This notebook is intended for continuing to fine-tune with additional training data for the same set of labels. If you change the order of the labels, then additional fine-tuning will be counterproductive, because the model has to relearn what each position in the label vector represents.
Q3 - If you want to fine-tune SciBERT using different labels, you should look at the model-development/train_multi_label.ipynb notebook. That notebook illustrates how to start with an upstream SciBERT checkpoint and fine-tune it on the training data in data/tram2-data/multi_label.json.
Q4 - Same as for Q3. You'll want to set up the MITRE Annotation Toolkit for labeling your additional training data. See: https://github.com/center-for-threat-informed-defense/tram/wiki/Data-Annotation
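If you do train with a different or larger label set, one reproducible way to build your own CLASSES list is to collect every technique ID that appears in your training data and sort it, matching the lexical order of the original notebook. A sketch follows; the JSON structure assumed here (a list of examples, each with a "labels" list of technique IDs) is a guess, so adapt it to the actual format of data/tram2-data/multi_label.json:

```python
import json

def derive_classes(path):
    """Build a sorted CLASSES list from a multi-label training file.

    Assumes a hypothetical structure: a JSON list of examples, each with
    a "labels" list of ATT&CK technique IDs. Adjust the parsing to match
    the real multi_label.json format.
    """
    with open(path) as f:
        examples = json.load(f)
    techniques = {label for ex in examples for label in ex["labels"]}
    # Lexical sort keeps vector positions stable across runs.
    return sorted(techniques)
```

Deriving the list from the data (rather than hand-editing it) guarantees the label vectors stay consistent every time you regenerate them.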

I hope this helps!