How to build my own dataset to fine-tune the model

Question

zhangybuaa opened this issue a year ago

I want to fine-tune the model on my own code data. How should I build the dataset? Are there any requirements for the dataset format? Does the data need to be labeled, and if so, in what format? Could you provide some guidance or examples? Thanks!

Brendan Dolan-Gavitt · Answer 1 · Sat Nov 26 2022 00:54:02 GMT+0800 (China Standard Time)

The easiest way to fine-tune is to use the run_clm.py script from Hugging Face. You can provide the dataset as a JSON Lines file (one JSON object per line, one object per source file) that looks like:

{ "name": "foo.py", "text": "content of foo.py" }
{ "name": "bar.py", "text": "content of bar.py" }

I have some more details on what arguments to pass to run_clm.py here: https://news.ycombinator.com/item?id=32331764
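
For anyone who wants to generate a file in the format above from a directory of code, here is a minimal sketch. The function name `build_jsonl`, the `suffixes` filter, and the choice to skip files that are not valid UTF-8 are my own assumptions for illustration, not part of run_clm.py. No separate labels are needed for this kind of causal language-model fine-tuning, since the training targets are just the next tokens of the text itself.

```python
import json
from pathlib import Path


def build_jsonl(source_dir, output_path, suffixes=(".py",)):
    """Write one JSON object per source file, with "name" and "text" keys.

    Hypothetical helper: returns the number of records written.
    """
    source_dir = Path(source_dir)
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        # Walk the tree in a stable order so the output is reproducible.
        for path in sorted(source_dir.rglob("*")):
            if not path.is_file() or path.suffix not in suffixes:
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except UnicodeDecodeError:
                continue  # assumption: skip files that are not valid UTF-8
            record = {
                "name": path.relative_to(source_dir).as_posix(),
                "text": text,
            }
            # json.dumps escapes newlines, so each record stays on one line.
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Calling something like `build_jsonl("my_repo", "train.json")` should then give a file you can pass to run_clm.py. As far as I know the script takes it via `--train_file` and checks for a .csv, .json, or .txt extension, so naming the output train.json rather than train.jsonl is probably the safer choice; check this against the version of the script you are using.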