fauxpilot / fauxpilot

FauxPilot - an open-source alternative to GitHub Copilot server

How to build my own dataset to fine-tune the model

zhangybuaa opened this issue

I want to fine-tune the model on my own code. How should I build the dataset? Are there any requirements for its format? Does the data need to be labeled, and if so, in what format? Any guidance or examples would be appreciated, thanks!

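The easiest way to fine-tune is to use the run_clm.py script from Hugging Face Transformers. The data does not need labels: causal language modeling trains the model to predict the next token of the raw source text. You can provide the dataset as a JSON Lines file, one JSON object per source file, that looks like: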

{ "name": "foo.py", "text": "content of foo.py" }
{ "name": "bar.py", "text": "content of bar.py" }
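For anyone who wants to generate a file in that shape from a directory of code, here is a minimal sketch using only the Python standard library. The function name `build_jsonl`, the `suffixes` filter, and the output filename are my own assumptions for illustration and are not part of FauxPilot or run_clm.py; only the `name` and `text` keys come from the example above.

```python
import json
from pathlib import Path


def build_jsonl(source_dir, output_path, suffixes=(".py",)):
    """Write one JSON object per source file, with "name" and "text" keys.

    Hypothetical helper: returns the number of records written.
    """
    source_dir = Path(source_dir)
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        # Sort so the output order is deterministic across runs.
        for path in sorted(source_dir.rglob("*")):
            if not path.is_file() or path.suffix not in suffixes:
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except UnicodeDecodeError:
                continue  # skip files that are not valid UTF-8
            record = {
                "name": path.relative_to(source_dir).as_posix(),
                "text": text,
            }
            # json.dumps escapes newlines, so each record stays on one line.
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Calling `build_jsonl("my_repo", "train.json")` would then give a file you can pass to run_clm.py with `--train_file`. I used a `.json` extension here because, as far as I recall, the script checks the file extension of the training file; verify that against the version of the script you are running.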

I have more details on which arguments to pass to run_clm.py here: https://news.ycombinator.com/item?id=32331764