salesforce / CodeGen

CodeGen is an open-source model for program synthesis, trained on TPU-v4 and competitive with OpenAI Codex.


An attempt to write a fine-tuning script for CodeGen

vishalsingha opened this issue · comments

I am interested in writing a fine-tuning script for this model.

Can anyone tell me in what format I should provide the output to the model while training?

The data is of the form:
{"code": "def hello(name): return f\"Hello {name}\"", "nl": "This function takes a name as input and returns a message saying hello to the person in the format 'Hello name'."}

We can tokenize the input (either code or text) by creating a tokenizer and passing the text into it.

But what format should I use for training, and how do I compute the loss between the original output and the predicted output?
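
In case it helps future readers, here is a minimal sketch of one common way to do this with Hugging Face Transformers (the checkpoint name and the comment-style prompt format below are my own assumptions, not the authors' recipe): concatenate the NL description and the code into a single sequence and fine-tune the model as a causal LM. With `labels` set to the input ids, the model shifts them internally and returns the cross-entropy loss between predicted and actual next tokens, so there is no need to compare outputs manually.

```python
# Minimal sketch, not the official fine-tuning script. Assumes the public
# Hugging Face checkpoint "Salesforce/codegen-350M-mono"; the prompt format
# (NL as a leading comment above the code) is an illustrative choice.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

example = {
    "code": 'def hello(name): return f"Hello {name}"',
    "nl": "This function takes a name as input and returns a greeting.",
}

# One possible training format: NL description as a comment, then the code,
# then an end-of-text token so the model learns where completions stop.
text = f"# {example['nl']}\n{example['code']}{tokenizer.eos_token}"

enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# For causal-LM fine-tuning the labels are the input ids themselves; the
# model shifts them by one position internally and computes the loss.
labels = enc["input_ids"].clone()

outputs = model(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    labels=labels,
)
print(outputs.loss)      # scalar cross-entropy loss for this example
outputs.loss.backward()  # gradients for one optimizer step
```

If you only want the loss computed on the code and not on the NL prompt, a common variant is to set the label positions corresponding to the prompt tokens to -100, which the cross-entropy loss ignores.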


Also, can anyone give me a small sample (only 4 to 5 examples) of the BIGPYTHON dataset used for training CodeGen-Mono, just to get an idea of the training set?

@vishalsingha I'm wondering whether you have made any progress in this regard?

@vishalsingha do you plan to share it?

Sorry, it was done as part of an internship, so according to company policy I can't share it.

@vishalsingha got it. It would be awesome if you could share any insight into how you approached developing the fine-tuning script. Also, any references to open-source resources would be much appreciated.