salesforce / CodeGen

CodeGen is a family of open-source models for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.

A question about the detail of data preprocessing

zhengzzj opened this issue

Hello!

I would like to fine-tune the model. While going through the data preprocessing step, I saw that line 33 of the file https://github.com/salesforce/jaxformer/blob/main/preprocess/1_split_raw.py sets args.data_bucket_path = '/tmp/dataset_v1/ 0_raw/train.txt'.

What kind of data does train.txt contain? Is all of the code used for training placed in this single train.txt file?