chrisociepa / allamo

Simple, hackable and fast implementation for training/finetuning medium-sized LLaMA-based models

train.txt format

AAnirudh07 opened this issue · comments

Hey @chrisociepa, awesome repo! Could you please shed some light on what the train.txt file should look like?

Thank you!

commented

No personal experience, but it's likely whatever data you want to train/fine-tune on. The exact format is up to you, but keep in mind that the data will be tokenized.

In nanoGPT by Andrej Karpathy, which is a major inspiration for this repo, Shakespeare is used as a toy example:
https://github.com/karpathy/nanoGPT/blob/0d8fbd11aed59617f65d2bbd14842b4050516128/data/shakespeare/prepare.py#L9
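To illustrate, here is a minimal sketch of preparing a plain-text train file in the spirit of nanoGPT's prepare.py. This is not allamo's actual pipeline; the char-level tokenizer, the 90/10 split, and the file names are illustrative assumptions (nanoGPT's Shakespeare example uses a BPE tokenizer via tiktoken, but the idea is the same: raw text in, compact token ids out):

```python
import array

def prepare(text: str, train_frac: float = 0.9):
    """Char-level encoding of raw text into train/val token-id lists.
    A stand-in for a real tokenizer (e.g. BPE) -- assumption, not allamo's code."""
    # Build a character-level vocabulary from the raw text
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    # Encode the whole corpus as a flat list of token ids
    ids = [stoi[ch] for ch in text]
    # Simple train/val split, as in nanoGPT's toy example
    split = int(len(ids) * train_frac)
    return ids[:split], ids[split:], stoi

# Usage sketch: read train.txt as one UTF-8 blob, encode, and dump the ids
# as compact 16-bit binaries (nanoGPT's train.bin/val.bin style):
#   text = open("train.txt", encoding="utf-8").read()
#   train_ids, val_ids, stoi = prepare(text)
#   array.array("H", train_ids).tofile(open("train.bin", "wb"))
#   array.array("H", val_ids).tofile(open("val.bin", "wb"))
```

So train.txt itself can just be raw text; what matters is that your preprocessing turns it into the token stream the training loop expects.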

This was super helpful, thank you!