chrisociepa / allamo

Simple, hackable and fast implementation for training/finetuning medium-sized LLaMA-based models

train.txt format

AAnirudh07 opened this issue · comments

Hey @chrisociepa, awesome repo! Could you please shed some light on what the train.txt file should look like?

Thank you!

commented

No personal experience, but it's likely whatever data you want to train/fine-tune on. The exact format is up to you, but keep in mind that the data will be tokenized.

In nanoGPT by Andrej Karpathy, which is a major inspiration for this repo, Shakespeare is used as a toy example:
https://github.com/karpathy/nanoGPT/blob/0d8fbd11aed59617f65d2bbd14842b4050516128/data/shakespeare/prepare.py#L9
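To illustrate, here is a minimal sketch of preparing a plain-text train file in the spirit of nanoGPT's prepare.py. This is not allamo's actual pipeline; the char-level tokenizer, the 90/10 split, and the file names are illustrative assumptions (nanoGPT's Shakespeare example uses a BPE tokenizer via tiktoken, but the idea is the same: raw text in, compact token ids out):

```python
import array

def prepare(text: str, train_frac: float = 0.9):
    """Char-level encoding of raw text into train/val token-id lists.
    A stand-in for a real tokenizer (e.g. BPE) -- assumption, not allamo's code."""
    # Build a character-level vocabulary from the raw text
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    # Encode the whole corpus as a flat list of token ids
    ids = [stoi[ch] for ch in text]
    # Simple train/val split, as in nanoGPT's toy example
    split = int(len(ids) * train_frac)
    return ids[:split], ids[split:], stoi

# Usage sketch: read train.txt as one UTF-8 blob, encode, and dump the ids
# as compact 16-bit binaries (nanoGPT's train.bin/val.bin style):
#   text = open("train.txt", encoding="utf-8").read()
#   train_ids, val_ids, stoi = prepare(text)
#   array.array("H", train_ids).tofile(open("train.bin", "wb"))
#   array.array("H", val_ids).tofile(open("val.bin", "wb"))
```

So train.txt itself can just be raw text; what matters is that your preprocessing turns it into the token stream the training loop expects.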

This was super helpful, thank you!