karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from Github https://github.com/karpathy/llm.c

Is there any way to make a customized dataset?

dongrixinyu opened this issue

I have tested the example in the tutorial with train_gpt2_fp32.cu. The dataset files it reads were downloaded from Hugging Face:

    // read in the (optional) command line arguments
    const char* train_data_pattern = "dev/data/tinyshakespeare/tiny_shakespeare_train.bin";
    const char* val_data_pattern = "dev/data/tinyshakespeare/tiny_shakespeare_val.bin";

As you know, the structure of the .bin file is fairly complicated and non-trivial. Is there an easy way to create a customized dataset .bin file from a purely raw text dataset?
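
To make the question concrete, here is a minimal sketch of the writing step in C. It rests on several assumptions that should be checked against the repository's own data scripts: that the file starts with a header of 256 32-bit integers (magic number 20240520, version 1, then the token count), that the tokens follow as unsigned 16-bit integers, and that the machine is little-endian. The function name `write_token_file` is hypothetical, and the sketch does not tokenize anything: the raw text must already have been converted to GPT-2 token ids by some other tool.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HEADER_INTS  256        /* assumed header length, in 32-bit ints */
#define DATA_MAGIC   20240520   /* assumed magic number for GPT-2 token files */
#define DATA_VERSION 1          /* assumed format version */

/* Hypothetical helper: write `ntok` token ids to `path` in the assumed layout.
 * Values are written in native byte order, so this matches the assumed
 * format only on a little-endian machine.
 * Returns 0 on success, -1 on failure. */
int write_token_file(const char *path, const uint16_t *tokens, int32_t ntok) {
    if (path == NULL || ntok < 0 || (tokens == NULL && ntok > 0)) {
        return -1;
    }

    /* Header: all zeros except the three fields assumed above. */
    int32_t header[HEADER_INTS];
    memset(header, 0, sizeof(header));
    header[0] = DATA_MAGIC;
    header[1] = DATA_VERSION;
    header[2] = ntok;

    FILE *f = fopen(path, "wb");
    if (f == NULL) {
        return -1;
    }

    /* Header first, then the raw token ids. */
    int ok = fwrite(header, sizeof(int32_t), HEADER_INTS, f) == HEADER_INTS;
    if (ok && ntok > 0) {
        ok = fwrite(tokens, sizeof(uint16_t), (size_t)ntok, f) == (size_t)ntok;
    }
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

If these assumptions hold, calling `write_token_file("my_train.bin", tokens, ntok)` once for the training split and once for the validation split would produce files that could be substituted for the two paths above.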