karpathy / llama2.c

Inference Llama 2 in one file of pure C

How to add a different corpus?

pure-water opened this issue · comments

I want to train on something other than TinyStories. I have a list of plain text files. How do I train on them?

You can reuse the same code as is. Just upload your own JSON file containing the documents you want to train on. Make sure that in tinystories.py you replace the line

text = example["story"]

with

text = example["text"]

or whatever your field is named in your custom JSON.
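
Since the original question mentions a list of plain files, here's a minimal sketch (the paths are placeholders; the "text" field matches the replacement above) that wraps a directory of .txt files into one JSON file in that shape:

import glob
import json

# Wrap plain .txt files into a single JSON list of {"text": ...} documents,
# matching the field name used in tinystories.py above. Paths are placeholders.
docs = []
for path in sorted(glob.glob('my_corpus/*.txt')):
    with open(path, 'r') as f:
        docs.append({"text": f.read().strip()})

with open('custom.json', 'w') as f:
    json.dump(docs, f, indent=2)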

Because this code only works with shards, here's an example of how you'd split the JSON into shards:

import json

def split_json(input_file, output_prefix, n):
    with open(input_file, 'r') as file:
        data = json.load(file)

    # Determine the size of each shard
    shard_size = len(data) // n

    # Split the data into n shards
    shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(n - 1)]
    shards.append(data[(n - 1) * shard_size:])

    # Write each shard to a separate file
    for i, shard_data in enumerate(shards):
        shard_file_path = f'{output_prefix}_{i + 1}.json'
        with open(shard_file_path, 'w') as shard_file:
            json.dump(shard_data, shard_file, indent=2)

# Example usage
input_json_file = 'llama2.c/data/TinyStories_all_data/custom.json'
output_prefix = 'llama2.c/data/TinyStories_all_data/shard'
num_shards = 10

split_json(input_json_file, output_prefix, num_shards)
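
After splitting, it's worth sanity-checking that no documents were lost. One caveat (an assumption about tinystories.py's behavior, so verify against your checkout): if the pretokenization step globs every *.json in the data directory, the original custom.json sitting next to the shards would get picked up too, so consider moving it elsewhere.

import glob
import json

# Sanity check: total documents across shards should equal the original count.
shard_files = sorted(glob.glob('llama2.c/data/TinyStories_all_data/shard_*.json'))
total = sum(len(json.load(open(f))) for f in shard_files)
print(f'{len(shard_files)} shards, {total} documents total')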

What do you mean by "upload", and upload to where?

Thanks. Actually, I already figured out another way of doing it, which turned out to be a lot of fun. By the way, it doesn't necessarily have to be JSON.

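For anyone curious, something along these lines should work (a rough sketch, not exactly what I ran; it assumes the repo's tokenizer.py with its SentencePiece-backed Tokenizer, and placeholder paths): skip JSON entirely and tokenize straight to the .bin format the training script reads.

import glob
import numpy as np
from tokenizer import Tokenizer  # llama2.c's SentencePiece wrapper

# Rough sketch: tokenize plain .txt files straight into a .bin of uint16
# token ids, the on-disk format the repo's pretokenization step produces
# for training. Paths and filenames here are placeholders.
enc = Tokenizer()
all_tokens = []
for path in sorted(glob.glob('my_corpus/*.txt')):
    with open(path, 'r') as f:
        text = f.read().strip()
    all_tokens.extend(enc.encode(text, bos=True, eos=False))

with open('my_corpus.bin', 'wb') as f:
    f.write(np.array(all_tokens, dtype=np.uint16).tobytes())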

It would be great if you could share how you're doing it with this community, or perhaps count me in on your fork; I'd like to see how you got this up and running.

Because, you see, getting even small responses out of a CPU can make our bulk of existing hardware useful. Most of the globe is not yet on GPUs.

I've tested it on traditional CPU cloud/cluster solutions and it works. But I'm a little confused about how to leverage it with a custom dataset, or how to improve question-answering chat.

In the meantime, I'm spending my time on Ollama and Ollama Web-UI with my custom build, and then perhaps I'll add this Baby Llama to my collection of models to interface with the UI.

I'm sure Karpathy is very busy these days at OpenAI, so it's good to see others taking a serious look at this project.

I'd like to encourage everyone to be open about their work, since this pure C port is very promising, for the benefit of not just developers but everyone in the world.