karpathy / llama2.c

Inference Llama 2 in one file of pure C

How to add a different corpus?

pure-water opened this issue · comments

I want to train on something other than TinyStories. I have a list of plain text files. How do I train on them?

You can reuse the same code as is. Just upload your own JSON file containing the documents you want to train on. Make sure that in tinystories.py you replace the line

text = example["story"]

with

text = example["text"]

or whatever your field is named in your custom JSON.
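
Since the original question mentions a list of plain files, here's a minimal sketch (the paths are placeholders; the "text" field matches the replacement above) that wraps a directory of .txt files into one JSON file in that shape:

import glob
import json

# Wrap plain .txt files into a single JSON list of {"text": ...} documents,
# matching the field name used in tinystories.py above. Paths are placeholders.
docs = []
for path in sorted(glob.glob('my_corpus/*.txt')):
    with open(path, 'r') as f:
        docs.append({"text": f.read().strip()})

with open('custom.json', 'w') as f:
    json.dump(docs, f, indent=2)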

Because this code only works with shards, here's an example of how you'd split the JSON into shards:

import json

def split_json(input_file, output_prefix, n):
    with open(input_file, 'r') as file:
        data = json.load(file)

    # Determine the size of each shard
    shard_size = len(data) // n

    # Split the data into n shards
    shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(n - 1)]
    shards.append(data[(n - 1) * shard_size:])

    # Write each shard to a separate file
    for i, shard_data in enumerate(shards):
        shard_file_path = f'{output_prefix}_{i + 1}.json'
        with open(shard_file_path, 'w') as shard_file:
            json.dump(shard_data, shard_file, indent=2)

# Example usage
input_json_file = 'llama2.c/data/TinyStories_all_data/custom.json'
output_prefix = 'llama2.c/data/TinyStories_all_data/shard'
num_shards = 10

split_json(input_json_file, output_prefix, num_shards)
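
After splitting, it's worth sanity-checking that no documents were lost. One caveat (an assumption about tinystories.py's behavior, so verify against your checkout): if the pretokenization step globs every *.json in the data directory, the original custom.json sitting next to the shards would get picked up too, so consider moving it elsewhere.

import glob
import json

# Sanity check: total documents across shards should equal the original count.
shard_files = sorted(glob.glob('llama2.c/data/TinyStories_all_data/shard_*.json'))
total = sum(len(json.load(open(f))) for f in shard_files)
print(f'{len(shard_files)} shards, {total} documents total')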

What do you mean by "upload", and upload to where?

Thanks. Actually, I already figured out another way of doing it, which turned out to be a lot of fun. By the way, it doesn't necessarily have to be JSON.

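For anyone curious, something along these lines should work (a rough sketch, not exactly what I ran; it assumes the repo's tokenizer.py with its SentencePiece-backed Tokenizer, and placeholder paths): skip JSON entirely and tokenize straight to the .bin format the training script reads.

import glob
import numpy as np
from tokenizer import Tokenizer  # llama2.c's SentencePiece wrapper

# Rough sketch: tokenize plain .txt files straight into a .bin of uint16
# token ids, the on-disk format the repo's pretokenization step produces
# for training. Paths and filenames here are placeholders.
enc = Tokenizer()
all_tokens = []
for path in sorted(glob.glob('my_corpus/*.txt')):
    with open(path, 'r') as f:
        text = f.read().strip()
    all_tokens.extend(enc.encode(text, bos=True, eos=False))

with open('my_corpus.bin', 'wb') as f:
    f.write(np.array(all_tokens, dtype=np.uint16).tobytes())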

It would be great if you could share how you're doing it with this community, or perhaps count me in on your fork; I'd like to see how you got this up and running.

Because, you see, getting even small responses out of a CPU can make our bulk of existing hardware useful. Most of the globe is not yet on GPUs.

I've tested it on traditional CPU cloud/cluster solutions and it works. But I'm a little confused about how to leverage it with a custom dataset, or how to improve question-answering chat.

In the meantime, I'm spending my time on Ollama and Ollama Web-UI with my custom build, and then perhaps I'll add this Baby Llama to my collection of models to interface with the UI.

I'm sure Karpathy is very busy these days at OpenAI, so it's good to see others taking a serious look at this project.

I'd like to encourage everyone to be open about their work, since this pure C port is very promising, for the benefit of not just developers but everyone in the world.