karpathy / llama2.c

Inference Llama 2 in one file of pure C

Not an issue: Asking for help

Hjertesvikt opened this issue

Hi all, and thanks for this marvelous piece of code.
I wanted to get my feet wet with LLMs, so I set out to train a new LLM with this code on an old Ubuntu machine with an old Nvidia GTX 750 card.
From 250 MB of text I managed to create a giant JSON file (though I mixed up the keys and values).
Then I split this giant JSON into 10 shards (sketched below).
Since I placed the shards on my computer manually, there was no need for:
python tinystories.py download
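The splitting itself was nothing fancy; something like this sketch does it, assuming the giant file is a JSON list of records. The file names (all_data.json, data/TinyStories_all_data/) are illustrative, just my understanding of where tinystories.py looks for shards:

import json

# Split one big JSON list into 10 shard files.
# File names here are illustrative, not the exact ones I used.
with open("all_data.json") as f:
    records = json.load(f)

num_shards = 10
shard_size = (len(records) + num_shards - 1) // num_shards
for i in range(num_shards):
    shard = records[i * shard_size : (i + 1) * shard_size]
    with open(f"data/TinyStories_all_data/data{i:02d}.json", "w") as f:
        json.dump(shard, f)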

Then (after some trial and error, and after customizing tinystories.py to run on CPU only) I ran:
python tinystories.py train_vocab --vocab_size=4096
python tinystories.py pretokenize --vocab_size=4096

The script complains if I run it without --vocab_size=4096.

This time the script didn't complain, and I got a tok4096.model file in the data folder.
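If I understand tinystories.py correctly, train_vocab trains a sentencepiece model, so the result can be sanity-checked by round-tripping some text (the path assumes the default data folder):

import sentencepiece as spm

# Load the freshly trained tokenizer and round-trip a sample sentence.
sp = spm.SentencePieceProcessor(model_file="data/tok4096.model")
print(sp.vocab_size())   # should print 4096
ids = sp.encode("Once upon a time")
print(ids)
print(sp.decode(ids))    # should reproduce the input text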

Then I ran:
python train.py --vocab_source=custom --vocab_size=4096

This is where I got this error:
jplr@jplr-Station:~/Documents/llama2.c$ python3 train.py --vocab_source=custom --vocab_size=4096
Download done.
Number of shards: 10
Example story:
{'text': 'story', 'label': '###24491034. background\tthe emergence of hiv as a chronic condition means that people living with hiv are required to take more responsibility for the self-management of their condition , including making physical , emotional and social adjustments .'}
Overriding: vocab_source = custom
Overriding: vocab_size = 4096
tokens per iteration will be: 131,072
breaks down as: 4 grad accum steps * 1 processes * 128 batch size * 256 max seq len
Initializing a new model from scratch
num decayed parameter tensors: 43, with 7,151,616 parameters
num non-decayed parameter tensors: 13, with 3,744 parameters
using fused AdamW: False
compiling the model... (takes a ~minute)
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
step 0: train loss 8.3395, val loss 8.3400
Traceback (most recent call last):
File "train.py", line 324, in
scaler.scale(loss).backward()
File "/home/jplr/.local/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 203, in scale
assert outputs.is_cuda or outputs.device.type == "xla"
AssertionError

jplr@jplr-Station:~/Documents/llama2.c$

Any suggestion would be greatly appreciated!

I think it's because I wanted to train on CPU, not GPU, as my GPU is not very powerful.
I removed Torch and installed the CPU-only build, and it seems to work now.
I'll wait a few more minutes, and if training works I'll close this issue.
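For anyone hitting the same assert: as far as I can tell it comes from torch.cuda.amp.GradScaler, whose scale() only accepts CUDA (or XLA) tensors, so a loss computed on CPU fails exactly there. A minimal sketch of the behavior; when the scaler is constructed with enabled=False, scale() is a passthrough and CPU tensors are fine:

import torch

# GradScaler is only meaningful for float16 training on CUDA.
# With enabled=False, scale()/step()/update() become no-ops,
# so a CPU loss passes through without the is_cuda assertion.
scaler = torch.cuda.amp.GradScaler(enabled=False)

loss = torch.tensor(2.0, requires_grad=True)
scaler.scale(loss).backward()  # passthrough; no assert on CPU
print(loss.grad)               # tensor(1.)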

So the problem was that I wanted to train on CPU, but I had the standard CUDA build of Torch, which kept the script's GPU-oriented mixed-precision path (the GradScaler) active even though my loss lived on the CPU.
The solution was to remove that Torch version and install the CPU-only build of Torch instead; with CUDA unavailable, Torch disables the scaler on its own.
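For reference, the CPU-only wheels can be installed with something like:

pip install torch --index-url https://download.pytorch.org/whl/cpu

(that index URL is the one PyTorch's install page gives for CPU builds). If I read train.py right, overriding --device=cpu and a non-float16 --dtype (so the GradScaler stays disabled) might also avoid the GPU code path without swapping Torch builds, but I haven't verified that.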