When running python train_gpt2.py, it errors out after 10 iterations -- is this normal?
JamesHuang2004 opened this issue
(base) billhuang@bh-m1-max llm.c % python train_gpt2.py
using device: mps
loading weights from pretrained gpt: gpt2
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 95.7kB/s]
loading cached tokens in data/tiny_shakespeare_val.bin
/Users/billhuang/TEST/llm.c/train_gpt2.py:333: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:205.)
tokens = torch.from_numpy(tokens)
wrote gpt2_124M.bin
wrote gpt2_124M_debug_state.bin
iteration 0, loss: 5.270007133483887
iteration 1, loss: 4.059707164764404
iteration 2, loss: 3.375124931335449
iteration 3, loss: 2.8007795810699463
iteration 4, loss: 2.3153889179229736
iteration 5, loss: 1.849020004272461
iteration 6, loss: 1.3946489095687866
iteration 7, loss: 0.9991437196731567
iteration 8, loss: 0.6240723729133606
iteration 9, loss: 0.376505047082901
Traceback (most recent call last):
File "/Users/billhuang/TEST/llm.c/train_gpt2.py", line 380, in
y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
File "/Users/billhuang/miniforge3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/Users/billhuang/TEST/llm.c/train_gpt2.py", line 202, in generate
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
RuntimeError: Currently topk on mps works only for k<=16
I was able to do the rest of the steps in the README.md after this, so I assume this is by design?
Consider upgrading torch.
This was reported previously in #8.
Yes, upgrading to PyTorch 2.2 worked fine. @chsasank, do you know what caused this topk error in the old version of PyTorch?
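For anyone stuck on an older PyTorch: the message isn't an overflow, it's the old MPS backend refusing torch.topk for k > 16 (newer releases lifted that limit, which is why 2.2 works). A minimal workaround sketch, assuming the same call shape as the generate() code in the traceback, is to run the top-k on the CPU and move the result back; the helper name here is mine, not the repo's:

```python
import torch

def topk_mps_safe(logits, k):
    # Work around the old MPS limitation ("topk on mps works only for k<=16")
    # by running the top-k on the CPU and moving the result back to the device.
    k = min(k, logits.size(-1))
    if logits.device.type == "mps" and k > 16:
        values, indices = torch.topk(logits.cpu(), k)
        return values.to(logits.device), indices.to(logits.device)
    return torch.topk(logits, k)

# Example mimicking the failing call: top_k=40 over GPT-2's 50257-way logits.
device = "mps" if torch.backends.mps.is_available() else "cpu"
logits = torch.randn(1, 50257, device=device)
v, _ = topk_mps_safe(logits, 40)
logits[logits < v[:, [-1]]] = -float("inf")  # same masking step as in generate()
```

Upgrading PyTorch, as suggested above, is still the cleaner fix.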
Thanks, guys. I just upgraded torch, but when rerunning the command I get an abort and a complaint. Is this normal?
Installing collected packages: mpmath, typing-extensions, sympy, networkx, torch, torchvision, torchaudio
Attempting uninstall: typing-extensions
Found existing installation: typing_extensions 4.4.0
Uninstalling typing_extensions-4.4.0:
Successfully uninstalled typing_extensions-4.4.0
Attempting uninstall: torch
Found existing installation: torch 1.13.1
Uninstalling torch-1.13.1:
Successfully uninstalled torch-1.13.1
Attempting uninstall: torchvision
Found existing installation: torchvision 0.14.1
Uninstalling torchvision-0.14.1:
Successfully uninstalled torchvision-0.14.1
Attempting uninstall: torchaudio
Found existing installation: torchaudio 0.13.1
Uninstalling torchaudio-0.13.1:
Successfully uninstalled torchaudio-0.13.1
Successfully installed mpmath-1.3.0 networkx-3.2.1 sympy-1.12 torch-2.2.2 torchaudio-2.2.2 torchvision-0.17.2 typing-extensions-4.11.0
(base) billhuang@bh-m1-max llm.c % pip show torch
Name: torch
Version: 2.2.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /Users/billhuang/miniforge3/lib/python3.9/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: torchaudio, torchvision
(base) billhuang@bh-m1-max llm.c % python train_gpt2.py
OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
zsh: abort python train_gpt2.py
You seem to be in the base env. You need to double-check that all the required packages are installed and that your paths are set up correctly. There is also a Hint section in the error above that you can use.
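For reference, the hint's own (unsafe, unsupported) escape hatch can also be tried from Python before torch is imported. This is only a sketch of that workaround, not a proper fix; the right solution is still an environment with a single OpenMP runtime:

```python
import os

# Unsafe workaround from the OpenMP hint above: allow two copies of libomp to
# coexist. It must be set before torch (or anything else that loads libomp) is imported.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

import torch  # imported only after the env var is set

print(torch.__version__)  # sanity check that torch still loads
```

Equivalently, run the script with KMP_DUPLICATE_LIB_OK=TRUE set in the shell; either way, treat it as a stopgap, since the hint warns it can produce incorrect results.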
Yes, I am using the base environment; this is by design, out of fear that a separate environment might introduce additional mess. For this specific issue, what could be the problem?
(base) billhuang@bh-m1-max llm.c % python train_gpt2.py
OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
zsh: abort python train_gpt2.py
Closing due to lack of further response.