kingoflolz / mesh-transformer-jax

Model parallel transformers in JAX and Haiku

training stuck at validation step 1

Selimonder opened this issue · comments

commented

Hello,

First of all, thank you for the great finetune guide. I followed the guide all the way through and attempted to fine-tune GPT-J on a small dataset of 30-40 MB.

However, I am stuck at the device_train.py step (step 12).

Compilation of the train step, eval step, and network passes, and the first weights are written to the bucket.

It seems the code freezes at the `out = network.eval(inputs)` line inside the `eval_step` function. During compilation, data passes through `eval_step` fine, but when the actual training starts it freezes.

Did anyone stumble upon a similar issue?
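For anyone puzzled by "it works during compilation but hangs during training": this is a minimal JAX sketch (not code from this repo) of why the two phases can behave differently. On the first call, `jax.jit` traces the function with abstract values and compiles it; only later calls actually run the compiled computation on the device, which is where a broken TPU runtime can hang.

```python
import jax
import jax.numpy as jnp

@jax.jit
def eval_step(x):
    # Tracing/compilation happens on the first call with this shape;
    # the hang described in this issue occurs when the compiled
    # computation executes on the TPU, not during tracing itself.
    return jnp.mean(x * x)

x = jnp.ones((4, 8))
out = eval_step(x)          # first call: trace + compile + execute
out = eval_step(x)          # later calls: execute only
print(float(out))
```

So a freeze at `network.eval(inputs)` after successful compilation points at the device runtime rather than the model code itself.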

Hi! I've trained GPT-J models in the past, but for some reason I'm now seeing this too. Did you manage to solve it?

Hey @versae, @Selimonder! Did you manage to resolve this issue somehow? I'm facing the same problem right now.

@rinapch for me the key was to select the alpha software version when creating the TPU. Stable releases seem to break the implementation; I'm not sure why.
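For reference, one way to request the alpha software version at TPU creation time is the `--version` flag of `gcloud`. This is only a sketch: the TPU name, zone, and accelerator type below are placeholders, so substitute your own.

```shell
# Placeholder name/zone/accelerator -- adjust to your project.
# --version=v2-alpha selects the alpha software version mentioned above;
# stable versions reportedly cause the hang at eval_step.
gcloud compute tpus tpu-vm create my-gptj-tpu \
  --zone=us-east1-d \
  --accelerator-type=v3-8 \
  --version=v2-alpha
```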

Thanks a lot, @versae! It really did help 🥳