kingoflolz / mesh-transformer-jax

Model parallel transformers in JAX and Haiku

How to run on v3-128?

soneo1127 opened this issue

Google TRC has given me access to a v3-128, but device_train.py only runs on a v3-8.
Do you know how to train on a v3-128?
Thank you.

I figured out that I probably need to use train.py instead, but I get the following error.

2021-11-25 18:39:20,129 ERROR import_thread.py:88 -- ImportThread: Connection closed by server.
2021-11-25 18:39:20,132 ERROR worker.py:1125 -- listen_error_messages_raylet: Connection closed by server.
2021-11-25 18:39:20,134 ERROR worker.py:465 -- print_logs: Connection closed by server.
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Aborted (core dumped)

Sorry, I solved it myself.
I got it working by starting a new VM and running train.py from there, instead of running it from inside the v3-128 itself.
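
For anyone else who lands here, a rough sketch of the arithmetic behind the v3-8 versus v3-128 difference. It assumes each TPU v3 host exposes 8 cores, so a v3-N slice spans N / 8 hosts. The helper name v3_host_count is hypothetical and is not part of this repo.

```python
# Assumption: one TPU v3 host (VM) exposes 8 cores, so "v3-N" spans N / 8 hosts.
CORES_PER_V3_HOST = 8


def v3_host_count(accelerator_type: str) -> int:
    """Return the number of hosts behind a 'v3-N' accelerator type string."""
    generation, _, cores = accelerator_type.partition("-")
    if generation != "v3" or not cores.isdigit():
        raise ValueError(f"expected something like 'v3-128', got {accelerator_type!r}")
    n_cores = int(cores)
    if n_cores == 0 or n_cores % CORES_PER_V3_HOST != 0:
        raise ValueError(f"core count must be a positive multiple of {CORES_PER_V3_HOST}")
    return n_cores // CORES_PER_V3_HOST


if __name__ == "__main__":
    for name in ("v3-8", "v3-128"):
        print(name, "->", v3_host_count(name), "host(s)")
```

Under that assumption a v3-8 is a single host while a v3-128 is spread over 16, which would be consistent with a single-host script not covering the larger slice and with the job needing to be driven from a separate machine, as described above.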