Pre-training hangs
jaspock opened this issue · comments
I run `bash examples/create_tokenizer.sh` and then the pre-training example script (which launches `pretrain_nmt.py`), but the latter shows:
```
IP address is localhost
Monolingual training files are: {'hi': 'examples/data/train.hi', 'en': 'examples/data/train.en', 'vi': 'examples/data/train.vi'}
Sharding files into 1 parts
For language: hi the total number of lines are: 18088 and number of lines per shard are: 18088
File for language hi has been sharded.
For language: en the total number of lines are: 18088 and number of lines per shard are: 18088
File for language en has been sharded.
For language: vi the total number of lines are: 18088 and number of lines per shard are: 18088
File for language vi has been sharded.
Sharding files into 1 parts
```
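(For context: the sharding messages above amount to splitting each monolingual file into one contiguous chunk per worker, so with a single shard each language's 18088 lines stay in one file. A minimal sketch of that kind of even split, in plain Python and not YANMTT's actual code:)

```python
import math

def shard_lines(lines, num_shards):
    """Split a list of lines into num_shards roughly equal, contiguous chunks."""
    per_shard = math.ceil(len(lines) / num_shards)
    return [lines[i * per_shard:(i + 1) * per_shard] for i in range(num_shards)]

lines = [f"sentence {i}" for i in range(18088)]
shards = shard_lines(lines, 1)       # one shard: everything stays in a single chunk
print(len(shards), len(shards[0]))   # 1 18088
```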
and then hangs without showing anything else. If I press ^C to cancel, the following traceback is shown:
```
  File "pretrain_nmt.py", line 888, in <module>
    run_demo()
  File "pretrain_nmt.py", line 885, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 101, in join
    timeout=timeout,
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
```
I am running YANMTT in a Docker container on a machine with an A100 40 GB GPU. The only dependency for which I am using a newer version is `torch`, as the version in `requirements.txt` is too old for my GPU.
Hi,
I am not entirely sure why this happens, but let me take a stab. It is most likely related to the `--ipaddr` flag and line 884 in `pretrain_nmt.py`, which is `os.environ['MASTER_PORT'] = '26023'`.
It is possible that the default value of `--ipaddr`, `localhost`, is an issue inside Docker. Or it might be that 26023 is a bad port which is already in use. Basically, the process seems to be waiting for something, so playing with these settings may help.
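One quick way to rule out the port theory is to check from inside the container whether anything else is already bound to 26023 (the number comes from the line quoted above; this is a generic stdlib check, not part of YANMTT):

```python
import socket

def port_is_free(host, port):
    """Try to bind to (host, port); success means no other process holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# True if nothing else is listening on the port YANMTT hard-codes
print(port_is_free("localhost", 26023))
```

If this prints `False`, changing the hard-coded `MASTER_PORT` to a free port would be the first thing to try.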
Other than that, I can only suggest that you try running outside a Docker environment.
Hope this helps.
This issue seemed to be caused by incompatibilities between my CUDA version and the versions of TensorFlow and/or PyTorch in `requirements.txt`. I have it working now using Python 3.6.8, PyTorch 1.10.1 and TensorFlow 2.4.3.
Just in case this is useful to someone else, this is the relevant part of my current Dockerfile:
```dockerfile
FROM nvcr.io/nvidia/pytorch:20.12-py3

RUN apt-get update
RUN apt-get install -y wget tmux && rm -rf /var/lib/apt/lists/*
WORKDIR /setup
WORKDIR /app

# Create a dedicated conda environment and run the following steps inside it
RUN conda update conda
RUN conda create -n yanmtt python=3.6.8
SHELL ["conda", "run", "-n", "yanmtt", "/bin/bash", "-c"]

# Install YANMTT and its bundled transformers fork
RUN git clone https://github.com/prajdabre/yanmtt
WORKDIR yanmtt
RUN pip install -r requirements.txt
WORKDIR transformers
RUN python setup.py install
RUN pip install tensorflow==2.4.3
SHELL ["/bin/bash", "-c"]
ENV PYTHONPATH=$PYTHONPATH:/app/yanmtt/transformers

# PyTorch build matching CUDA 11.3
RUN conda install -n yanmtt pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge

# Build sentencepiece from source
WORKDIR /setup
RUN git clone --branch v0.1.95 https://github.com/google/sentencepiece.git
RUN mkdir sentencepiece/build
WORKDIR sentencepiece/build
RUN cmake .. && make -j 4
RUN make install && ldconfig -v

# Activate the environment in interactive shells
RUN echo 'eval "$(conda shell.bash hook)"' >>~/.bashrc && echo 'conda activate yanmtt' >>~/.bashrc
WORKDIR /app
```
Oh fantastic. Could you make a contrib folder inside the examples folder, write down these points, and send a pull request? It would really help people.