prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit

Pre-training hangs

jaspock opened this issue · comments

I run bash examples/create_tokenizer.sh and then the pre-training example script (which launches pretrain_nmt.py), but the latter shows

IP address is localhost
Monolingual training files are: {'hi': 'examples/data/train.hi', 'en': 'examples/data/train.en', 'vi': 'examples/data/train.vi'}
Sharding files into 1 parts
For language: hi  the total number of lines are: 18088 and number of lines per shard are: 18088
File for language hi has been sharded.
For language: en  the total number of lines are: 18088 and number of lines per shard are: 18088
File for language en has been sharded.
For language: vi  the total number of lines are: 18088 and number of lines per shard are: 18088
File for language vi has been sharded.
Sharding files into 1 parts

and then hangs without showing anything else. If I press ^C to cancel, the following traceback is shown:

  File "pretrain_nmt.py", line 888, in <module>
    run_demo()
  File "pretrain_nmt.py", line 885, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 101, in join
    timeout=timeout,
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)

I am running YANMTT in a Docker container on a machine with an A100 40 GB GPU. The only dependency for which I am using a newer version is torch, as the version in requirements.txt is too old for my GPU.

Hi,

I am not entirely sure why this happens, but let me take a stab. It is most likely related to the --ipaddr flag and line 884 in pretrain_nmt.py, which is "os.environ['MASTER_PORT'] = '26023'".

It is possible that the default value of --ipaddr, localhost, is an issue inside Docker. Or it might be the case that 26023 is a bad port that is already in use. Basically, the process seems to be waiting for something, so playing with these settings may help.
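To test the bad-port hypothesis quickly, a small stdlib-only check like the one below (an illustrative helper, not part of YANMTT; only the port number 26023 comes from the script) can tell you whether anything on the machine already holds the port:

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind to (host, port), i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print("port 26023 free:", port_is_free(26023))
```

If the port turns out to be taken, changing the MASTER_PORT value in pretrain_nmt.py to any free port should be enough.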

Other than that, I can suggest trying it outside a Docker environment.

Hope this helps.

This issue seemed to be related to incompatibilities between my CUDA version and the versions of TensorFlow and/or PyTorch in requirements.txt. I have it working now with Python 3.6.8, PyTorch 1.10.1 and TensorFlow 2.4.3.
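For anyone debugging a similar version mismatch, a stdlib-only sketch like this (a hypothetical helper, not part of YANMTT) can confirm which PyTorch build the environment actually sees and whether it can reach the GPU, without crashing when torch is absent:

```python
import importlib.util

def torch_cuda_status():
    """Return a short diagnostic string about the local PyTorch install."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # imported lazily so the check also works without torch
    return "torch {}, CUDA available: {}".format(
        torch.__version__, torch.cuda.is_available()
    )

print(torch_cuda_status())
```

If this prints "CUDA available: False" inside the container, the hang is almost certainly a driver/toolkit mismatch rather than a networking problem.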

Just in case this is useful to someone else, this is the relevant part of my current Dockerfile:

FROM nvcr.io/nvidia/pytorch:20.12-py3

RUN apt-get update && apt-get install -y wget tmux && rm -rf /var/lib/apt/lists/*

WORKDIR /setup
WORKDIR /app

RUN conda update conda
RUN conda create -n yanmtt python=3.6.8

SHELL ["conda", "run", "-n", "yanmtt", "/bin/bash", "-c"]
RUN git clone https://github.com/prajdabre/yanmtt
WORKDIR yanmtt
RUN pip install -r requirements.txt
WORKDIR transformers
RUN python setup.py install
RUN pip install tensorflow==2.4.3
SHELL ["/bin/bash", "-c"]

ENV PYTHONPATH=$PYTHONPATH:/app/yanmtt/transformers
RUN conda install -n yanmtt pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge

WORKDIR /setup
RUN git clone --branch v0.1.95 https://github.com/google/sentencepiece.git
RUN mkdir sentencepiece/build
WORKDIR sentencepiece/build
RUN cmake .. && make -j 4
RUN make install && ldconfig -v

RUN echo 'eval "$(conda shell.bash hook)"' >>~/.bashrc && echo 'conda activate yanmtt' >>~/.bashrc

WORKDIR /app

Oh fantastic. Could you make a contrib folder in the examples folder and write down these points and then send a pull request? It would really help people.