prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit

Exception: process 0 terminated with signal SIGSEGV

1029694141 opened this issue · comments

commented

Hi, I ran into a tricky problem with pretrain_nmt.py.

My command:

CUDA_VISIBLE_DEVICES=3 python pretrain_nmt.py -n 1 -nr 0 -g 1 --pretrained_model facebook/bart-base --use_official_pretrained --tokenizer_name_or_path facebook/bart-base --is_summarization --warmup_steps 500 --save_intermediate_checkpoints --mono_src /home/WwhStuGrp/yyfwwhstu16/yanmtt/dataset/pubmed/pubmed-dataset/train_fineshed.txt --monolingual_domains 1 --train_domains 1 --shard_files --batch_size 1024

Here is the traceback:

 File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGSEGV

I tried a few suggested fixes, but they didn't seem to work, including this one:
facebookresearch/fairseq#1720 (comment)

Any advice or solution?
Thank you again for your work on this repo!

Hi,

Can you tell me your use case? Are you planning to do summarization or pretraining? If it is summarization then it is train_nmt.py that you should probably be looking at.

I also recommend playing with learning rates and dropout etc.

Also, I am not sure why you are using the following flags: --monolingual_domains, --is_summarization and --train_domains.
The way you use them now is wrong, and I don't think you need them. These flags were made for my own purposes; I never thought anyone would bother using them :)

Oh, and although it is not actually going to be used, please pass --langs en so my code does not break.
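For example, dropping those flags, keeping your own paths, and adding --langs en, the command would look something like this (illustrative only, not tested on my end):

CUDA_VISIBLE_DEVICES=3 python pretrain_nmt.py -n 1 -nr 0 -g 1 --pretrained_model facebook/bart-base --use_official_pretrained --tokenizer_name_or_path facebook/bart-base --warmup_steps 500 --save_intermediate_checkpoints --mono_src /home/WwhStuGrp/yyfwwhstu16/yanmtt/dataset/pubmed/pubmed-dataset/train_fineshed.txt --shard_files --batch_size 1024 --langs en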

commented

Hi,

I am continuing to pretrain bart-base on my own corpus.

Thanks for your advice!

My latest command still hits the "process 0 terminated with signal SIGSEGV" problem; it seems to be related to CUDA or memory.

My latest command is as follows:
CUDA_VISIBLE_DEVICES=0 python pretrain_nmt.py -n 1 -nr 0 -g 1 --pretrained_model facebook/bart-base --use_official_pretrained --tokenizer_name_or_path facebook/bart-base --warmup_steps 500 --save_intermediate_checkpoints --mono_src /home/WwhStuGrp/yyfwwhstu16/yanmtt/dataset/pubmed/pubmed-dataset/train_fineshed.txt --shard_files --batch_size 1024 --langs en --lr 2e-3 --dropout 0.2

Hi,

Please post the entire log and not just a single line.
I have no idea what the error is at the moment.

What GPU are you using?

commented

Hi, here is the entire log:

(scibart) [yyfwwhstu16@gpu16 yanmtt]$ CUDA_VISIBLE_DEVICES=0 python pretrain_nmt.py -n 1  -nr 0 -g 1 --pretrained_model facebook/bart-base --use_official_pretrained --tokenizer_name_or_path facebook/bart-base --warmup_steps 500 --save_intermediate_checkpoints --mono_src /home/WwhStuGrp/yyfwwhstu16/yanmtt/dataset/pubmed/pubmed-dataset/train_fineshed.txt --shard_files --batch_size 1024 --langs en --lr 2e-3 --dropout 0.2 --port 26001
IP address is localhost
Monolingual training files are: {'en': '/home/WwhStuGrp/yyfwwhstu16/yanmtt/dataset/pubmed/pubmed-dataset/train_fineshed.txt'}
Sharding files into 1 parts
For language: en  the total number of lines are: 133721 and number of lines per shard are: 133721
File for language en has been sharded.
Sharding files into 1 parts
Tokenizer is: PreTrainedTokenizer(name_or_path='facebook/bart-base', vocab_size=50265, model_max_len=1024, is_fast=False, padding_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'sep_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'cls_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True)})
Running DDP checkpoint example on rank 0.
We will do fp32 training
2022-08-05 14:06:00.234219: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:
2022-08-05 14:06:00.234275: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGSEGV

And the nvidia-smi output:


| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   41C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   44C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   42C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   41C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

This environment works fine for other experiments such as fine-tuning BART.

Ah,

It's your environment variable.

2022-08-05 14:06:00.234219: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:

Basically it means that in the path "/usr/local/cuda-10.1/lib64" the file "libcudart.so.10.1" is not found.

A quick google should help you.

However, one solution is to look in the lib64 folder and see whether "libcudart.so.10.1" is there.
If there is only "libcudart.so.10", then just do: ln -s /usr/local/cuda-10.1/lib64/libcudart.so.10 /usr/local/cuda-10.1/lib64/libcudart.so.10.1
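Roughly, the check and fix would be (assuming the default CUDA path from your log; you may need sudo, and the exact path depends on your install):

ls /usr/local/cuda-10.1/lib64 | grep libcudart
# if libcudart.so.10 is listed but libcudart.so.10.1 is not, create a symlink:
sudo ln -s /usr/local/cuda-10.1/lib64/libcudart.so.10 /usr/local/cuda-10.1/lib64/libcudart.so.10.1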

If this doesn't solve it, then please google it. Usually the problem is very simple.

commented

Hi,
Sorry to bother you again.

I checked the environment variable following your advice.
The dynamic library issue is now fixed:
2022-08-06 19:03:21.429157: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

But the SIGSEGV problem still happened at around 1900 batches:

......
1800 7.96 21.8 seconds for 100 batches. Memory used post forward / backward passes: 3.51 / 3.17 GB.
1900 7.62 21.64 seconds for 100 batches. Memory used post forward / backward passes: 3.52 / 3.18 GB.
Traceback (most recent call last):
  File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGSEGV

One thing I noticed:
when I switched to fp16 training, the SIGSEGV is delayed to around 2000 batches:

......
1900 7.83 23.69 seconds for 100 batches. Memory used post forward / backward passes: 3.59 / 3.12 GB.
Saving the model
Loading from checkpoint
2000 7.76 61.52 seconds for 100 batches. Memory used post forward / backward passes: 3.72 / 3.18 GB.
Traceback (most recent call last):
  File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGSEGV

Thank you again for following up on this issue! Best wishes to you!

Hi,

Unfortunately, this seems to be a CUDA issue.
I tried googling this, but the error message is not informative.

One solution I can think of is to use the exact Python and CUDA versions I specified in the requirements of my toolkit.

Another would be to check whether the CPU RAM gets overwhelmed. If your training file is too large and you don't have enough RAM, this could happen.
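To check this while training runs, something simple like the following (standard Linux tools, nothing specific to this toolkit) should be enough:

# in a second terminal, watch memory usage refresh every 5 seconds
watch -n 5 free -h
# or log it to a file so you can see what happens right before the crash
while true; do free -h | grep Mem >> mem_usage.log; sleep 5; done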

Otherwise I have no idea how to help. :(