resistzzz / Prompt4NR

Source code for SIGIR 2023 paper: Prompt Learning for News Recommendation

torch: process 0 terminated with signal SIGKILL

Season0518 opened this issue · comments

Why does torch throw an exception when I try to reproduce the experimental results?

Traceback (most recent call last):
  File "main-multigpu.py", line 422, in <module>
    join=True)
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 146, in join
    signal_name=name
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
/home/season/miniconda3/envs/torch/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 15 leaked semaphores to clean up at shutdown
  len(cache))

Full Log

(torch) season@SeasonPC:~/Prompt4NR-main/Continuous-Action$ ./run.sh
| distributed init rank 0
Namespace(batch_size=16, data_path='../DATA/MIND-Small', device='cuda', epochs=4, gpu=0, log=True, log_file='./log-Small/bs16-Tbs100-lr2e-05-n333-85460.txt', lr=2e-05, max_his=50, max_his_len=450, max_tokens=500, model_name='bert-base-uncased', model_save=True, num_conti1=3, num_conti2=3, num_conti3=3, num_negs=4, rank=0, save_dir='./model_save/2023-07-20-04-43-13', test_batch_size=100, wd=0.001, world_size=1)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Token indices sequence length is longer than the specified maximum sequence length for this model (577 > 512). Running this sequence through the model will result in indexing errors


Vocabulary size of tokenizer after adding new tokens : 30532
num train: 1122680      num val: 286521
[P1][P2][P3][nsep] joe biden reportedly denied communion at a south carolina church [nsep] former us senator kay hagan dead at 66 [nsep] robert evans chinatown producer and paramount chief dies at 89 [nsep] this wedding photo of a canine best man captures just [nsep] michigan sends breakup tweet to notre dame as series goes [nsep] four flight attendants were arrested in miamis airport after bringing [nsep] rosie odonnell barbara walters isnt up to speaking to people [nsep] three takeaways from yankees alcs game 5 victory over the [nsep] wheel of fortune guest delivers hilarious off the rails introduction[SEP][Q1][Q2][Q3]Charles Rogers former Michigan State football Detroit Lions star dead at 38[SEP][M1][M2][M3][MASK]
[P1][P2][P3][nsep] former trump adviser who testified to ukraine pressure campaign said [nsep] video captures terrifying moment woman slips at grand canyon [nsep] uptown bar sued by songwriters organization for allegedly failing to [nsep] garth brooks got called out by jimmy carter for taking [nsep] heidi klums 2019 halloween costume transformation is mindblowing [nsep] we all can get it wrong mike pompeo suggests william [nsep] cummings widow responds to trumps attacks gets standing ovation [nsep] mitch mcconnell snubbed by elijah cummings pallbearer in handshake line [nsep] as president trump says wheres the whistleblower [nsep] 9 signs of disease that are written all over your [nsep] the decisions that have backfired on the yankees in the [nsep] ohio voters express angst over impeachment [nsep] school district reverses transgenderfriendly bathroom policy amid death threats [nsep] fec chairwoman has concerns about foreign entities interfering in 2020 [nsep] frequent urination at night a sign of serious health problem [nsep] a texas mom is going to prison after putting her [nsep] hunter biden steps down from chinese board as trump attacks [nsep] texas police officer shoots woman to death inside her home [nsep] a kansas 13yearold was charged with a felony for pointing [nsep] sondland faces local backlash denying trump deal in ukraine [nsep] the latest powerful typhoon makes landfall in japan[SEP][Q1][Q2][Q3]How Russia Meddles Abroad for Profit Cash Trolls and a Cult Leader[SEP][M1][M2][M3][MASK]
--------------------------------------------------------------------
start training:  2023-07-20 04:51:16.053689
Epoch:  0
lr: 2e-05
  0%|                                                                                         | 0/70168 [00:00<?, ?it/s]Traceback (most recent call last):
  File "main-multigpu.py", line 422, in <module>
    join=True)
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 146, in join
    signal_name=name
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
/home/season/miniconda3/envs/torch/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 15 leaked semaphores to clean up at shutdown
  len(cache))
| distributed init rank 0
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Token indices sequence length is longer than the specified maximum sequence length for this model (565 > 512). Running this sequence through the model will result in indexing errors
Namespace(data_path='../DATA/MIND-Small', device='cuda', log=True, log_file='./log-Test-Small/Tbs100-n333-19670.txt', max_his=50, max_his_len=450, max_tokens=500, model_file='./temp/BestModel.pt', model_name='bert-base-uncased', num_conti1=3, num_conti2=3, num_conti3=3, rank=0, test_batch_size=100, vocab_size=30532, world_size=1)
Vocabulary size of tokenizer after adding new tokens : 30532
[P1][P2][P3][nsep] donald trump jr reflects on explosive view chat i dont [nsep] the rocks gnarly palm is a testament to life without [nsep] alexandria ocasiocortez sincerely apologizes for blocking exbrooklyn politician on twitter [nsep] hundreds of thousands of people in california are downriver of [nsep] queen elizabeth finally had her dream photoshoot thanks to royal [nsep] felicity huffman smiles as she begins community service following prison [nsep] celebrity kids then and now see how theyve grown [nsep] bruce willis brought demi moore to tears after reading her [nsep] lori loughlin is absolutely terrified after being hit with new [nsep] this restored 1968 winnebago is beyond adorable [nsep] tiffanys is selling a holiday advent calendar for 112000 [nsep] outer banks storms unearth old shipwreck from graveyard of the [nsep] felicity huffman begins prison sentence for college admissions scam [nsep] hard rock hotel new orleans collapse former site engineer weighs [nsep] wheel of fortune guest delivers hilarious off the rails introduction[SEP][Q1][Q2][Q3]Opinion Colin Kaepernick is about to get what he deserves a chance[SEP][M1][M2][M3][MASK]
num test: 2658091
Traceback (most recent call last):
  File "predict.py", line 273, in <module>
    join=True)
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/season/Prompt4NR-main/Continuous-Action/predict.py", line 187, in ddp_main
    net.module.load_state_dict(torch.load(args.model_file, map_location=map_location))
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/serialization.py", line 771, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/serialization.py", line 270, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/season/miniconda3/envs/torch/lib/python3.7/site-packages/torch/serialization.py", line 251, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './temp/BestModel.pt'

Toolchain Versions

  • Ubuntu 22.04
  • python==3.7.16
  • pytorch==1.13.0
  • cuda==11.6
  • transformers==4.27.0
(torch) season@SeasonPC:~/Prompt4NR-main/Continuous-Action$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

(torch) season@SeasonPC:~/Prompt4NR-main/Continuous-Action$ nvidia-smi
Thu Jul 20 05:14:52 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01    Driver Version: 511.23       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   52C    P0    13W / 120W |   2814MiB /  6144MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        23      G   /Xwayland                       N/A      |
+-----------------------------------------------------------------------------+

The './temp/BestModel.pt' error happens because training never finished, so no model was saved; without a saved checkpoint, the test stage has nothing to load and cannot run inference.
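
As a side note, a small guard before the load call in predict.py would make this failure mode explicit instead of raising a FileNotFoundError from deep inside torch.load. This is only a sketch, not code from the repo; args.model_file matches the Namespace shown in the log above.

import os
import sys
import torch

def load_checkpoint(net, model_file, map_location):
    # Fail with a clear message if training never produced a checkpoint.
    if not os.path.isfile(model_file):
        sys.exit(f"Checkpoint '{model_file}' not found - finish training (which saves the best model) "
                 "before running predict.py.")
    net.module.load_state_dict(torch.load(model_file, map_location=map_location))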

As for the "process 0 terminated with signal SIGKILL" problem, all I can say is that I have never run into it myself, so I don't know what is going on there.
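
For what it's worth, a SIGKILL on a spawned worker usually means the Linux out-of-memory killer terminated the process because host RAM (not GPU memory) ran out; dmesg normally shows an "Out of memory: Killed process ..." entry when that happens. One low-risk thing to try is reducing the host-side memory pressure of data loading. This is only a sketch with a placeholder train_dataset; the actual DataLoader arguments in main-multigpu.py may differ.

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,       # placeholder for the repo's training dataset object
    batch_size=8,        # smaller than the bs16 used in run.sh
    shuffle=True,
    num_workers=0,       # no extra worker processes, so no duplicated data in RAM
    pin_memory=False,    # pinned host buffers add to RAM usage
)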

So has this been solved? I ran into the same problem, and none of the changes I tried helped.

My email notifications glitched and I didn't see the replies. Really sorry about that, everyone.

My final solution was to switch to an A100 for training and reinstall the environment from scratch. If that is not possible, you can try reducing the batch size, although the results may get worse.
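
If a smaller batch size is the only option, gradient accumulation can keep the effective batch size at the original 16 while holding fewer samples in memory at once. A minimal sketch with placeholder names (net, optimizer, loss_fn, train_loader), not the repo's actual training loop:

accum_steps = 4                                         # 4 micro-batches of 4 ~ effective batch size 16
optimizer.zero_grad()
for step, (inputs, labels) in enumerate(train_loader):
    loss = loss_fn(net(inputs), labels) / accum_steps   # scale so summed gradients match one big batch
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()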

Transformers are very memory-hungry, so my personal guess is that the GPU simply wasn't good enough, since switching machines fixed it. Reproducing this paper was a task another classmate asked me to take on, and getting training to run normally was enough, so I didn't dig any deeper. If there are other questions, I'll open the comments again.
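
If GPU memory really is the limiting factor, mixed-precision training is another option that noticeably reduces the activation memory of a BERT-sized model. A sketch using PyTorch's built-in AMP (available in the pytorch==1.13.0 listed above); net, optimizer, loss_fn and train_loader are placeholders, not names from the repo.

import torch

scaler = torch.cuda.amp.GradScaler()
for inputs, labels in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # run the forward pass in float16 where it is safe
        loss = loss_fn(net(inputs), labels)
    scaler.scale(loss).backward()                # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()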