ShannonAI / mrc-for-flat-nested-ner

Code for ACL 2020 paper `A Unified MRC Framework for Named Entity Recognition`

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8

1234565556 opened this issue

Some weights of the model checkpoint at /home/sunpeng/AXJ/MRC/bert/bert-base-uncased were not used when initializing BertQueryNER: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']

  • This IS expected if you are initializing BertQueryNER from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertQueryNER from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of BertQueryNER were not initialized from the model checkpoint at /home/sunpeng/AXJ/MRC/bert/bert-base-uncased and are newly initialized: ['span_embedding.classifier1.weight', 'end_outputs.bias', 'span_embedding.classifier2.weight', 'span_embedding.classifier2.bias', 'end_outputs.weight', 'span_embedding.classifier1.bias', 'start_outputs.bias', 'start_outputs.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    /home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Checkpoint directory /home/sunpeng/AXJ/MRC/outputs/ace2005/warmup0lr2e-5_drop0.3_norm1.0_weight0.1_warmup0_maxlen128 exists and is not empty with save_top_k != 0.All files in this directory will be deleted when a checkpoint is saved!
    warnings.warn(*args, **kwargs)
    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    CUDA_VISIBLE_DEVICES: [0,1]
    Using native 16bit precision.
    Some weights of the model checkpoint at /home/sunpeng/AXJ/MRC/bert/bert-base-uncased were not used when initializing BertQueryNER: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
  • This IS expected if you are initializing BertQueryNER from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertQueryNER from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of BertQueryNER were not initialized from the model checkpoint at /home/sunpeng/AXJ/MRC/bert/bert-base-uncased and are newly initialized: ['end_outputs.weight', 'span_embedding.classifier2.weight', 'end_outputs.bias', 'start_outputs.bias', 'span_embedding.classifier1.weight', 'span_embedding.classifier1.bias', 'span_embedding.classifier2.bias', 'start_outputs.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Using native 16bit precision.
    initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
    initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
    Traceback (most recent call last):
    File "/home/sunpeng/AXJ/MRC//train/mrc_ner_trainer.py", line 429, in <module>
    main()
    File "/home/sunpeng/AXJ/MRC//train/mrc_ner_trainer.py", line 416, in main
    trainer.fit(model)
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in fit
    results = self.accelerator_backend.spawn_ddp_children(model)
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 123, in spawn_ddp_children
    results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 164, in ddp_train
    self.trainer.is_slurm_managing_tasks
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 908, in init_ddp_connection
    torch_distrib.init_process_group(torch_backend, rank=global_rank, world_size=world_size)
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
    RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8
    Traceback (most recent call last):
    File "/home/sunpeng/AXJ/MRC/train/mrc_ner_trainer.py", line 429, in <module>
    main()
    File "/home/sunpeng/AXJ/MRC/train/mrc_ner_trainer.py", line 416, in main
    trainer.fit(model)
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in fit
    self.accelerator_backend.train(model)
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 57, in train
    self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 164, in ddp_train
    self.trainer.is_slurm_managing_tasks
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 908, in init_ddp_connection
    torch_distrib.init_process_group(torch_backend, rank=global_rank, world_size=world_size)
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
    File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
    RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8
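
"unhandled cuda error" is a generic NCCL message and does not say which CUDA call failed. One way to get more detail is to re-run with the NCCL environment variable NCCL_DEBUG=INFO, optionally restricted to a single GPU through CUDA_VISIBLE_DEVICES to check whether the failure is specific to multi-GPU initialization. The sketch below is a minimal illustration, not part of this repository: the helper names `parse_nccl_error` and `debug_env` are hypothetical, and the regular expression is written only against the error line in the log above, so it may not match other NCCL messages.

```python
import os
import re

# Pattern written against the single NCCL error line in the log above;
# other NCCL messages may use a different layout (assumption).
NCCL_ERROR_RE = re.compile(
    r"NCCL error in: (?P<source>\S+):(?P<line>\d+), "
    r"(?P<reason>[^,]+), NCCL version (?P<version>[\d.]+)"
)


def parse_nccl_error(message):
    """Return the fields of an NCCL error line as a dict, or None if it does not match."""
    match = NCCL_ERROR_RE.search(message)
    if match is None:
        return None
    return {
        "source": match.group("source"),
        "line": int(match.group("line")),
        "reason": match.group("reason"),
        "version": match.group("version"),
    }


def debug_env(base_env=None, visible_devices=None):
    """Copy an environment mapping and enable verbose NCCL logging in the copy.

    NCCL_DEBUG and CUDA_VISIBLE_DEVICES are real environment variables;
    this helper itself is hypothetical.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    if visible_devices is not None:
        # Example: [0] limits the run to the first GPU only.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in visible_devices)
    return env


if __name__ == "__main__":
    error_line = (
        "RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159"
        "/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, "
        "NCCL version 2.7.8"
    )
    print(parse_nccl_error(error_line))
    print(debug_env({}, visible_devices=[0]))
```

The dictionary returned by `debug_env` can be passed as the `env` argument of `subprocess.run` when relaunching the training script, so the shell environment itself stays unchanged.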