microsoft / UniSumm

UNISUMM: Unified Few-shot Summarization with Multi-Task Pre-Training and Prefix-Tuning


Do not load "ckpt-300000"; instead, use --model_name_or_path facebook/bart-large

Mrwangkedong opened this issue

I don't use the pre-trained ckpt-300000; instead, I set model_name_or_path = facebook/bart-large.

When using the pre-trained model (ckpt-300000), training runs normally. However, when I omit the '--load_from' parameter and use the official 'bart-large' model, loaded by the code snippet below,

```python
model = BartForConditionalGeneration.from_pretrained(args.model_name_or_path)
model.adjust_model(config)
if (config.use_prefix > 0):
    model.add_prefix_module(config)
```

I encountered the following error, which I could not resolve even after searching online.

ERROR:

terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from query at ../aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f29166d17d2 in /public/home/kongfang/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f2969768daa in /public/home/kongfang/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f296976b390 in /public/home/kongfang/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f296976c625 in /public/home/kongfang/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: + 0xc71f (0x7f29cb1d871f in /public/home/kongfang/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0x7ea5 (0x7f29d7dd6ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f29d73f68dd in /lib64/libc.so.6)

Traceback (most recent call last):
  File "run_seq2seq.py", line 758, in <module>
    main()
  File "run_seq2seq.py", line 753, in main
    train(args, training_features, valid_features, model, tokenizer)
  File "run_seq2seq.py", line 279, in train
    model_outputs = model(**inputs)
  File "/public/home/kd/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/kd/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/public/home/kd/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/kd/kdwang/UniSumm/unisumm/nlg-finetune/s2s_ft/modeling_bart.py", line 2091, in forward
    outputs = self.model(
  File "/public/home/kd/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/kd/kdwang/UniSumm/unisumm/nlg-finetune/s2s_ft/modeling_bart.py", line 1727, in forward
    encoder_outputs = self.encoder(
  File "/public/home/kd/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/kd/kdwang/UniSumm/unisumm/nlg-finetune/s2s_ft/modeling_bart.py", line 1264, in forward
    prefix_embeds = self.embed_prefix(prefix_ids) * self.embed_scale
  File "/public/home/kd/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/kd/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/public/home/kd/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from query at ../aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fcc3834f7d2 in /public/home/kongfang/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7fcc8b3e6daa in /public/home/kongfang/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7fcc8b3e9390 in /public/home/kongfang/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7fcc8b3ea625 in /public/home/kongfang/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: + 0xc71f (0x7fccece5671f in /public/home/kongfang/anaconda3/envs/unisumm/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0x7ea5 (0x7fccf9a54ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7fccf90748dd in /lib64/libc.so.6)
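For anyone hitting the same assert: the innermost traceback frame fails inside F.embedding while computing prefix_embeds, which typically means prefix_ids contains an index outside the prefix embedding table. A minimal sketch of how to localize and confirm this (check_prefix_ids is a hypothetical helper; only prefix_ids and embed_prefix come from the traceback above):

```python
import os

# CUDA kernels run asynchronously, so the assert is normally reported at a
# later sync point; blocking launches make the failing op show up directly.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def check_prefix_ids(prefix_ids: torch.Tensor, embed_prefix: torch.nn.Embedding) -> None:
    """Verify that every prefix id indexes a valid row of the embedding table."""
    n = embed_prefix.num_embeddings
    bad = (prefix_ids < 0) | (prefix_ids >= n)
    if bad.any():
        raise ValueError(
            f"prefix_ids out of range: min={int(prefix_ids.min())}, "
            f"max={int(prefix_ids.max())}, embedding table size={n}"
        )
```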


Hi @Mrwangkedong , I may not fully understand your question.

The command provided in the git is to prefix-tune unisumm using general prefix (as described in the paper).

As you say "use the official 'bart-large' model provided by the code snippet below", do you mean that you try to use naive prefix-tuning to train a BART-model?

Yes! In train.sh, I changed "--load_from $LOAD_FROM" to "# --load_from $LOAD_FROM" (i.e., commented it out).


Hi @Mrwangkedong , could you try setting "--task YOUR_TASK"?

Yes! It is running now! Thank you very much!
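For context on why --task fixes the crash, here is a hedged guess based on the traceback: the encoder looks prefix_embeds up from prefix_ids, and those ids are presumably derived from the task name, so without a valid task the ids can point past the end of the prefix embedding table. A toy illustration of that mechanism (all names and sizes are hypothetical, not UniSumm's actual code):

```python
import torch
import torch.nn as nn

# Hypothetical setup: each task owns a contiguous slice of prefix ids.
PREFIX_LEN = 32
TASKS = ["cnndm", "xsum", "wikihow"]  # placeholder task names

def prefix_ids_for(task: str) -> torch.Tensor:
    """Map a task name to its slice of the prefix embedding table."""
    start = TASKS.index(task) * PREFIX_LEN
    return torch.arange(start, start + PREFIX_LEN)

# The table must cover every task's slice; a missing or unknown task name
# that yields ids >= num_embeddings triggers the device-side assert above.
embed_prefix = nn.Embedding(len(TASKS) * PREFIX_LEN, 1024)
ids = prefix_ids_for("xsum")
assert int(ids.max()) < embed_prefix.num_embeddings
prefix_embeds = embed_prefix(ids)  # safe lookup
```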