luyug / Condenser

EMNLP 2021 - Pre-training architectures for dense retrieval

Unable to resume CoCondenser pretraining

eugene-yang opened this issue · comments

The model checkpoints seem to be hard-coded to load as BertForMaskedLM and cannot be loaded back into the CoCondenser class.
Adding the following attributes in the initialization gets past the exceptions, but not all of the weights are loaded.

self._keys_to_ignore_on_save = None
self._keys_to_ignore_on_load_missing = None

Is there a way to resume training after interruptions?
Thanks!

Please elaborate on the issue. Include what you did, what worked and what did not work, error messages, etc.

Here is the way to reproduce the exception.

I first started training from the model downloaded from Hugging Face:

HF_DATASETS_CACHE="/expscratch/eyang/cache/datasets" TOKENIZERS_PARALLELISM="false"\
  python run_co_pre_training.py \
  --output_dir ./test/bert-base-cased/ \
  --model_name_or_path bert-base-cased \
  --do_train \
  --fp16 \
  --save_steps 1 \
  --save_total_limit 10 \
  --model_type bert \
  --per_device_train_batch_size 256 \
  --cache_chunk_size 12 \
  --gradient_accumulation_steps 1 \
  --warmup_ratio 0.1 \
  --learning_rate 1e-5 \
  --num_train_epochs 8 \
  --dataloader_drop_last \
  --overwrite_output_dir \
  --dataloader_num_workers 10 \
  --n_head_layers 2 \
  --skip_from 6 \
  --max_seq_length 180 \
  --train_path ./processed_text/msmarco-document.span-90.tokenized-bert-base_incomplete.jsonl \
  --weight_decay 0.01 \
  --late_mlm

Then I tried to resume from the checkpoint saved at the first step:

HF_DATASETS_CACHE="/expscratch/eyang/cache/datasets" TOKENIZERS_PARALLELISM="false"\
  python run_co_pre_training.py \
  --output_dir ./test/bert-base-cased/ \
  --model_name_or_path ./test/bert-base-cased/checkpoint-1 \
  --do_train \
  --fp16 \
  --save_steps 100 \
  --save_total_limit 10 \
  --model_type bert \
  --per_device_train_batch_size 256 \
  --cache_chunk_size 12 \
  --gradient_accumulation_steps 1 \
  --warmup_ratio 0.1 \
  --learning_rate 1e-5 \
  --num_train_epochs 8 \
  --dataloader_drop_last \
  --overwrite_output_dir \
  --dataloader_num_workers 10 \
  --n_head_layers 2 \
  --skip_from 6 \
  --max_seq_length 180 \
  --train_path ./processed_text/msmarco-document.span-90.tokenized-bert-base_incomplete.jsonl \
  --weight_decay 0.01 \
  --late_mlm

and here is the exception.

[INFO|tokenization_utils_base.py:1671] 2021-12-13 16:30:45,404 >> Didn't find file ./test/bert-base-cased/checkpoint-1/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:30:45,404 >> loading file ./test/bert-base-cased/checkpoint-1/vocab.txt
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:30:45,404 >> loading file ./test/bert-base-cased/checkpoint-1/tokenizer.json
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:30:45,404 >> loading file None
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:30:45,404 >> loading file ./test/bert-base-cased/checkpoint-1/special_tokens_map.json
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:30:45,404 >> loading file ./test/bert-base-cased/checkpoint-1/tokenizer_config.json
[INFO|modeling_utils.py:1350] 2021-12-13 16:30:45,426 >> loading weights file ./test/bert-base-cased/checkpoint-1/pytorch_model.bin
[INFO|modeling_utils.py:1619] 2021-12-13 16:30:47,089 >> All model checkpoint weights were used when initializing BertForMaskedLM.

[INFO|modeling_utils.py:1627] 2021-12-13 16:30:47,089 >> All the weights of BertForMaskedLM were initialized from the model checkpoint at ./test/bert-base-cased/checkpoint-1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
12/13/2021 16:30:47 - INFO - modeling -   loading extra weights from local files
12/13/2021 16:30:47 - INFO - trainer -   Initializing Gradient Cache Trainer
[INFO|trainer.py:439] 2021-12-13 16:30:51,616 >> Using amp half precision backend
/home/hltcoe/eyang/.conda/envs/pretrain/lib/python3.8/site-packages/transformers/trainer.py:1059: FutureWarning: `model_path` is deprecated and will be removed in a future version. Use `resume_from_checkpoint` instead.
  warnings.warn(
[INFO|trainer.py:1089] 2021-12-13 16:30:51,618 >> Loading model from ./test/bert-base-cased/checkpoint-1).
Traceback (most recent call last):
  File "run_co_pre_training.py", line 227, in <module>
    main()
  File "run_co_pre_training.py", line 217, in main
    trainer.train(model_path=model_path)
  File "/home/hltcoe/eyang/.conda/envs/pretrain/lib/python3.8/site-packages/transformers/trainer.py", line 1108, in train
    self._load_state_dict_in_model(state_dict)
  File "/home/hltcoe/eyang/.conda/envs/pretrain/lib/python3.8/site-packages/transformers/trainer.py", line 1484, in _load_state_dict_in_model
    if self.model._keys_to_ignore_on_save is not None and set(load_result.missing_keys) == set(
  File "/home/hltcoe/eyang/.conda/envs/pretrain/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'CoCondenserForPretraining' object has no attribute '_keys_to_ignore_on_save'

We can get past this exception by adding the two attributes here.
https://github.com/luyug/Condenser/blob/main/modeling.py#L177
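
For concreteness, this is roughly where the two attributes would go, assuming the linked line sits inside the __init__ of CoCondenserForPretraining (the surrounding signature below is only a sketch of my local change, not the exact code in modeling.py):

import torch.nn as nn

class CoCondenserForPretraining(nn.Module):  # sketch; the real base class is whatever modeling.py uses
    def __init__(self, *args, **kwargs):
        super().__init__()
        # ... existing initialization from modeling.py ...
        # Attributes that newer transformers Trainer versions expect on the model;
        # without them, Trainer._load_state_dict_in_model raises the AttributeError shown above.
        self._keys_to_ignore_on_save = None
        self._keys_to_ignore_on_load_missing = None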

After executing the same command above again, here are the warnings (clipped, but they cover essentially all the layers):

[INFO|tokenization_utils_base.py:1671] 2021-12-13 16:34:43,748 >> Didn't find file ./test/bert-base-cased/checkpoint-1/added_tokens.json. We won't load it.                                                                                                                                                                                                                                   
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:34:43,749 >> loading file ./test/bert-base-cased/checkpoint-1/vocab.txt                                                                                                                                                                                                                                                                  
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:34:43,749 >> loading file ./test/bert-base-cased/checkpoint-1/tokenizer.json                                                                                                                                                                                                                                                             
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:34:43,749 >> loading file None                                                                                                                                                                                                                                                                                                           
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:34:43,749 >> loading file ./test/bert-base-cased/checkpoint-1/special_tokens_map.json                                                                                                                                                                                                                                                    
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:34:43,749 >> loading file ./test/bert-base-cased/checkpoint-1/tokenizer_config.json                                                                                                                                                                                                                                                      
[INFO|modeling_utils.py:1350] 2021-12-13 16:34:43,770 >> loading weights file ./test/bert-base-cased/checkpoint-1/pytorch_model.bin                                                                                                                                                                                                                                                           
[INFO|modeling_utils.py:1619] 2021-12-13 16:34:45,435 >> All model checkpoint weights were used when initializing BertForMaskedLM.                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                              
[INFO|modeling_utils.py:1627] 2021-12-13 16:34:45,435 >> All the weights of BertForMaskedLM were initialized from the model checkpoint at ./test/bert-base-cased/checkpoint-1.                                                                                                                                                                                                                
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForMaskedLM for predictions without further training.                                                                                                                                                                                                                                 
12/13/2021 16:34:45 - INFO - modeling -   loading extra weights from local files                                                                                                                                                                                                                                                                                                              
12/13/2021 16:34:45 - INFO - trainer -   Initializing Gradient Cache Trainer                                                                                                                                                                                                                                                                                                                  
[INFO|trainer.py:439] 2021-12-13 16:34:49,899 >> Using amp half precision backend                                                                                                                                                                                                                                                                                                             
/home/hltcoe/eyang/.conda/envs/pretrain/lib/python3.8/site-packages/transformers/trainer.py:1059: FutureWarning: `model_path` is deprecated and will be removed in a future version. Use `resume_from_checkpoint` instead.                                                                                                                                                                    
  warnings.warn(                                                                                                                                                                                                                                                                                                                                                                              
[INFO|trainer.py:1089] 2021-12-13 16:34:49,901 >> Loading model from ./test/bert-base-cased/checkpoint-1).                                                                                                                                                                                                                                                                                    
[WARNING|trainer.py:1489] 2021-12-13 16:34:50,315 >> There were missing keys in the checkpoint model loaded: ['co_target', 'lm.bert.embeddings.position_ids', 'lm.bert.embeddings.word_embeddings.weight', 'lm.bert.embeddings.position_embeddings.weight', 'lm.bert.embeddings.token_type_embeddings.weight', 'lm.bert.embeddings.LayerNorm.weight', 'lm.bert.embeddings.LayerNorm.bias', 'lm
.bert.encoder.layer.0.attention.self.query.weight', 'lm.bert.encoder.layer.0.attention.self.query.bias', 'lm.bert.encoder.layer.0.attention.self.key.weight', 'lm.bert.encoder.layer.0.attention.self.key.bias', 'lm.bert.encoder.layer.0.attention.self.value.weight', 'lm.bert.encoder.layer.0.attention.self.value.bias', 'lm.bert.encoder.layer.0.attention.output.dense.weight', 'lm.bert
.encoder.layer.0.attention.output.dense.bias', 'lm.bert.encoder.layer.0.attention.output.LayerNorm.weight', 'lm.bert.encoder.layer.0.attention.output.LayerNorm.bias', 'lm.bert.encoder.layer.0.intermediate.dense.weight', 'lm.bert.encoder.layer.0.intermediate.dense.bias', 'lm.bert.encoder.layer.0.output.dense.weight', 'lm.bert.encoder.layer.0.output.dense.bias', 'lm.bert.encoder.la
yer.0.output.LayerNorm.weight', 'lm.bert.encoder.layer.0.output.LayerNorm.bias', 'lm.bert.encoder.layer.1.attention.self.query.weight', 'lm.bert.encoder.layer.1.attention.self.query.bias', 'lm.bert.encoder.layer.1.attention.self.key.weight', 'lm.bert.encoder.layer.1.attention.self.key.bias', 'lm.bert.encoder.layer.1.attention.self.value.weight', 'lm.bert.encoder.layer.1.attention
.self.value.bias', 'lm.bert.encoder.layer.1.attention.output.dense.weight', 'lm.bert.encoder.layer.1.attention.output.dense.bias', 'lm.bert.encoder.layer.1.attention.output.LayerNorm.weight', 'lm.bert.encoder.layer.1.attention.output.LayerNorm.bias', 'lm.bert.encoder.layer.1.intermediate.dense.weight', 'lm.bert.encoder.layer.1.intermediate.dense.bias', 'lm.bert.encoder.layer.1.ou
tput.dense.weight', 'lm.bert.encoder.layer.1.output.dense.bias', 'lm.bert.encoder.layer.1.output.LayerNorm.weight', 'lm.bert.encoder.layer.1.output.LayerNorm.bias', 'lm.bert.encoder.layer.2.attention.self.query.weight', 'lm.bert.encoder.layer.2.attention.self.query.bias', 'lm.bert.encoder.layer.2.attention.self.key.weight', 'lm.bert.encoder.layer.2.attention.self.key.bias', 'lm.b
ert.encoder.layer.2.attention.self.value.weight', 'lm.bert.encoder.layer.2.attention.self.value.bias', 'lm.bert.encoder.layer.2.attention.output.dense.weight', 'lm.bert.encoder.layer.2.attention.output.dense.bias', 'lm.bert.encoder.layer.2.attention.output.LayerNorm.weight', 'lm.bert.encoder.layer.2.attention.output.LayerNorm.bias', 'lm.bert.encoder.layer.2.intermediate.dense.wei
ght', 'lm.bert.encoder.layer.2.intermediate.dense.bias', 'lm.bert.encoder.layer.2.output.dense.weight', 'lm.bert.encoder.layer.2.output.dense.bias', 'lm.bert.encoder.layer.2.output.LayerNorm.weight', 'lm.bert.encoder.layer.2.output.LayerNorm.bias', 'lm.bert.encoder.layer.3.attention.self.query.weight', 'lm.bert.encoder.layer.3.attention.self.query.bias', 'lm.bert.encoder.layer.3.
attention.self.key.weight', 'lm.bert.encoder.layer.3.attention.self.key.bias', 'lm.bert.encoder.layer.3.attention.self.value.weight', 'lm.bert.encoder.layer.3.attention.self.value.bias', ...

The attribute _keys_to_ignore_on_save was introduced in a relatively recent release of HF transformers. Maybe I should patch the repo, but for now there are a few easy things you can do:

  • use an earlier version of transformers. I used 4.2.0 in my experiments.
  • set model_path=None here (see the sketch after this list).
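
Concretely for the second option, a minimal sketch of the change, assuming the call site looks like the one in your traceback (run_co_pre_training.py calls trainer.train(model_path=model_path)):

# In run_co_pre_training.py (sketch of the edit, not a verbatim diff):
# trainer.train(model_path=model_path)  # current call: the Trainer re-loads the checkpoint itself
trainer.train(model_path=None)          # let the model class handle weight loading instead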

Thank you for the reply!
Isn't setting model_path=None basically telling the trainer to start from scratch and ignore the checkpoint?

Would it make more sense to put the path of the checkpoint we want to resume from here (like ./test/bert-base-cased/checkpoint-1 in the example) and leave model_name_or_path as the original model (bert-base-cased in the example)?
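
Roughly what I have in mind, as a sketch (the checkpoint path is just the example from the commands above, and the exact call site is an assumption on my end):

# Keep --model_name_or_path bert-base-cased so the model initializes as before,
# and hand the checkpoint directory to the Trainer to restore training state:
trainer.train(resume_from_checkpoint="./test/bert-base-cased/checkpoint-1")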

Thank you for the reply! Isn't setting model_path=None basically telling the trainer to start from scratch and ignore the checkpoint?

Yes, and the CoCondenser object will do the loading. You will see a log line when it does so. Letting the CoCondenser class do the loading makes sure that we can handle multiple load scenarios.

This is more or less a workaround. Eventually, I probably need to patch the CondenserPreTrainer class so that it no longer loads model weights.
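
One possible shape for that patch, purely as a sketch of the intent (this override is not in the repo; the method name comes from the traceback above):

from transformers import Trainer

class CondenserPreTrainer(Trainer):  # sketch; the real class lives in trainer.py
    def _load_state_dict_in_model(self, state_dict):
        # The model's own from_pretrained already restored the weights,
        # so skip the Trainer's reload of the (BertForMaskedLM-shaped) state dict.
        pass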

Maybe I am missing something, but from what I can tell, using model_path=None and loading the model through CondenserForPretraining does exactly the same thing as trainer.train(resume_from_checkpoint=model_args.model_name_or_path). You just get rid of the warning, but the loading should be exactly the same. If you print missing_keys from the custom from_pretrained classmethod of CondenserForPretraining, you'll see it contains the same keys that are logged in the warning.
Maybe ignoring those keys on save is a cleaner solution, but in the end, it should not change anything about the training.