(Still Wrong) Solved by the PR at https://github.com/microsoft/unilm/pull/670#event-6324997330
AdrienGuo opened this issue · comments
Hi,
I think the issue hasn't been fixed; I still get the same error.
I tried the encoder.pkl downloaded from https://cdn.openai.com/dall-e/encoder.pkl and from https://conversationhub.blob.core.windows.net/beit-share-public/dall-e_vae/encoder.pkl, but both produced the same error below. (I am pretraining on the CIFAR-100 dataset.)
```
Traceback (most recent call last):
  File "run_beit_pretraining.py", line 280, in <module>
    main(opts)
  File "run_beit_pretraining.py", line 175, in main
    d_vae = utils.create_d_vae(
  File "/workspace/beit/beit/utils.py", line 532, in create_d_vae
    return get_dalle_vae(weight_path, image_size, device)
  File "/workspace/beit/beit/utils.py", line 541, in get_dalle_vae
    vae.load_model(model_dir=weight_path, device=device)
  File "/workspace/beit/beit/modeling_discrete_vae.py", line 216, in load_model
    self.encoder = load_model(os.path.join(model_dir, "encoder.pkl"), device)
  File "/workspace/beit/beit/dall_e/__init__.py", line 18, in load_model
    return torch.load(f, map_location=device)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 595, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 764, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '-'.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'run_beit_pretraining.py', '--local_rank=0', '--data_path', './data/cifar100png', '--output_dir', './checkpoint/', '--num_mask_patches', '75', '--model', 'beit_base_patch16_224_8k_vocab', '--discrete_vae_weight_path', './tokenizer/', '--batch_size', '128', '--lr', '1.5e-3', '--warmup_steps', '10000', '--epochs', '150', '--clip_grad', '3.0', '--drop_path', '0.1', '--layer_scale_init_value', '0.1']' returned non-zero exit status 1.
```
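For what it's worth, `invalid load key, '-'` usually means the file handed to `torch.load` is not a pickle at all: its first byte is a literal `-`, which is what you get when the file on disk is, say, a log or an HTML error page instead of the real checkpoint. A minimal sanity check (the helper name is my own, not part of the BEiT repo) that can be run on the downloaded file before starting training:

```python
def looks_like_torch_checkpoint(path):
    """Cheap header check: torch checkpoints are either zip archives
    (newer format, first bytes b'PK') or raw pickles (protocol >= 2,
    first byte 0x80). Anything else will fail inside torch.load."""
    with open(path, "rb") as f:
        head = f.read(2)
    return head.startswith(b"PK") or head.startswith(b"\x80")

# A wget log starts with "--<timestamp>--  <url>", so its first byte is "-"
# and the check correctly rejects it.
```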
The training script is:

```shell
python -m torch.distributed.launch run_beit_pretraining.py \
    --data_path ./data/cifar100png --output_dir ./checkpoint/ --num_mask_patches 75 \
    --model beit_base_patch16_224_8k_vocab --discrete_vae_weight_path ./tokenizer/ \
    --batch_size 128 --lr 1.5e-3 --warmup_steps 10000 --epochs 150 \
    --clip_grad 3.0 --drop_path 0.1 --layer_scale_init_value 0.1
```
I only have one GPU, so I disabled distributed mode.
It turns out it was my fault.
I don't know why, but after I ran wget -o $TOKENIZER_PATH/encoder.pkl https://conversationhub.blob.core.windows.net/beit-share-public/dall-e_vae/encoder.pkl, it generated two files, "encoder.pkl" and "encoder.pkl.1". It's pretty weird (if anybody knows how to deal with this issue, just send me an email, I'd appreciate it!)
The "encoder.pkl.1" is the correct one; my code is pretraining now, thank you!!
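A side note on the two files, in case it helps others: in GNU Wget, lowercase `-o` is `--output-file`, which writes wget's *log* to the given path; the download itself still goes to the default filename, and since `encoder.pkl` was already taken by the log, wget saved the real file as `encoder.pkl.1`. Capital `-O` (`--output-document`) is the flag that names the download:

```shell
# Lowercase -o: encoder.pkl becomes the wget LOG, the actual download
# lands in encoder.pkl.1 -- this is what produced the two files.
# wget -o $TOKENIZER_PATH/encoder.pkl <url>

# Capital -O: the download itself is saved as encoder.pkl.
wget -O "$TOKENIZER_PATH/encoder.pkl" \
  https://conversationhub.blob.core.windows.net/beit-share-public/dall-e_vae/encoder.pkl
```

This also explains the `invalid load key, '-'` error: the log file begins with a `--<timestamp>--` line, so `torch.load` saw `-` as its first byte.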