microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Home Page: https://aka.ms/GeneralAI

(Still Wrong) Solved by the PR at https://github.com/microsoft/unilm/pull/670#event-6324997330

AdrienGuo opened this issue

Hi,
I don't think the issue has been fixed; I still get the same error.
I tried the encoder.pkl downloaded from https://cdn.openai.com/dall-e/encoder.pkl and from https://conversationhub.blob.core.windows.net/beit-share-public/dall-e_vae/encoder.pkl, but both give the same error below. (I am pretraining on the CIFAR-100 dataset.)

Traceback (most recent call last):
  File "run_beit_pretraining.py", line 280, in <module>
    main(opts)
  File "run_beit_pretraining.py", line 175, in main
    d_vae = utils.create_d_vae(
  File "/workspace/beit/beit/utils.py", line 532, in create_d_vae
    return get_dalle_vae(weight_path, image_size, device)
  File "/workspace/beit/beit/utils.py", line 541, in get_dalle_vae
    vae.load_model(model_dir=weight_path, device=device)
  File "/workspace/beit/beit/modeling_discrete_vae.py", line 216, in load_model
    self.encoder = load_model(os.path.join(model_dir, "encoder.pkl"), device)
  File "/workspace/beit/beit/dall_e/__init__.py", line 18, in load_model
    return torch.load(f, map_location=device)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 595, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 764, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '-'.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'run_beit_pretraining.py', '--local_rank=0', '--data_path', './data/cifar100png', '--output_dir', './checkpoint/', '--num_mask_patches', '75', '--model', 'beit_base_patch16_224_8k_vocab', '--discrete_vae_weight_path', './tokenizer/', '--batch_size', '128', '--lr', '1.5e-3', '--warmup_steps', '10000', '--epochs', '150', '--clip_grad', '3.0', '--drop_path', '0.1', '--layer_scale_init_value', '0.1']' returned non-zero exit status 1.
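
For what it's worth, the _pickle.UnpicklingError: invalid load key, '-' usually means the file handed to torch.load is not a pickle at all (for example an HTML error page or a text log saved under the .pkl name). A minimal sketch to check what the file actually contains, assuming the ./tokenizer/encoder.pkl path from the command below:

from pathlib import Path

# Peek at the size and first bytes of the tokenizer weights: a torch/pickle
# file usually starts with b'\x80' (pickle protocol) or b'PK' (zip-based
# checkpoint), while readable text here suggests the download went wrong.
path = Path("./tokenizer/encoder.pkl")
print("size:", path.stat().st_size, "bytes")
print("first bytes:", path.read_bytes()[:64])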

The training script is

python -m torch.distributed.launch run_beit_pretraining.py \
--data_path ./data/cifar100png --output_dir ./checkpoint/ --num_mask_patches 75 \
--model beit_base_patch16_224_8k_vocab --discrete_vae_weight_path ./tokenizer/ \
--batch_size 128 --lr 1.5e-3 --warmup_steps 10000 --epochs 150 \
--clip_grad 3.0 --drop_path 0.1 --layer_scale_init_value 0.1

I only have one GPU, so I disabled distributed mode.

It turns out it was my fault.

I don't know why, but after I ran wget -o $TOKENIZER_PATH/encoder.pkl https://conversationhub.blob.core.windows.net/beit-share-public/dall-e_vae/encoder.pkl it generated two files, "encoder.pkl" and "encoder.pkl.1". It's pretty weird. (If anybody knows how to deal with this, just send me an email, I'd appreciate it!)
"encoder.pkl.1" is the correct one; my code is pretraining now. Thank you!!