martiansideofthemoon / style-transfer-paraphrase

Official code and data repository for our EMNLP 2020 long paper "Reformulating Unsupervised Style Transfer as Paraphrase Generation" (https://arxiv.org/abs/2010.05700).

Home Page: http://style.cs.umass.edu

Error when replicating Shakespeare training

harrison-broadbent opened this issue · comments

Hey, I'm trying to train the Shakespeare model myself.

I'm following the Training section of the README -

  • I have downloaded the Shakespeare folder from Google Drive and placed it in the datasets folder
  • I have downloaded the pretrained gpt2-large paraphrase model and placed it in /style-transfer-paraphrase/style_paraphrase/saved_models (see the layout sketch below)
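For reference, the layout after those two steps looks roughly like this (just a sketch - file names taken from the traceback below):

style-transfer-paraphrase/
├── datasets/
│   └── shakespeare/
└── style_paraphrase/
    ├── run_lm_finetuning.py
    ├── style_dataset.py
    └── saved_models/
        └── (pretrained gpt2-large paraphrase model)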

When I run run_finetune_shakespeare_0.sh I get the following error -

Traceback (most recent call last):
  File "style-transfer-paraphrase/style_paraphrase/run_lm_finetuning.py", line 505, in <module>
    main()
  File "style-transfer-paraphrase/style_paraphrase/run_lm_finetuning.py", line 417, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
  File "style-transfer-paraphrase/style_paraphrase/run_lm_finetuning.py", line 72, in load_and_cache_examples
    split="dev" if evaluate else "train"
  File "/content/style-transfer-paraphrase/style_paraphrase/style_dataset.py", line 118, in __init__
    self.config = DATASET_CONFIG[data_dir]
KeyError: 'style-transfer-paraphrase/datasets/shakespeare'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1102) of binary: /usr/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

Full stacktrace -

/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py:164: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  "The module torch.distributed.launch is deprecated "
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
 Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : style-transfer-paraphrase/style_paraphrase/run_lm_finetuning.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_dl308gzb/none_zlb_0vv3
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_dl308gzb/none_zlb_0vv3/attempt_0/0/error.json
10/02/2021 07:32:05 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
10/02/2021 07:32:58 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='style-transfer-paraphrase/datasets/shakespeare', device=device(type='cuda', index=0), do_delete_old=False, do_eval=False, do_lower_case=False, do_train=True, eval_frequency_min=0, eval_patience=10, evaluate_during_training=True, evaluate_specific=None, extra_embedding_dim=768, fp16=False, fp16_opt_level='O1', global_dense_feature_list='none', gradient_accumulation_steps=2, job_id='shakespeare_0', learning_rate='5e-5', limit_examples=None, local_rank=0, logging_steps=20, max_grad_norm=1.0, max_steps=-1, model_name_or_path='gpt2-large', model_type='gpt2', n_gpu=1, no_cuda=False, num_train_epochs=3.0, optimizer='adam', output_dir='style-transfer-paraphrase/style_paraphrase/saved_models/model_shakespeare_0', overwrite_output_dir=False, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=5, prefix_input_type='paraphrase_250', save_steps=500, save_total_limit=-1, seed=42, specific_style_train='0', target_style_override='none', tokenizer_name='', warmup_steps=0, weight_decay=0.0)
Traceback (most recent call last):
  File "style-transfer-paraphrase/style_paraphrase/run_lm_finetuning.py", line 505, in <module>
    main()
  File "style-transfer-paraphrase/style_paraphrase/run_lm_finetuning.py", line 417, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
  File "style-transfer-paraphrase/style_paraphrase/run_lm_finetuning.py", line 72, in load_and_cache_examples
    split="dev" if evaluate else "train"
  File "/content/style-transfer-paraphrase/style_paraphrase/style_dataset.py", line 118, in __init__
    self.config = DATASET_CONFIG[data_dir]
KeyError: 'style-transfer-paraphrase/datasets/shakespeare'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1102) of binary: /usr/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

I'm using Google Colab - I'm not sure if that might be causing some sort of issue?
Odds are, though, that I've just missed something - any help is much appreciated!

Here's a screenshot of my file tree on Colab -

[Screenshot: Colab file tree - Screen Shot 2021-10-02 at 5.41.57 pm]

My run_finetune_shakespeare_0.sh file -

#!/bin/sh
#SBATCH --job-name=finetune_gpt2_shakespeare_0
#SBATCH -o style_paraphrase/logs/log_shakespeare_0.txt
#SBATCH --time=167:00:00
#SBATCH --partition=m40-long
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=3
#SBATCH --mem=50GB
#SBATCH -d singleton

# Experiment Details :- GPT2-large model for shakespeare.
# Run Details :- accumulation = 2, batch_size = 5, beam_size = 1, cpus = 3, dataset = datasets/shakespeare, eval_batch_size = 1, global_dense_feature_list = none, gpu = m40, learning_rate = 5e-5, memory = 50, model_name = gpt2-large, ngpus = 1, num_epochs = 3, optimizer = adam, prefix_input_type = paraphrase_250, save_steps = 500, save_total_limit = -1, specific_style_train = 0, stop_token = eos

export DATA_DIR=style-transfer-paraphrase/datasets/shakespeare
BASE_DIR=style-transfer-paraphrase/style_paraphrase

python -m torch.distributed.launch --nproc_per_node=1 $BASE_DIR/run_lm_finetuning.py \
    --output_dir=$BASE_DIR/saved_models/model_shakespeare_0 \
    --model_type=gpt2 \
    --model_name_or_path=gpt2-large \
    --data_dir=$DATA_DIR \
    --do_train \
    --save_steps 500 \
    --logging_steps 20 \
    --save_total_limit -1 \
    --evaluate_during_training \
    --num_train_epochs 3 \
    --gradient_accumulation_steps 2 \
    --per_gpu_train_batch_size 5 \
    --job_id shakespeare_0 \
    --learning_rate 5e-5 \
    --prefix_input_type paraphrase_250 \
    --global_dense_feature_list none \
    --specific_style_train 0 \
    --optimizer adam

I'm feeling quite confused because I tried to follow the instructions as closely as possible, yet still ended up with errors.
I'd really appreciate some help or pointers.

Thanks!

Hi @harrison-broadbent,
Thank you for your interest in the codebase, and sorry you're running into trouble. The error comes from this line in style_paraphrase/style_dataset.py (line 118 in your traceback): `self.config = DATASET_CONFIG[data_dir]` - the data_dir string has to match a key in DATASET_CONFIG exactly.

Is it possible to cd into style-transfer-paraphrase and then run the codebase from there? That way the data directory will be datasets/shakespeare, which is the form DATASET_CONFIG expects.
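Concretely, something like this (just a sketch - the exact keys to match live in style_paraphrase/style_dataset.py):

# Run from the repository root so relative paths match the DATASET_CONFIG keys.
cd style-transfer-paraphrase

# In run_finetune_shakespeare_0.sh, make both paths repo-relative:
export DATA_DIR=datasets/shakespeare   # was style-transfer-paraphrase/datasets/shakespeare
BASE_DIR=style_paraphrase              # was style-transfer-paraphrase/style_paraphrase

# ...then launch the script as before, from inside style-transfer-paraphrase/.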

Perfect, thank you so much!