mgrankin / ru_transformers

New version of fastai broke everything?

DevAlone opened this issue · comments

It seems that a new version of fastai broke this repository. I'm getting this error:

conda ImportError: cannot import name progress_bar from fastprogress

which disappears if I install fastai==1.0.59.

But there are still errors with different packages like transformers. Could you please send the output of your pip freeze inside the working environment?

Libraries change quickly. I don't have the training environment available anymore, but I've got a working inference environment:
pip_freeze.txt
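
When comparing two environments like this, it can help to print only the versions of the packages the thread is about instead of reading a full pip freeze. The following is a minimal sketch, not part of the repository; the package list is an assumption based on the names mentioned in this issue, and `importlib.metadata` is in the standard library only from Python 3.8 (on Python 3.7 the `importlib_metadata` backport would be needed).

```python
from importlib import metadata  # standard library since Python 3.8

# Assumption: the packages named in this thread; extend the tuple as needed.
PACKAGES = ("fastai", "fastprogress", "transformers", "torch", "apex")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running it in both the broken and the working environment gives two short lists that are easy to diff.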

Now I'm getting this weird error:

Traceback (most recent call last):
  File "run_lm_finetuning.py", line 662, in <module>
    main()
  File "run_lm_finetuning.py", line 630, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_lm_finetuning.py", line 296, in train
    model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
  File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/frontend.py", line 358, in initialize
    return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
  File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/_initialize.py", line 171, in _initialize
    check_params_fp32(models)
  File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/_initialize.py", line 93, in check_params_fp32
    name, param.type()))
  File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/_amp_state.py", line 32, in warn_or_err
    raise RuntimeError(msg)
RuntimeError: Found param transformer.wte.weight with type torch.FloatTensor, expected torch.cuda.FloatTensor.
When using amp.initialize, you need to provide a model with parameters
located on a CUDA device before passing it no matter what optimization level
you chose. Use model.to('cuda') to use the default device.

CUDA is installed (10.1).

What is your GPU?

NVIDIA GeForce GT 1030

> sudo lshw -C display
  *-display                 
       description: VGA compatible controller
       product: GP108 [GeForce GT 1030]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:38 memory:fd000000-fdffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:e000(size=128) memory:c0000-dffff

The error is probably due to the lack of mixed precision training support in the Pascal architecture: the GT 1030 has no Tensor Cores. Try running without the --fp16 parameter.

https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html

Many thanks. It seems to be learning, but now my 12 GB of RAM is not enough. I ran it on a very small 2 MB dataset just to check that there are no errors, but I'm not really sure that I'm doing everything correctly. Right now I have a dataset of comment texts (HTML stripped), one comment per line. What I did:

1 yttm bpe --data ./some_comments_10000.without_html.txt --model bpe/yt.model --vocab_size 50257 --coverage 0.9999

some_comments_10000.without_html.txt has 10,000 comments. The actual dataset is more than 140 million comments (86 GB of uncompressed text), but first I want to make sure I'm doing everything right, without waiting forever only to hit some new error.

2

cd ru_transformers
export TRAIN_FILE=./some_comments_10000.without_html.txt
export CUDA_VISIBLE_DEVICES=1
export MODEL_SIZE=gpt2
export OUTPUT=output_yt/s
export BS=8
export LR=5e-5
python run_lm_finetuning.py \
    --output_dir=$OUTPUT \
    --model_type=gpt2 \
    --model_name_or_path=$MODEL_SIZE \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size $BS \
    --save_steps=10000 \
    --logging_steps=1 \
    --warmup_samples 16000 \
    --learning_rate $LR \
    --tokenizer_class YTEncoder \
    --tokenizer_name bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps 1000 \
    --eval_data_file=./data/classic/valid \
    --unfreeze_level 0

3 Got an out-of-memory error
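
One thing worth checking in step 2, offered as a guess rather than a diagnosis: the command exports CUDA_VISIBLE_DEVICES=1, and CUDA device numbering starts at 0. If the machine has a single GPU, index 1 does not exist, so the process may see no GPU at all and fall back to the CPU. That would be consistent with the earlier apex message about a torch.FloatTensor parameter and with host RAM filling up. The sketch below is a simplified model of how integer indices in that variable are interpreted; `visible_gpus` is a hypothetical helper, not a CUDA or PyTorch function, and UUID-style entries are not handled.

```python
def visible_gpus(cuda_visible_devices, physical_gpu_count):
    """Return the physical GPU indices a process would see.

    Simplified model for integer indices only: entries are read left to
    right, and the first invalid index hides itself and everything after it.
    """
    if cuda_visible_devices is None:  # variable unset: every GPU is visible
        return list(range(physical_gpu_count))
    visible = []
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_gpu_count:
            break  # invalid entry: stop, nothing further is visible
        visible.append(int(token))
    return visible


# One physical GPU, as on a single-card desktop:
print(visible_gpus("1", 1))   # []  -> no GPU visible
print(visible_gpus("0", 1))   # [0]
print(visible_gpus(None, 1))  # [0]
```

On a single-GPU machine, either exporting CUDA_VISIBLE_DEVICES=0 or leaving the variable unset keeps the card visible.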

No, I meant RAM:

cat /proc/meminfo
MemTotal:       12196764 kB

My GPU has 2 GB, yes:

nvidia-smi
Wed Jan  1 18:52:39 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 1030     On   | 00000000:01:00.0  On |                  N/A |
| 40%   15C    P8    N/A /  30W |      1MiB /  1998MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Is this configuration OK for training on an 86 GB uncompressed dataset?

The dataset size isn't a problem, because it is loaded in pieces.

The problem is the model size. GPT-2 is pretty big, and 2 GB of GPU memory is not enough. I'm using a Titan RTX with 24 GB. You can rent a decent GPU from GCP or AWS.
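
A rough back-of-the-envelope estimate shows why 2 GB is too little. The sketch below counts only the fp32 weights, the gradients, and the two moment buffers of an Adam-style optimizer, and ignores activations entirely. The parameter count (about 124 million for the smallest GPT-2) and the choice of an Adam-style optimizer are assumptions made for the illustration; freezing layers would shrink the gradient and optimizer terms.

```python
BYTES_PER_FP32 = 4


def training_state_bytes(num_params):
    """Bytes for fp32 weights, gradients, and Adam's two moment buffers."""
    weights = num_params * BYTES_PER_FP32
    gradients = num_params * BYTES_PER_FP32
    adam_moments = 2 * num_params * BYTES_PER_FP32
    return weights + gradients + adam_moments


GPT2_SMALL_PARAMS = 124_000_000  # assumption: approximate size of the smallest GPT-2
GPU_MEMORY_MIB = 1998            # total memory reported by nvidia-smi in this thread

needed_mib = training_state_bytes(GPT2_SMALL_PARAMS) / 2**20
print(f"{needed_mib:.0f} MiB of training state vs {GPU_MEMORY_MIB} MiB on the GPU")
```

Under these assumptions the training state alone takes almost all of the card's memory before a single batch of activations is stored.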

Hi Mikhail!
First, thx for sharing.

I get a runtime error when I run it:

<IPython.core.display.HTML object>
<IPython.core.display.HTML object>
<IPython.core.display.HTML object>
Traceback (most recent call last):
  File "run_lm_finetuning.py", line 662, in <module>
    main()
  File "run_lm_finetuning.py", line 630, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_lm_finetuning.py", line 289, in train
    scheduler = get_constant_schedule(optimizer, warmup_steps=warmup_steps)
TypeError: get_constant_schedule() got an unexpected keyword argument 'warmup_steps'

@pavelxx1, transformers made some breaking changes at some point (see huggingface/transformers#1837 (comment)), so you need to either downgrade to the version from before that change (I don't know which one) or modify the sources, which is what I did. If I remember correctly, I renamed get_constant_schedule to get_constant_schedule_with_warmup and warmup_steps to num_warmup_steps.
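
For context on what the renamed scheduler computes: a constant schedule with warmup multiplies the base learning rate by a factor that ramps linearly from 0 to 1 over the warmup steps and then stays at 1. The plain-Python sketch below only illustrates that factor; it is not the library code, `constant_with_warmup_factor` is a hypothetical name, and the 100 warmup steps are an arbitrary value chosen for the example.

```python
def constant_with_warmup_factor(step, num_warmup_steps):
    """Learning-rate multiplier: linear ramp from 0 to 1, then constant 1."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return 1.0


base_lr = 5e-5  # the LR value exported in the training command in this thread
for step in (0, 50, 100, 500):
    factor = constant_with_warmup_factor(step, num_warmup_steps=100)
    print(step, base_lr * factor)

# In transformers versions after the rename, the equivalent call is
# get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps).
```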

@DevAlone, thx bro! You made my day!)) All OK.

Hi!)
@DevAlone or @mgrankin, how can I tell if the model has overfitted?
I use the 117M model with a 50257 vocab size, a small dataset.txt (1.25 MB), num_train_epochs=1000 and unfreeze_level -1.
Training took about 4 hours, and the training score at the end was: MovingLoss=0.79, Perplexity=1.06
But... my generated samples were identical to text lines from the dataset :( What did I do wrong?

@pavelxx1
The only correct way to detect overfitting is to use a validation set. Creating a good validation set is one of the most important things in your ML project. You can read why here:

https://www.fast.ai/2017/11/13/validation-sets/

Strangely, the validation set is sometimes completely overlooked. For example, nshepperd/gpt-2, a popular repository for training GPT-2 on English poetry, doesn't use any validation set, and it probably overfits severely, as your model does.

If I haven't convinced you to use a validation set, then just train for 3-5 epochs and hope that it will be good enough.
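
As an illustration of the advice above, here is a minimal sketch of holding out part of a one-comment-per-line dataset and picking the epoch where the validation loss is lowest. The function names are hypothetical, the loss values are made up for the example, and a plain random split is an assumption: the linked article explains cases where a random split is not appropriate.

```python
import random


def split_lines(lines, valid_fraction=0.1, seed=0):
    """Shuffle lines reproducibly and hold out a fraction for validation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def best_epoch(valid_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(valid_losses)), key=valid_losses.__getitem__)


train, valid = split_lines([f"comment {i}" for i in range(100)])
print(len(train), len(valid))  # 90 10

# Hypothetical loss curves, for illustration only.
train_losses = [3.2, 2.5, 1.9, 1.2, 0.8]
valid_losses = [3.3, 2.9, 2.8, 3.1, 3.6]
print(best_epoch(valid_losses))  # 2
```

In the made-up curves the training loss keeps falling while the validation loss rises after epoch 2, which is the pattern that signals overfitting.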

I've updated the README. You shouldn't use --model_name_or_path=$MODEL_SIZE if you want to start with pre-trained Russian GPT-2. You should download the model, put it in the output dir and use --model_name_or_path=$OUTPUT

If you set --model_name_or_path=gpt2 you'll start with English GPT-2.

@mgrankin, thx, I will test it and report the result.

@pavelxx1, paste your command here.

@mgrankin All done!
It was my mistake, I figured it out.
thanks again!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.