New version of fast ai broke everything?

Question

DevAlone opened this issue 5 years ago · comments

It seems that a new version of fastai broke this repository. I'm getting this error:

ImportError: cannot import name progress_bar from fastprogress

which disappears if I install fastai==1.0.59.

But there are still errors with other packages, such as transformers. Could you please send the output of pip freeze from your working environment?
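A minimal sketch of how installed versions could be compared against known-good pins once a pip freeze is available. Only the fastai==1.0.59 pin comes from this thread; find_mismatches is a hypothetical helper, and importlib.metadata assumes Python 3.8 or newer.

```python
from importlib.metadata import PackageNotFoundError, version

# The only pin confirmed in this thread; extend it from a known-good pip freeze.
PINS = {"fastai": "1.0.59"}


def find_mismatches(pins, lookup=version):
    """Return {package: (wanted, installed)} for every pin that is not satisfied.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINS).items():
        print(f"{name}: want {wanted}, found {installed}")
```

Running it inside the environment prints one line per package whose installed version differs from the pin, and nothing when everything matches.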

Mikhail Grankin · Answer 1 · Wed Jan 01 2020 15:16:25 GMT+0800 (China Standard Time)

Libraries change quickly. I don't have the training environment available anymore, but I've got a working inference environment.
pip_freeze.txt

DevAlone · Answer 2 · Wed Jan 01 2020 21:57:26 GMT+0800 (China Standard Time)

Now I'm getting this weird error:

Traceback (most recent call last):
  File "run_lm_finetuning.py", line 662, in <module>
    main()
  File "run_lm_finetuning.py", line 630, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_lm_finetuning.py", line 296, in train
    model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
  File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/frontend.py", line 358, in initialize
    return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
  File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/_initialize.py", line 171, in _initialize
    check_params_fp32(models)
  File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/_initialize.py", line 93, in check_params_fp32
    name, param.type()))
  File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/_amp_state.py", line 32, in warn_or_err
    raise RuntimeError(msg)
RuntimeError: Found param transformer.wte.weight with type torch.FloatTensor, expected torch.cuda.FloatTensor.
When using amp.initialize, you need to provide a model with parameters
located on a CUDA device before passing it no matter what optimization level
you chose. Use model.to('cuda') to use the default device.

CUDA is installed (10.1).

Mikhail Grankin · Answer 3 · Wed Jan 01 2020 23:06:32 GMT+0800 (China Standard Time)

What is your GPU?

DevAlone · Answer 4 · Wed Jan 01 2020 23:25:28 GMT+0800 (China Standard Time)

NVIDIA GeForce GT 1030

> sudo lshw -C display
  *-display                 
       description: VGA compatible controller
       product: GP108 [GeForce GT 1030]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:38 memory:fd000000-fdffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:e000(size=128) memory:c0000-dffff

Mikhail Grankin · Answer 5 · Wed Jan 01 2020 23:39:27 GMT+0800 (China Standard Time)

The error is probably due to the lack of mixed-precision training support on the Pascal architecture. Try running without the --fp16 parameter.

https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html

DevAlone · Answer 6 · Thu Jan 02 2020 01:31:41 GMT+0800 (China Standard Time)

Many thanks. It seems to be learning, but now my 12 GB of RAM is not enough. I ran it on a very small dataset (2 MB) just to check that there are no errors, but I'm not really sure that I'm doing everything the correct way. Right now I have a dataset of comment text (HTML stripped), one comment per line. What I did:

1 yttm bpe --data ./some_comments_10000.without_html.txt --model bpe/yt.model --vocab_size 50257 --coverage 0.9999

some_comments_10000.without_html.txt has 10,000 comments. The actual dataset is over 140 million comments (86 GB of uncompressed text), but first I want to make sure I'm doing everything right, without waiting forever only to hit some new error.

2

cd ru_transformers
export TRAIN_FILE=./some_comments_10000.without_html.txt
export CUDA_VISIBLE_DEVICES=1
export MODEL_SIZE=gpt2
export OUTPUT=output_yt/s
export BS=8
export LR=5e-5
python run_lm_finetuning.py \
    --output_dir=$OUTPUT \
    --model_type=gpt2 \
    --model_name_or_path=$MODEL_SIZE \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size $BS \
    --save_steps=10000 \
    --logging_steps=1 \
    --warmup_samples 16000 \
    --learning_rate $LR \
    --tokenizer_class YTEncoder \
    --tokenizer_name bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps 1000 \
    --eval_data_file=./data/classic/valid \
    --unfreeze_level 0

3 Got an out-of-memory error
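
One thing that may be worth ruling out in the command above: it exports CUDA_VISIBLE_DEVICES=1, while the nvidia-smi output further down the thread lists a single GPU with index 0. As far as CUDA's documented behavior goes, an index that does not exist hides that device and every device listed after it, so a process started this way may see no GPU at all. The sketch below only mimics that rule for integer indices (UUID-style values are not handled); visible_devices is a hypothetical helper, not part of any library.

```python
import os


def visible_devices(env_value, gpu_count):
    """Physical GPU indices a process would see for a CUDA_VISIBLE_DEVICES value.

    Mimics the rule that an invalid index hides itself and everything after it.
    `env_value` is None when the variable is unset (all GPUs visible).
    """
    if env_value is None:
        return list(range(gpu_count))
    visible = []
    for token in env_value.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= gpu_count:
            break  # first invalid entry: stop, later entries are ignored too
        visible.append(int(token))
    return visible


# Assumption: one physical GPU, as in the nvidia-smi output in this thread.
print(visible_devices("1", gpu_count=1))  # []
print(visible_devices("0", gpu_count=1))  # [0]
print(visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"), gpu_count=1))
```

If the first call matches what the real process sees, setting the variable to 0 (or leaving it unset) would be the thing to try before changing anything else.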

Mikhail Grankin · Answer 7 · Thu Jan 02 2020 02:26:14 GMT+0800 (China Standard Time)

Are you sure you've got 12 GB?

https://www.geforce.com/hardware/desktop-gpus/geforce-gt-1030/specifications

DevAlone · Answer 8 · Thu Jan 02 2020 02:54:17 GMT+0800 (China Standard Time)

No, I meant RAM:

cat /proc/meminfo
MemTotal:       12196764 kB

My GPU has 2 GB, yes:

nvidia-smi
Wed Jan  1 18:52:39 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 1030     On   | 00000000:01:00.0  On |                  N/A |
| 40%   15C    P8    N/A /  30W |      1MiB /  1998MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Is this configuration OK for training on an 86 GB uncompressed dataset?

Mikhail Grankin · Answer 9 · Thu Jan 02 2020 14:12:34 GMT+0800 (China Standard Time)

The dataset size isn't a problem, because it is loaded in pieces.

The problem is the model size. GPT-2 is pretty big, and 2 GB of GPU memory is not enough. I'm using a Titan RTX with 24 GB. You can rent a decent GPU from GCP or AWS.
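
A rough back-of-the-envelope sketch of why 2 GB is tight. It assumes fp32 weights and an Adam-style optimizer that keeps two extra values per parameter, which is an assumption about the training script rather than something confirmed in this thread. The 117M parameter count is the one quoted for the small model later in the thread, and 1998 MiB is the total from the nvidia-smi output above. Activations and the CUDA context are ignored, so real usage would be higher.

```python
MIB = 1024 ** 2


def training_state_bytes(n_params, bytes_per_value=4, optimizer_slots=2):
    """Bytes for weights + gradients + optimizer slots, all in the same dtype."""
    copies = 1 + 1 + optimizer_slots  # weights, gradients, optimizer state
    return n_params * bytes_per_value * copies


needed_mib = training_state_bytes(117_000_000) / MIB
gpu_mib = 1998  # total reported by nvidia-smi for the GT 1030
print(f"weights + grads + optimizer state: ~{needed_mib:.0f} MiB of {gpu_mib} MiB")
```

Under these assumptions the fixed training state alone comes to about 1785 MiB, which leaves very little of the card for activations at any batch size.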

reserved · Answer 10 · Sat Jan 04 2020 04:24:23 GMT+0800 (China Standard Time)

Hi Mikhail!
First, thanks for sharing.

I get a runtime error when I run it:

<IPython.core.display.HTML object>
<IPython.core.display.HTML object>
<IPython.core.display.HTML object>
Traceback (most recent call last):
  File "run_lm_finetuning.py", line 662, in <module>
    main()
  File "run_lm_finetuning.py", line 630, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_lm_finetuning.py", line 289, in train
    scheduler = get_constant_schedule(optimizer, warmup_steps=warmup_steps)
TypeError: get_constant_schedule() got an unexpected keyword argument 'warmup_steps'

DevAlone · Answer 11 · Sat Jan 04 2020 04:37:49 GMT+0800 (China Standard Time)

@pavelxx1, transformers made some breaking changes at some point (huggingface/transformers#1837 (comment)), so you need to either downgrade to the earlier version (I don't know which one) or, as I did, modify the sources. If I remember correctly, I renamed get_constant_schedule to get_constant_schedule_with_warmup and warmup_steps to num_warmup_steps.
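
The rename described above can be sketched as a small text patch. This is only an illustration: patch_scheduler_call is a hypothetical helper, it assumes the call sits on a single line as in the traceback, and it edits source text rather than importing transformers.

```python
import re

# Whole-word match, so the already-renamed function is left alone.
OLD_NAME = re.compile(r"\bget_constant_schedule\b")
# Keyword argument only: "warmup_steps=" but not "warmup_steps ==".
OLD_KWARG = re.compile(r"\bwarmup_steps\s*=(?!=)")


def patch_scheduler_call(source):
    """Rename the scheduler function and its warmup keyword, line by line."""
    patched = []
    for line in source.splitlines(keepends=True):
        if OLD_NAME.search(line):
            line = OLD_NAME.sub("get_constant_schedule_with_warmup", line)
            # Touch the keyword only on lines that actually contain the call,
            # so plain assignments such as "warmup_steps = ..." stay intact.
            if "get_constant_schedule_with_warmup(" in line:
                line = OLD_KWARG.sub("num_warmup_steps=", line)
        patched.append(line)
    return "".join(patched)


before = "scheduler = get_constant_schedule(optimizer, warmup_steps=warmup_steps)\n"
print(patch_scheduler_call(before), end="")
# scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps)
```

Applying the function twice gives the same result as applying it once, so it is safe to rerun on an already patched file.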

reserved · Answer 12 · Sat Jan 04 2020 04:50:21 GMT+0800 (China Standard Time)

@DevAlone, thanks bro! You made my day!)) All OK.

reserved · Answer 13 · Sun Jan 05 2020 03:29:32 GMT+0800 (China Standard Time)

Hi!)
@DevAlone or @mgrankin, how can I tell whether the model has overfitted?
I use: the 117M model, a vocab size of 50257, a small dataset.txt (1.25 MB), num_train_epochs=1000, and unfreeze_level -1.
Training took about 4 hours, and the training score at the end was: MovingLoss=0.79, Perplexity=1.06.
But... my generated samples were identical to lines of text from the dataset :( What did I do wrong?

Mikhail Grankin · Answer 14 · Sun Jan 05 2020 18:12:50 GMT+0800 (China Standard Time)

@pavelxx1
The only correct way to detect overfitting is to use a validation set. Creating a good validation set is one of the most important things in your ML project. You can read why here:

https://www.fast.ai/2017/11/13/validation-sets/

Strangely, the validation set is sometimes completely overlooked. For example, nshepperd/gpt-2, a popular repository for training GPT-2 on English poetry, doesn't use any validation set, and it probably overfits severely, as your run does.

If I haven't convinced you to use a validation set, then just train for 3-5 epochs and hope that is good enough.
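
A minimal sketch of the validation idea, assuming one sample per line as in the comment dataset earlier in the thread. The random split is an assumption (the linked article discusses cases where a random subset is a poor choice), train_valid_split and best_epoch are hypothetical helpers, and the loss values are made-up numbers for illustration only.

```python
import random


def train_valid_split(lines, valid_fraction=0.1, seed=42):
    """Shuffle non-empty lines deterministically and hold out a validation part."""
    shuffled = [line for line in lines if line.strip()]
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def best_epoch(valid_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(valid_losses)), key=valid_losses.__getitem__)


comments = [f"comment {i}" for i in range(100)]
train, valid = train_valid_split(comments)
print(len(train), len(valid))

# Validation loss falls, then rises again while training continues:
# the turning point is where overfitting starts to show.
print(best_epoch([3.1, 2.4, 2.2, 2.5, 2.9]))  # 2
```

The held-out lines would go to --eval_data_file, and training loss that keeps dropping after the best validation epoch is the sign of memorizing the training text.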

Mikhail Grankin · Answer 15 · Sun Jan 05 2020 20:11:28 GMT+0800 (China Standard Time)

I've updated the README. You shouldn't use --model_name_or_path=$MODEL_SIZE if you want to start with the pre-trained Russian GPT-2. Instead, download the model, put it in the output dir, and use --model_name_or_path=$OUTPUT.

Mikhail Grankin · Answer 16 · Sun Jan 05 2020 20:12:29 GMT+0800 (China Standard Time)

If you set --model_name_or_path=gpt2, you'll start with the English GPT-2.

reserved · Answer 17 · Sun Jan 05 2020 23:11:11 GMT+0800 (China Standard Time)

@mgrankin, thanks, I will test it and report the result.

Mikhail Grankin · Answer 18 · Mon Jan 06 2020 03:14:42 GMT+0800 (China Standard Time)

@pavelxx1, paste your command here.

reserved · Answer 19 · Mon Jan 06 2020 06:21:50 GMT+0800 (China Standard Time)

@mgrankin All done!
It was my mistake, I figured it out.
thanks again!

stale · Answer 20 · Fri Mar 06 2020 07:00:42 GMT+0800 (China Standard Time)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.