New version of fastai broke everything?
DevAlone opened this issue
It seems that a new version of fastai broke this repository. I'm getting the error
conda ImportError: cannot import name progress_bar from fastprogress
which disappears if I install fastai==1.0.59. But there are still errors with other packages such as transformers. Could you please send the output of pip freeze from your working environment?
Libraries change quickly. I don't have the training environment available anymore, but I've got a working inference environment.
pip_freeze.txt
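Once a known-good pip freeze listing is available, comparing it with the broken environment narrows down which packages drifted. The following is a minimal sketch of that comparison; `parse_freeze` and `diff_freeze` are hypothetical helper names, and the sample package lists are made up purely for illustration (they are not the contents of pip_freeze.txt).

```python
def parse_freeze(text):
    """Map package name -> pinned version from `pip freeze` style text."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries without a "==" pin
        # (for example editable installs or direct URL references).
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def diff_freeze(working, broken):
    """Return {package: (working_version, broken_version)} for each mismatch."""
    names = sorted(set(working) | set(broken))
    return {
        name: (working.get(name), broken.get(name))
        for name in names
        if working.get(name) != broken.get(name)
    }


# Made-up sample data, only to show the shape of the result.
working = parse_freeze("fastai==1.0.59\nexample-pkg==1.0\n")
broken = parse_freeze("fastai==1.0.60\nexample-pkg==1.0\nother-pkg==0.1\n")
print(diff_freeze(working, broken))
```

A value of None on either side means the package is missing from that environment. In practice the two inputs would be read from files saved with pip freeze in each environment.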
Now I'm getting this weird error:
Traceback (most recent call last):
File "run_lm_finetuning.py", line 662, in <module>
main()
File "run_lm_finetuning.py", line 630, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_finetuning.py", line 296, in train
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/frontend.py", line 358, in initialize
return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/_initialize.py", line 171, in _initialize
check_params_fp32(models)
File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/_initialize.py", line 93, in check_params_fp32
name, param.type()))
File "/home/user/anaconda3/envs/gpt/lib/python3.7/site-packages/apex/amp/_amp_state.py", line 32, in warn_or_err
raise RuntimeError(msg)
RuntimeError: Found param transformer.wte.weight with type torch.FloatTensor, expected torch.cuda.FloatTensor.
When using amp.initialize, you need to provide a model with parameters
located on a CUDA device before passing it no matter what optimization level
you chose. Use model.to('cuda') to use the default device.
CUDA is installed (10.1).
What is your GPU?
NVIDIA GeForce GT 1030
> sudo lshw -C display
*-display
description: VGA compatible controller
product: GP108 [GeForce GT 1030]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:01:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:38 memory:fd000000-fdffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:e000(size=128) memory:c0000-dffff
The error is probably due to the lack of mixed-precision training support on the Pascal architecture. Try running without the --fp16 parameter.
https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
Many thanks. It seems to be learning, but now my 12 GB of RAM is not enough. I ran it on a very small dataset (2 MB) just to check that there are no errors, but I'm not sure I'm doing everything correctly. Right now I have a dataset of comment text (HTML stripped), one comment per line. What I did:
1 yttm bpe --data ./some_comments_10000.without_html.txt --model bpe/yt.model --vocab_size 50257 --coverage 0.9999
some_comments_10000.without_html.txt has 10,000 comments. The actual dataset is more than 140 million comments (86 GB of uncompressed text), but first I want to make sure I'm doing everything right, without waiting forever only to hit a new error.
2
cd ru_transformers
export TRAIN_FILE=./some_comments_10000.without_html.txt
export CUDA_VISIBLE_DEVICES=1
export MODEL_SIZE=gpt2
export OUTPUT=output_yt/s
export BS=8
export LR=5e-5
python run_lm_finetuning.py \
--output_dir=$OUTPUT \
--model_type=gpt2 \
--model_name_or_path=$MODEL_SIZE \
--do_train \
--train_data_file=$TRAIN_FILE \
--per_gpu_train_batch_size $BS \
--save_steps=10000 \
--logging_steps=1 \
--warmup_samples 16000 \
--learning_rate $LR \
--tokenizer_class YTEncoder \
--tokenizer_name bpe/yt.model \
--do_eval \
--evaluate_during_training \
--eval_steps 1000 \
--eval_data_file=./data/classic/valid \
--unfreeze_level 0
3 Got an out-of-memory error
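One thing that may be worth ruling out in the command above: it exports CUDA_VISIBLE_DEVICES=1, while the nvidia-smi output posted later in this thread lists a single GPU with index 0. CUDA reads that variable as a list of device indices, so index 1 on a one-GPU machine would leave no GPU visible to the process. That would also be consistent with the earlier error reporting parameters of type torch.FloatTensor, and with host RAM (rather than GPU memory) running out. The sketch below is a simplified model of the rule, not NVIDIA's implementation; `visible_gpu_indices` is a hypothetical helper and handles integer indices only.

```python
def visible_gpu_indices(cuda_visible_devices, physical_gpu_count):
    """Approximate which physical GPUs a CUDA process can see.

    Simplification: only integer indices are handled (the real variable
    also accepts GPU UUIDs). Enumeration stops at the first invalid
    entry, so anything listed after it is ignored.
    """
    if cuda_visible_devices is None:
        # Variable unset: every physical GPU is visible.
        return list(range(physical_gpu_count))
    visible = []
    for entry in cuda_visible_devices.split(","):
        entry = entry.strip()
        if not entry.isdigit() or int(entry) >= physical_gpu_count:
            break
        visible.append(int(entry))
    return visible


# The setting from the command above, on a machine with one GPU (index 0).
print(visible_gpu_indices("1", 1))
# The setting that would expose that single GPU.
print(visible_gpu_indices("0", 1))
```

If the first call returns an empty list for your machine, the training script has no GPU to use, and exporting CUDA_VISIBLE_DEVICES=0 (or leaving the variable unset) is the thing to try.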
Are you sure you've got 12 GB?
https://www.geforce.com/hardware/desktop-gpus/geforce-gt-1030/specifications
No, I meant RAM:
cat /proc/meminfo 0 ms
MemTotal: 12196764 kB
My GPU has 2 GB, yes:
nvidia-smi 1307 ms
Wed Jan 1 18:52:39 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 1030 On | 00000000:01:00.0 On | N/A |
| 40% 15C P8 N/A / 30W | 1MiB / 1998MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Is this configuration OK for training on an 86 GB uncompressed dataset?
The dataset size isn't a problem, because it is loaded in pieces.
The problem is the model size. GPT-2 is pretty big, and 2 GB of GPU memory is not enough. I'm using a Titan RTX with 24 GB. You can rent a decent GPU from GCP or AWS.
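A rough back-of-the-envelope estimate shows why 2 GB is tight even for the smallest GPT-2. The sketch below assumes the 117M-parameter model mentioned later in this thread, fp32 values, and an Adam-style optimizer that keeps two extra values per parameter; the optimizer choice is an assumption about the training script, and `training_memory_mib` is a hypothetical helper.

```python
def training_memory_mib(num_params, bytes_per_value=4, copies=4):
    """Lower bound on memory for model state during training, in MiB.

    copies=4 counts the weights, their gradients, and the two Adam
    moment estimates. Activations, the CUDA context, and framework
    overhead are NOT included, so real usage is higher.
    """
    return num_params * bytes_per_value * copies / 2**20


estimate = training_memory_mib(117_000_000)
print(f"model state alone: {estimate:.0f} MiB of the 1998 MiB on a GT 1030")
```

Under these assumptions the model state alone comes to about 1,785 MiB, which leaves almost nothing for activations at a batch size of 8.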
Hi Mikhail!
First, thanks for sharing.
I get a runtime error when I run it:
<IPython.core.display.HTML object>
<IPython.core.display.HTML object>
<IPython.core.display.HTML object>
Traceback (most recent call last):
File "run_lm_finetuning.py", line 662, in <module>
main()
File "run_lm_finetuning.py", line 630, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_finetuning.py", line 289, in train
scheduler = get_constant_schedule(optimizer, warmup_steps=warmup_steps)
TypeError: get_constant_schedule() got an unexpected keyword argument 'warmup_steps'
@pavelxx1, transformers made some breaking changes at some point (huggingface/transformers#1837 (comment)), so you need to either downgrade to a version from before that change (I don't know which one) or, as I did, modify the sources. If I remember correctly, I renamed get_constant_schedule to get_constant_schedule_with_warmup and warmup_steps to num_warmup_steps.
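To make the effect of the rename concrete, here is a plain-Python sketch of the multiplier that a constant schedule with linear warmup applies to the base learning rate. It is an illustration of the idea, not the transformers source; `constant_with_warmup_multiplier` is a hypothetical name, and the warmup length of 100 steps is chosen only for the demo.

```python
def constant_with_warmup_multiplier(step, num_warmup_steps):
    """Learning-rate multiplier: linear ramp from 0 to 1, then constant 1."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return 1.0


# The call-site change described above would look like this in the script:
#   before: get_constant_schedule(optimizer, warmup_steps=warmup_steps)
#   after:  get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps)

base_lr = 5e-5  # the LR value from the command earlier in the thread
for step in (0, 50, 100, 500):
    print(step, base_lr * constant_with_warmup_multiplier(step, 100))
```

The learning rate climbs linearly during warmup and then stays at the base value, which is the behavior the script's warmup_steps argument was written for.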
Hi!
@DevAlone or @mgrankin, how can I tell whether the model has overfitted?
I use the 117M model, a vocab size of 50257, a small dataset.txt (1.25 MB), num_train_epochs=1000, and unfreeze_level -1.
Training took about 4 hours,
and the training score at the end was: MovingLoss=0.79, Perplexity=1.06
But my generated samples were identical to text lines from the dataset :( What did I do wrong?
@pavelxx1
The only correct way to detect overfitting is to use a validation set. Creating a good validation set is one of the most important things in your ML project. You can read why here:
https://www.fast.ai/2017/11/13/validation-sets/
Strangely, the validation set is sometimes completely overlooked. For example, nshepperd/gpt-2, a popular repository for training GPT-2 on English poetry, doesn't use any validation set. It probably overfits severely, as your run does.
If I haven't convinced you to use a validation set, then just train for 3-5 epochs and hope that will be good enough.
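As a starting point, a validation set can be held out from a one-sample-per-line dataset before training, and generated samples can be checked for verbatim copies of training lines. This is a minimal sketch under the assumption that lines are independent samples; `split_lines` and `copied_fraction` are hypothetical helpers, and the 10% split is an arbitrary choice.

```python
import random


def split_lines(lines, valid_fraction=0.1, seed=0):
    """Shuffle samples and hold out a fraction of them for validation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def copied_fraction(generated, train_lines):
    """Share of generated lines that appear verbatim in the training set."""
    if not generated:
        return 0.0
    train = set(train_lines)
    return sum(line in train for line in generated) / len(generated)


# Synthetic stand-in for a dataset file with one sample per line.
samples = [f"sample line {i}" for i in range(100)]
train, valid = split_lines(samples)
print(len(train), len(valid))
```

If validation loss starts rising while training loss keeps falling, or `copied_fraction` on generated text is close to 1, the model is memorizing the training data. For continuous text such as a book, holding out whole documents or chapters is usually a safer split than shuffling individual lines.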
I've updated the README. You shouldn't use --model_name_or_path=$MODEL_SIZE if you want to start with the pre-trained Russian GPT-2. You should download the model, put it in the output dir, and use --model_name_or_path=$OUTPUT. If you set --model_name_or_path=gpt2, you'll start with the English GPT-2.
@pavelxx1, please paste your command here.