mgrankin / ru_transformers


Memory Issue

gdet opened this issue · comments

commented

Hello,

I have 28 GB of text and I want to train from scratch. I have 4 GPUs (product: GV102, vendor: NVIDIA Corporation), and training crashes due to memory issues. I saw this in your readme:

# My dataset is 230 GB and it doesn't fit in RAM, so each epoch is a random sample from it. That is why the loop.
while true
do
python run_lm_finetuning.py \
    --output_dir=$OUTPUT \
    --model_type=gpt2 \
    --model_name_or_path=$OUTPUT \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size $BS \
    --save_steps=10000 \
    --logging_steps=10 \
    --fp16 \
    --fp16_opt_level O2 \
    --warmup_samples 16000 \
    --learning_rate $LR \
    --overwrite_output_dir \
    --tokenizer_class YTEncoder \
    --tokenizer_name bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps 1000 \
    --eval_data_file=./data/classic/valid \
    --save_total_limit 30 \
    --num_train_epochs 10.0 \
    --unfreeze_level 0

sleep 1

done

So if I use these parameters to train the model, when will it stop, since the loop is while true? Do you believe this will fix the memory problem? Also, does it eventually use all of the training data, or is each epoch a random sample as you say?

Thank you

Hello,

Paste your error message, please.

commented
 [4828811.086861] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice,task=python3,pid=79331,uid=1005
 [4828811.086936] Out of memory: Killed process 79331 (python3) total-vm:85663880kB, anon-rss:63639632kB, file-rss:71768kB, shmem-rss:10240kB, UID:1005 pgtables:128624kB oom_score_adj:0

My memory (values in MB) is:

 total        used        free      shared  buff/cache   available
 Mem:          64273        8746       44776           6       10749       54870

Try experimenting with this line of code:
files = files[:10000]

My dataset was a collection of small txt files. I sampled 10k files on each run. You can try lowering that number and see if that helps.
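For context, a minimal sketch of where such a cap might live, assuming the training files are gathered with glob (the directory path and the shuffle step are illustrative, not the repo's exact code):

import glob
import random

# Hypothetical sketch: collect candidate training files, then cap how many
# are used per run so the tokenized corpus fits in RAM.
files = glob.glob("./data/train/*.txt")  # illustrative path
random.shuffle(files)                    # a different sample each run
files = files[:10000]                    # lower the cap if RAM is still tight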

commented

I have one very big file, so I will try to break it into smaller ones like your sample, test the code again, and let you know. Thank you for your help.
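A minimal sketch of that splitting step in Python (the file names and chunk size are illustrative):

import os

# Split one large corpus into many smaller txt files.
os.makedirs("data/chunks", exist_ok=True)
chunk_lines = 100_000  # smaller chunks give the sampling loop finer granularity
with open("big_corpus.txt", encoding="utf-8") as src:
    buf, idx = [], 0
    for line in src:
        buf.append(line)
        if len(buf) >= chunk_lines:
            with open(f"data/chunks/part_{idx:05d}.txt", "w", encoding="utf-8") as out:
                out.writelines(buf)
            buf, idx = [], idx + 1
    if buf:  # flush the final partial chunk
        with open(f"data/chunks/part_{idx:05d}.txt", "w", encoding="utf-8") as out:
            out.writelines(buf)

Each chunk then behaves like one of the small txt files that the sampling loop expects.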

commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.