mgrankin / ru_transformers


Memory Issue

gdet opened this issue · comments

commented

Hello,

I have 28 GB of text and I want to train from scratch. I have 4 GPUs (product: GV102, vendor: NVIDIA Corporation), and training crashes due to memory issues. I saw this in your readme:

# My dataset is 230 GB and it doesn't fit in RAM, so each epoch is a random sample from it. That is why the loop.
while true
do
python run_lm_finetuning.py \
    --output_dir=$OUTPUT \
    --model_type=gpt2 \
    --model_name_or_path=$OUTPUT \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size $BS \
    --save_steps=10000 \
    --logging_steps=10 \
    --fp16 \
    --fp16_opt_level O2 \
    --warmup_samples 16000 \
    --learning_rate $LR \
    --overwrite_output_dir \
    --tokenizer_class YTEncoder \
    --tokenizer_name bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps 1000 \
    --eval_data_file=./data/classic/valid \
    --save_total_limit 30 \
    --num_train_epochs 10.0 \
    --unfreeze_level 0

sleep 1

done

So if I use these parameters to train the model, when will it stop, since the loop is while true? Do you believe this will fix the memory problem? Also, does it eventually use all of the training data, or is each epoch a random sample as you say?

Thank you

Hello,

Paste your error message, please.

commented
 [4828811.086861] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice,task=python3,pid=79331,uid=1005
 [4828811.086936] Out of memory: Killed process 79331 (python3) total-vm:85663880kB, anon-rss:63639632kB, file-rss:71768kB, shmem-rss:10240kB, UID:1005 pgtables:128624kB oom_score_adj:0

My memory (values in MB) is:

 total        used        free      shared  buff/cache   available
 Mem:          64273        8746       44776           6       10749       54870

Try experimenting with this line of code:
files = files[:10000]

My dataset was a collection of small txt files. I sampled 10k files on each run. You can try lowering that number and see if that helps.
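For context, a minimal sketch of where such a cap might live, assuming the training files are gathered with glob (the directory path and the shuffle step are illustrative, not the repo's exact code):

import glob
import random

# Hypothetical sketch: collect candidate training files, then cap how many
# are used per run so the tokenized corpus fits in RAM.
files = glob.glob("./data/train/*.txt")  # illustrative path
random.shuffle(files)                    # a different sample each run
files = files[:10000]                    # lower the cap if RAM is still tight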

commented

I have one very big file, so I will try to break it into smaller ones like your sample, test the code again, and let you know. Thank you for your help.
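A minimal sketch of that splitting step in Python (the file names and chunk size are illustrative):

import os

# Split one large corpus into many smaller txt files.
os.makedirs("data/chunks", exist_ok=True)
chunk_lines = 100_000  # smaller chunks give the sampling loop finer granularity
with open("big_corpus.txt", encoding="utf-8") as src:
    buf, idx = [], 0
    for line in src:
        buf.append(line)
        if len(buf) >= chunk_lines:
            with open(f"data/chunks/part_{idx:05d}.txt", "w", encoding="utf-8") as out:
                out.writelines(buf)
            buf, idx = [], idx + 1
    if buf:  # flush the final partial chunk
        with open(f"data/chunks/part_{idx:05d}.txt", "w", encoding="utf-8") as out:
            out.writelines(buf)

Each chunk then behaves like one of the small txt files that the sampling loop expects.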

commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.