Memory Issue
gdet opened this issue · comments
Hello,
I have 28GB of text and I want to train from scratch. I have 4 GPUs (product: GV102, vendor: NVIDIA Corporation), and training crashes due to memory issues. I saw this in your README:
# My dataset is 230Gb and it doesn't fit in RAM, so each epoch is a random sample from it. That is why the loop.
while true
do
python run_lm_finetuning.py \
--output_dir=$OUTPUT \
--model_type=gpt2 \
--model_name_or_path=$OUTPUT \
--do_train \
--train_data_file=$TRAIN_FILE \
--per_gpu_train_batch_size $BS \
--save_steps=10000 \
--logging_steps=10 \
--fp16 \
--fp16_opt_level O2 \
--warmup_samples 16000 \
--learning_rate $LR \
--overwrite_output_dir \
--tokenizer_class YTEncoder \
--tokenizer_name bpe/yt.model \
--do_eval \
--evaluate_during_training \
--eval_steps 1000 \
--eval_data_file=./data/classic/valid \
--save_total_limit 30 \
--num_train_epochs 10.0 \
--unfreeze_level 0
sleep 1
done
So if I use these parameters, when will training stop, given that the loop is `while true`? Do you believe this will fix the memory problem? Also, does it eventually use all of the training data, or is each epoch a random sample as you say?
Thank you
Hello,
Paste your error message, please
[4828811.086861] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice,task=python3,pid=79331,uid=1005
[4828811.086936] Out of memory: Killed process 79331 (python3) total-vm:85663880kB, anon-rss:63639632kB, file-rss:71768kB, shmem-rss:10240kB, UID:1005 pgtables:128624kB oom_score_adj:0
My memory is (values in MB):
              total    used    free  shared  buff/cache  available
Mem:          64273    8746   44776       6       10749       54870
Try poking at this line of code:
files = files[:10000]
My dataset was a collection of small txt files, and I sampled 10k of them each run. You can try lowering that number and see if that helps.
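For reference, the sampling step amounts to something like the sketch below (the glob path and variable names are illustrative, not necessarily what run_lm_finetuning.py uses internally):

```python
import glob
import random

# Collect the small .txt shards (path is a placeholder for your data dir)
files = glob.glob("data/shards/*.txt")
random.shuffle(files)

# Each pass of the while-true loop trains on a random subsample,
# so RAM usage is bounded by the sample size, not the full corpus.
files = files[:10000]  # lower this number if you still hit OOM
```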
I have one very big file, so I will try breaking it down into smaller ones like in your setup, then test the code again and let you know. Thank you for your help.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.