allenai / allennlp

An open-source NLP research library, built on PyTorch.

Home Page: http://www.allennlp.org

train_model recover protocol

g-luo opened this issue · comments

I was wondering whether train_model (from allennlp.commands.train import train_model) recovers from the step the process died on, or from the first step of the epoch?

I ask because:

  • I am running a process where I relaunch train_model with recover=true every 2 epochs because of out-of-memory issues. Also note that my DataLoader config looks like the snippet below, and the total number of samples in my dataset is ~1M.
  • My loss curves show clear overfitting, where blue is validation and orange is train.

As a result, I suspect that train_model is recovering from the first step of the epoch, so the model only ever sees the same ~120k samples out of the ~1M in the dataset, which would explain the overfitting. I would love it if someone could confirm this or weigh in. Thanks!
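To make the two possible protocols concrete, here is a toy simulation (not AllenNLP code; all numbers and the iterator logic are illustrative assumptions): a "restart" recovery rewinds the data cursor to the first instance on every relaunch, while a "resume" recovery continues from where the previous process stopped.

```python
# Toy simulation contrasting two hypothetical recovery protocols for a
# data loader. Dataset size, instances per epoch, and relaunch cadence
# are made-up numbers, not AllenNLP defaults.

def samples_seen(total, per_epoch, epochs_per_run, runs, resume):
    """Return the set of instance indices visited across all relaunches."""
    seen = set()
    cursor = 0  # position in the dataset
    for _ in range(runs):
        if not resume:
            cursor = 0  # restart-style recovery rewinds to the first instance
        for _ in range(epochs_per_run * per_epoch):
            seen.add(cursor % total)
            cursor += 1
    return seen

# 20-instance dataset, 4 instances per "epoch", relaunch every 2 epochs, 3 runs.
restart = samples_seen(total=20, per_epoch=4, epochs_per_run=2, runs=3, resume=False)
resume = samples_seen(total=20, per_epoch=4, epochs_per_run=2, runs=3, resume=True)
print(sorted(restart))  # the same first 8 instances, visited over and over
print(sorted(resume))   # 24 draws wrap around and cover all 20 instances
```

Under restart-style recovery the model never escapes the head of the dataset, which is exactly the failure mode described above.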

 batch_size: 16
 max_instances_in_memory: 8192
 biggest_batch_first: false
 instances_per_epoch: 65536
 maximum_samples_per_batch: ["num_tokens", 16384]
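A back-of-the-envelope check of the numbers in this config (the 2-epoch relaunch cadence is taken from the description above; everything else is from the config) shows that if each recovery rewinds the data iterator to the start of the dataset, the model would only ever see a small, fixed slice of the data:

```python
# Sketch: how many unique samples would be seen if every recover=True
# relaunch restarted the DataLoader from the first instance.
# Values come from the config above; epochs_per_run is the relaunch cadence.

total_samples = 1_000_000      # ~1M instances in the dataset
instances_per_epoch = 65_536   # from the config
epochs_per_run = 2             # process is relaunched every 2 epochs

# Only the first instances_per_epoch * epochs_per_run instances
# would ever be visited.
unique_samples_seen = instances_per_epoch * epochs_per_run
print(unique_samples_seen)                    # 131072, i.e. roughly the ~120k above
print(unique_samples_seen / total_samples)    # ~13% of the dataset
```

That 131,072 figure is on the order of the ~120k samples mentioned above, so the arithmetic is at least consistent with the restart hypothesis.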

[Screenshot, 2022-06-02: loss curves showing validation (blue) and train (orange) diverging]

ccing @zmykevin

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇