train_model recover protocol
g-luo opened this issue
I was wondering whether train_model (`from allennlp.commands.train import train_model`) recovers from the step the process died on, or from the first step of the epoch.
The reason why I ask is because:
- I am relaunching train_model with recover=True every 2 epochs to work around out-of-memory issues. For reference, my DataLoader config is below; the total number of samples in my dataset is ~1M.
- My loss curves show clear overfitting (blue is val, orange is train).
As a result, I suspect that train_model is recovering from the first step of the epoch, so across relaunches the model only ever sees the same ~130k samples out of the ~1M in the dataset, which would explain the overfitting. I would love it if someone could confirm or give input on this. Thanks!
```yaml
batch_size: 16
max_instances_in_memory: 8192
biggest_batch_first: false
instances_per_epoch: 65536
maximum_samples_per_batch: ["num_tokens", 16384]
```
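To make the suspicion above concrete, here is a toy sketch in plain Python (not AllenNLP internals; all names here are made up). It contrasts a loader whose stream position carries over across relaunches with one that restarts each relaunch from the first step of the epoch, using the same numbers as the config (~1M instances, `instances_per_epoch: 65536`, 2 epochs per relaunch):

```python
# Toy model of the suspected behavior (illustrative only, not AllenNLP code).

def epoch_samples(dataset, instances_per_epoch, start=0):
    """Yield one epoch's worth of instances, cycling through the dataset."""
    n = len(dataset)
    return [dataset[(start + i) % n] for i in range(instances_per_epoch)]

dataset = list(range(1_000_000))   # ~1M instances, as in the issue
instances_per_epoch = 65_536

# Case 1: the stream position carries over, so each epoch advances
# through fresh instances.
seen_with_offset = set()
for epoch in range(4):             # 2 relaunches x 2 epochs each
    seen_with_offset.update(
        epoch_samples(dataset, instances_per_epoch,
                      start=epoch * instances_per_epoch))

# Case 2: every relaunch restarts at the first step of the epoch, so each
# 2-epoch chunk re-reads the same prefix of the dataset.
seen_with_restart = set()
for chunk in range(2):             # each relaunch = 2 epochs from step 0
    for epoch in range(2):
        seen_with_restart.update(
            epoch_samples(dataset, instances_per_epoch,
                          start=epoch * instances_per_epoch))

print(len(seen_with_offset))   # 262144 distinct instances after 4 epochs
print(len(seen_with_restart))  # 131072 — the same ~130k prefix every relaunch
```

If the restart case is what's happening, the model would only ever train on 2 × 65536 = 131072 instances, consistent with the overfitting curves.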
ccing @zmykevin
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇