allenai / allennlp

An open-source NLP research library, built on PyTorch.

Home Page: http://www.allennlp.org

train_model recover protocol

g-luo opened this issue · comments

I was wondering whether train_model (from allennlp.commands.train import train_model) recovers from the step the process died on, or from the first step of the epoch?

I ask because:

  • I am running a process where I relaunch train_model with recover=true every 2 epochs because of out-of-memory issues. Also note that my DataLoader config looks like the snippet below, and the total number of samples in my dataset is ~1M.
  • My loss curves show clear overfitting, where blue is validation and orange is train.

As a result, I suspect that train_model is recovering from the first step of the epoch, so the model only ever sees the same ~120k samples out of the ~1M in the dataset, which would explain the overfitting. I would love it if someone could confirm this or weigh in. Thanks!
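To make the two possible protocols concrete, here is a toy simulation (not AllenNLP code; all numbers and the iterator logic are illustrative assumptions): a "restart" recovery rewinds the data cursor to the first instance on every relaunch, while a "resume" recovery continues from where the previous process stopped.

```python
# Toy simulation contrasting two hypothetical recovery protocols for a
# data loader. Dataset size, instances per epoch, and relaunch cadence
# are made-up numbers, not AllenNLP defaults.

def samples_seen(total, per_epoch, epochs_per_run, runs, resume):
    """Return the set of instance indices visited across all relaunches."""
    seen = set()
    cursor = 0  # position in the dataset
    for _ in range(runs):
        if not resume:
            cursor = 0  # restart-style recovery rewinds to the first instance
        for _ in range(epochs_per_run * per_epoch):
            seen.add(cursor % total)
            cursor += 1
    return seen

# 20-instance dataset, 4 instances per "epoch", relaunch every 2 epochs, 3 runs.
restart = samples_seen(total=20, per_epoch=4, epochs_per_run=2, runs=3, resume=False)
resume = samples_seen(total=20, per_epoch=4, epochs_per_run=2, runs=3, resume=True)
print(sorted(restart))  # the same first 8 instances, visited over and over
print(sorted(resume))   # 24 draws wrap around and cover all 20 instances
```

Under restart-style recovery the model never escapes the head of the dataset, which is exactly the failure mode described above.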

 batch_size: 16
 max_instances_in_memory: 8192
 biggest_batch_first: false
 instances_per_epoch: 65536
 maximum_samples_per_batch: ["num_tokens", 16384]
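A back-of-the-envelope check of the numbers in this config (the 2-epoch relaunch cadence is taken from the description above; everything else is from the config) shows that if each recovery rewinds the data iterator to the start of the dataset, the model would only ever see a small, fixed slice of the data:

```python
# Sketch: how many unique samples would be seen if every recover=True
# relaunch restarted the DataLoader from the first instance.
# Values come from the config above; epochs_per_run is the relaunch cadence.

total_samples = 1_000_000      # ~1M instances in the dataset
instances_per_epoch = 65_536   # from the config
epochs_per_run = 2             # process is relaunched every 2 epochs

# Only the first instances_per_epoch * epochs_per_run instances
# would ever be visited.
unique_samples_seen = instances_per_epoch * epochs_per_run
print(unique_samples_seen)                    # 131072, i.e. roughly the ~120k above
print(unique_samples_seen / total_samples)    # ~13% of the dataset
```

That 131,072 figure is on the order of the ~120k samples mentioned above, so the arithmetic is at least consistent with the restart hypothesis.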

[Screenshot, 2022-06-02: loss curves showing validation (blue) and train (orange) diverging]

ccing @zmykevin

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇