dmlc / gluon-nlp

NLP made easy

Home Page: https://nlp.gluon.ai/

BERT pretraining loss and accuracy are not stable

rondogency opened this issue

Description

I am running 8-node Horovod BERT pretraining with gluonnlp==0.10.0 for 10k steps, and I found that the final loss and accuracy after 10k steps differ between runs; in particular, the MLM loss varies a lot.

So I am wondering whether there is a stable way to reproduce the loss and accuracy during pretraining.

Error Message

Here are the logs from 4 runs:

[1,0]<stdout>:[step 10000]#011mlm_loss=2.76460#011mlm_acc=51.87#011nsp_loss= 0.08#011nsp_acc=96.88#011throughput=6.6K tks/s#011lr=0.0002000 time=0.57, latency=574.0 ms/step
[1,0]<stdout>:[step 10000]#011mlm_loss=1.66976#011mlm_acc=69.06#011nsp_loss= 0.07#011nsp_acc=96.88#011throughput=7.3K tks/s#011lr=0.0002000 time=0.54, latency=541.5 ms/step
[1,0]<stdout>:[step 10000]#011mlm_loss=2.61692#011mlm_acc=51.15#011nsp_loss= 0.02#011nsp_acc=100.00#011throughput=6.6K tks/s#011lr=0.0002000 time=0.53, latency=530.5 ms/step
[1,0]<stdout>:[step 10000]#011mlm_loss=2.25963#011mlm_acc=57.22#011nsp_loss= 0.11#011nsp_acc=93.75#011throughput=6.5K tks/s#011lr=0.0002000 time=0.55, latency=547.6 ms/step

To Reproduce

I am using SageMaker to launch Horovod; the shared config is below (see the launch sketch after the block):

    instance_count = 8
    instance_type = "ml.p3dn.24xlarge"
    SM_DATA_ROOT = '/opt/ml/input/data/train'
    hyperparameters={
        "data": '/'.join([SM_DATA_ROOT, 'bert/train']),
        "data_eval": '/'.join([SM_DATA_ROOT, 'bert/eval']),
        "ckpt_dir": '/'.join([SM_DATA_ROOT, 'ckpt_dir']),
        "comm_backend": "smddp",
        "model": "bert_24_1024_16",
        "total_batch_size": instance_count * 256,
        "total_batch_size_eval": instance_count * 256,
        "max_seq_length": 128,
        "max_predictions_per_seq": 20,
        'log_interval': 1,
        "lr": 0.0002,
        "num_steps": 10000,
        'warmup_ratio': 1,
        "raw": '',
    }
    distribution = {'mpi': {'enabled': True, "custom_mpi_options": "-verbose --NCCL_DEBUG=INFO"}}
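
For context, here is a minimal sketch of how such a job might be launched with the SageMaker Python SDK's MXNet estimator; the role, entry point path, and framework version are assumptions, not the exact job definition used above:

    # Sketch only: assumes the SageMaker Python SDK (v2) MXNet estimator;
    # role, source_dir, entry_point, and framework_version are hypothetical placeholders.
    from sagemaker.mxnet import MXNet

    estimator = MXNet(
        entry_point="run_pretraining.py",              # GluonNLP BERT pretraining script
        source_dir="scripts/bert",                     # hypothetical location of the script
        role="arn:aws:iam::<account>:role/<sm-role>",  # hypothetical execution role
        instance_count=instance_count,
        instance_type=instance_type,
        framework_version="1.7.0",                     # assumed MXNet version for gluonnlp==0.10.0
        py_version="py3",
        hyperparameters=hyperparameters,
        distribution=distribution,
    )
    estimator.fit({"train": "s3://<bucket>/bert-pretraining-data"})  # hypothetical S3 input channel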

Steps to reproduce

(Paste the commands you ran that produced the error.)

What have you tried to solve it?

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here

At the moment there isn't a good way to enforce reproducibility when using distributed training, due to differing compute orders. While complete reproducibility is hard to obtain in the current setting, potential ideas for reducing variance are to 1) initialize once, store the initial random weights, and always train from that set of weights (see the sketch below), and 2) enforce sample ordering in the data loader by processing and storing the processed samples beforehand.
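
As a rough illustration of idea 1), the initial weights could be serialized once and reloaded at the start of every run. This is only a sketch, assuming a Gluon block `model` (e.g. the BERT model from `nlp.model.get_model`) and a shared path visible to all workers:

    # Sketch of idea 1): reuse one stored set of initial weights across runs.
    # `model` and INIT_PARAMS are assumptions for illustration, not part of the script.
    import os
    import mxnet as mx

    INIT_PARAMS = "/opt/ml/input/data/train/bert_init.params"  # hypothetical shared path

    if os.path.exists(INIT_PARAMS):
        # Every run starts from exactly the same weights.
        model.load_parameters(INIT_PARAMS, ctx=mx.gpu())
    else:
        # First run: initialize once and store the weights for later runs.
        model.initialize(mx.init.Normal(0.02), ctx=mx.gpu())
        model.save_parameters(INIT_PARAMS)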

@rondogency Has the training stabilized after you fixed the random seed?

@szha thanks for the answer!

@sxjscience yes, I am using what we have in gluon-cv https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/random.py, and the loss is now more stable (less variance across multiple runs).

I suggest adding the same thing to gluonnlp and allowing the user to pass a random seed to the BERT run_pretraining.py script.

Thanks, we will ensure that all scripts use this set_seed function:

    import random
    import numpy as np

    def set_seed(seed):
        """Seed the Python, NumPy, and MXNet RNGs."""
        import mxnet as mx
        mx.random.seed(seed)
        np.random.seed(seed)
        random.seed(seed)
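
For example, run_pretraining.py could expose a seed argument (the `--seed` flag below is a proposal, not an existing option) and call set_seed before building the model and data loaders:

    import argparse

    # Sketch only: '--seed' is the proposed, not yet existing, argument.
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed', type=int, default=0,
                        help='random seed for the Python, NumPy, and MXNet RNGs')
    args = parser.parse_args()

    set_seed(args.seed)  # call before model initialization and data shuffling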

I'll close this issue for now as it should have been solved in the master version.
Feel free to reopen.