k2-fsa / icefall

Home Page: https://k2-fsa.github.io/icefall/

Regular spikes in training metrics for zipformer training on custom data

duhtapioca opened this issue

We are training the zipformer large-scale config on ~20k hours of data and are seeing regular spikes in all of the metrics. Is this expected, or is it caused by bad data in the dataset? The tensorboard is as follows:

[TensorBoard screenshot]

The training command being used is:

python train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir runs/exp-large \
  --causal 0 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192 \
  --inf-check True

Will this significantly hurt the final model(s)? If there is some bad (misaligned) data in the dataset, how should we interpret the inverted spikes in the validation loss; shouldn't they look similar to the train loss? Also, the validation set is 500k files, or about 500 hours; is that too large?

Have you shuffled your training data?

Please also see:

shuf | gzip -c > data/fbank/librispeech_cuts_train-all-shuf.jsonl.gz
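
The fragment above (from the LibriSpeech data preparation) shuffles the manifest lines and recompresses them; since each line of a cuts JSONL file is one cut, a line-level shuffle is a global shuffle of the cuts. A minimal Python sketch doing the same thing for a custom manifest (the paths below are hypothetical placeholders):

import gzip
import random

# Hypothetical paths; point these at your own cuts manifest.
src = "data/fbank/custom_cuts_train.jsonl.gz"
dst = "data/fbank/custom_cuts_train-shuf.jsonl.gz"

# Each JSONL line is one cut, so shuffling lines globally shuffles the cuts.
with gzip.open(src, "rt") as f:
    lines = f.readlines()
random.Random(42).shuffle(lines)
with gzip.open(dst, "wt") as f:
    f.writelines(lines)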

Have you shuffled your training data?

No, the data is not shuffled, but the dataloader does shuffle it:

train_sampler = DynamicBucketingSampler(
    train_cuts_arg,
    max_duration=1200.0,
    shuffle=True,
    num_buckets=10,
)

Is this relevant?

If possible could you do a global shuffle by following the example I just posted?

DynamicBucketingSampler only shuffles data within a buffer, whose size you set yourself.
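
For illustration, a sketch of enlarging that buffer when creating the sampler. The buffer_size and shuffle_buffer_size arguments are the relevant knobs in recent lhotse versions (check your installed version); the values and the manifest path here are only examples:

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

# Hypothetical path to the (ideally pre-shuffled) training cuts.
train_cuts = CutSet.from_file("data/fbank/custom_cuts_train-shuf.jsonl.gz")

train_sampler = DynamicBucketingSampler(
    train_cuts,
    max_duration=1200.0,
    shuffle=True,
    num_buckets=10,
    buffer_size=20000,          # how many cuts are buffered for dynamic bucketing
    shuffle_buffer_size=50000,  # how many cuts are mixed together when shuffle=True
)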

I will restart the training after doing a global shuffle and share the tensorboard here. One problem is that each epoch takes 8 hours on 4 A100s with a 1200s max duration; is there a way to optimize this and make the training faster?

On the dataloader side, you can do several things: increase num_buckets to 30 or even 50, and set quadratic_duration to 10-15, which will allow you to set max_duration much higher.
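
As an illustration only (the exact values should be tuned for your GPUs, and the manifest path is hypothetical), those suggestions translate to something like:

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

train_cuts = CutSet.from_file("data/fbank/custom_cuts_train-shuf.jsonl.gz")  # hypothetical path

train_sampler = DynamicBucketingSampler(
    train_cuts,
    max_duration=2400.0,      # can be raised once quadratic_duration is set
    shuffle=True,
    num_buckets=30,           # tighter duration buckets, less padding per batch
    quadratic_duration=15.0,  # long cuts are counted as more expensive
)

With quadratic_duration set, a cut's effective cost grows roughly as duration + duration^2 / quadratic_duration, so batches dominated by long utterances stay smaller and memory use stays bounded even with a higher max_duration.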


@pzelasko
Will we also be required to adjust the LR parameters with this change? The dataset we're using is 20k hours. In this case, is the GigaSpeech LR strategy appropriate, or should we make any modifications?

@yaozengwei Could you have a look?

If possible could you do a global shuffle by following the example I just posted?

Shuffling helped avoid the spikes. The tensorboard after shuffling: https://github.com/k2-fsa/icefall/assets/173154737/d6ee582d-322c-4433-8089-9dae2e62463e

Does this imply that most of the bad data was concentrated together when unshuffled? Were those spikes caused by mismatched transcripts, or could they be due to differences in data quality, since some of the data comes from different domains with different recording environments?

When the data is not shuffled, some mini-batches will consist primarily of sessions/speakers that were very difficult for any reason (out-of-domain/noisy/bad transcript/etc). It's like training the model with constantly shifting domains rather than randomly sampling them.

Even with 20k hours of data, this problem still occurs. This is so strange.