k2-fsa / icefall

Home Page: https://k2-fsa.github.io/icefall/

How to use CutSet.mux effectively?

johnchienbronci opened this issue

@pzelasko
Hi, I have a question.
According to the weights description,

 Since the iterables might be of different length, we provide a ``weights`` parameter to let the user decide which iterables should be sampled more frequently than others. 
When an iterable is exhausted, we will keep sampling from the other iterables, until we exhaust them all

If there are 3 datasets in different languages and their sizes vary greatly, e.g.:
a = lhotse.load_manifest_lazy(...)  # len(a) = 300000
b = lhotse.load_manifest_lazy(...)  # len(b) = 1000000
c = lhotse.load_manifest_lazy(...)  # len(c) = 50000

CutSet.mux(a, b, c, weights=[len(a), len(b), len(c)])

In this situation:

  1. Will it sample them a bit more uniformly throughout the whole output sequence?

  2. Will the model adapt poorly to the c data (c is much smaller, so it is exhausted faster)?

  3. Is it possible to first expand the data of the three different languages to the same amount,
    similar to: interleave_datasets(..., stopping_strategy="all_exhausted")?

    Is there any other way?

  1. Yes, the distribution of examples in mini-batches will be stationary (the same as the weights in mux) until some iterator ends (the tail of the iteration doesn't preserve that anymore).
  2. It's hard to say, but typically oversampling smaller datasets helps to get better results on them.
  3. I suggest calling .repeat().shuffle() on every input cutset to make them infinite and tweaking the weights; see the sketch after this list. The training should be tracked by steps and not epochs after that. You'll ensure that throughout the whole training the model observes the dataset distribution you want.
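
A minimal sketch of that suggestion (the manifest paths and the weights below are hypothetical placeholders to adapt):

import lhotse
from lhotse import CutSet

# Load each language's cuts lazily (paths are hypothetical).
a = lhotse.load_manifest_lazy("data/lang_a_cuts.jsonl.gz")
b = lhotse.load_manifest_lazy("data/lang_b_cuts.jsonl.gz")
c = lhotse.load_manifest_lazy("data/lang_c_cuts.jsonl.gz")

# Make every stream infinite and shuffled, then mux with the desired
# sampling proportions. Since no stream is ever exhausted, the mixture
# stays stationary for the whole (step-based) training run.
train_cuts = CutSet.mux(
    a.repeat().shuffle(),
    b.repeat().shuffle(),
    c.repeat().shuffle(),
    weights=[0.3, 0.5, 0.2],  # example weights, tune to taste
    seed=0,
)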

Thanks for your reply.

The training should be tracked by steps and not epochs after that. You’ll ensure that throughout the whole training the model observes the dataset distribution you want.

How can I modify the code (zipformer/train.py) to set the maximum number of steps instead of epochs?
Currently, the parameter can only specify the number of epochs.

You might need to modify the code a bit in that case. Since the dataloader will never finish iterating, you'd need to move validation and checkpoint saving into the training loop so they are executed every N steps (see the sketch below). Also make sure to set a different random seed if you continue the training, otherwise you'll iterate over the same data as before (most Lhotse classes accept a "trng" seed value to automatically randomize the seed at the cost of non-reproducibility).
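
A rough sketch of what that step-based loop could look like; compute_loss, run_validation, and save_checkpoint are hypothetical helpers, and model, optimizer, scheduler, train_dl, and valid_dl are assumed to already exist. This is not the actual zipformer/train.py structure:

max_steps = 200_000       # stop by step count instead of epochs
valid_every = 2_000       # run validation every N steps
save_every = 10_000       # save a checkpoint every N steps

step = 0
for batch in train_dl:    # never finishes, because the cut sets are infinite
    loss = compute_loss(model, batch)        # hypothetical helper
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    step += 1

    if step % valid_every == 0:
        run_validation(model, valid_dl)      # hypothetical helper
    if step % save_every == 0:
        save_checkpoint(model, optimizer, scheduler, step)  # hypothetical helper
    if step >= max_steps:
        break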

ok, thanks

@pzelasko
Hi,
I have a question about .repeat()

[screenshot: zipformer_map]

When I use .repeat() on train_cuts, it seems to call the map function in every epoch:

epoch 1:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PÈRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PũĩRE LACHAISE CEMETERY IN PARIS
epoch 2:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PũĩRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŎŪŎRE LACHAISE CEMETERY IN PARIS
epoch 3:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PūŎŪŎRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŐūįūŐūįRE LACHAISE CEMETERY IN PARIS
epoch 4:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PūŐūįūŐūįRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŐūıūŐŪŔūŐūıūŐŪŔRE LACHAISE CEMETERY IN PARIS

It looks very weird: the original character (i.e. È) gets converted into other characters, and the text grows longer and longer every epoch.

If I don't use .repeat(), this doesn't happen.

Is this a bug with .repeat()?

You may want to make a copy of both the cut and the supervision inside the map function to avoid repeated application of this function, e.g.:

from lhotse.utils import fastcopy
return fastcopy(cut, supervisions=[fastcopy(cut.supervisions[0], text=new_text)])
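
For context, a minimal sketch of a full map function built around that pattern; normalize_text stands in for whatever text transformation is being applied (hypothetical name):

from lhotse.utils import fastcopy

def encode_cut(cut):
    # Build a new supervision and a new cut instead of mutating the
    # originals, so re-applying the map over epochs cannot stack changes.
    new_text = normalize_text(cut.supervisions[0].text)   # hypothetical helper
    new_sup = fastcopy(cut.supervisions[0], text=new_text)
    return fastcopy(cut, supervisions=[new_sup])

# train_cuts = train_cuts.map(encode_cut)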

Thanks for your reply.
It's working for me.

But why does the map function get executed and modify the original data in every epoch when using .repeat(), while valid_cuts does not get modified?
In my test, I use .repeat(1) on train_cuts but not on valid_cuts.

How can I modify the code (zipformer/train.py) to set the maximum number of steps instead of epochs? Currently, the parameter can only specify the number of epochs.

Something else to watch out for: in case you do this, you should use the Eden2 scheduler, not the Eden scheduler.
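
A hedged sketch of that swap in zipformer/train.py; the import path and the Eden2 constructor arguments shown here are assumptions, so check the Eden2 definition in zipformer/optim.py for the actual signature:

from optim import Eden2   # defined alongside Eden in zipformer/optim.py

# Eden2 depends only on the batch count, not on the epoch count, which fits
# step-based training. NOTE: the argument below is an assumption; verify it
# against zipformer/optim.py before using.
scheduler = Eden2(optimizer, lr_batches=7500)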

But why does the map function get executed and modify the original data in every epoch when using .repeat(), while valid_cuts does not get modified? In my test, I use .repeat(1) on train_cuts but not on valid_cuts.

It might be related to eager vs. lazy cut sets. A lazy cut set is read from the file on each iteration, so the mutating changes are not persistent. With eager cut sets they are persistent and stack on top of each other on each iteration.
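
A small sketch of the difference, with a hypothetical manifest path; the first map function mutates the supervision in place (the pattern that stacks on an eager cut set), the second copies as suggested above:

from lhotse import load_manifest, load_manifest_lazy
from lhotse.utils import fastcopy

def mutate_in_place(cut):
    # Unsafe with eager cut sets: the same in-memory supervision object is
    # changed again on every epoch, so the edits accumulate.
    cut.supervisions[0].text = cut.supervisions[0].text + " <eos>"
    return cut

def copy_then_change(cut):
    # Safe either way: new objects are returned, nothing accumulates.
    sup = fastcopy(cut.supervisions[0], text=cut.supervisions[0].text + " <eos>")
    return fastcopy(cut, supervisions=[sup])

# Hypothetical path. A lazy set is re-read from disk every epoch, so even
# the in-place version looks "fresh" each time; an eager set is not.
lazy_cuts = load_manifest_lazy("data/train_cuts.jsonl.gz").map(mutate_in_place)
eager_cuts = load_manifest("data/train_cuts.jsonl.gz").map(copy_then_change)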

Thanks for your help