k2-fsa / icefall

Home Page: https://k2-fsa.github.io/icefall/

How to use CutSet.mux effectively?

johnchienbronci opened this issue

@pzelasko
Hi, I have a question.
According to the weights description,

 Since the iterables might be of different length, we provide a ``weights`` parameter to let the user decide which iterables should be sampled more frequently than others. 
When an iterable is exhausted, we will keep sampling from the other iterables, until we exhaust them all

If there are 3 datasets in different languages and their sizes vary greatly, e.g.:
a = lhotse.load_manifest_lazy(...)  # len(a) = 300000
b = lhotse.load_manifest_lazy(...)  # len(b) = 1000000
c = lhotse.load_manifest_lazy(...)  # len(c) = 50000

CutSet.mux(a, b, c, weights=[len(a), len(b), len(c)])

In this situation:

  1. Will it sample them a bit more uniformly throughout the whole output sequence?

  2. Will the model adapt poorly to the c data (c is much smaller, so it is exhausted faster)?

  3. Is it possible to first expand the data of the three different languages to the same amount,
    similar to: interleave_datasets(..., stopping_strategy="all_exhausted")?

    Is there any other way?

  1. Yes, the distribution of examples in mini-batches will be stationary (the same as the weights in mux) until some iterator ends (the tail of the iteration doesn't preserve that anymore).
  2. It's hard to say, but typically oversampling smaller datasets helps to get better results on them.
  3. I suggest calling .repeat().shuffle() on every input cutset to make them infinite and tweaking the weights; see the sketch after this list. The training should be tracked by steps and not epochs after that. You'll ensure that throughout the whole training the model observes the dataset distribution you want.
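
A minimal sketch of that suggestion (the manifest paths and the weights below are hypothetical placeholders to adapt):

import lhotse
from lhotse import CutSet

# Load each language's cuts lazily (paths are hypothetical).
a = lhotse.load_manifest_lazy("data/lang_a_cuts.jsonl.gz")
b = lhotse.load_manifest_lazy("data/lang_b_cuts.jsonl.gz")
c = lhotse.load_manifest_lazy("data/lang_c_cuts.jsonl.gz")

# Make every stream infinite and shuffled, then mux with the desired
# sampling proportions. Since no stream is ever exhausted, the mixture
# stays stationary for the whole (step-based) training run.
train_cuts = CutSet.mux(
    a.repeat().shuffle(),
    b.repeat().shuffle(),
    c.repeat().shuffle(),
    weights=[0.3, 0.5, 0.2],  # example weights, tune to taste
    seed=0,
)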

Thanks for your reply.

The training should be tracked by steps and not epochs after that. You’ll ensure that throughout the whole training the model observes the dataset distribution you want.

How can I modify the code (zipformer/train.py) to set the maximum number of steps instead of epochs?
Currently, the parameter can only specify the number of epochs.

You might need to modify the code a bit in that case. Since the dataloader will never finish iterating, you'd need to move validation and checkpoint saving into the training loop so they are executed every N steps (see the sketch below). Also make sure to set a different random seed if you continue the training, otherwise you'll iterate over the same data as before (most Lhotse classes accept a "trng" seed value to automatically randomize the seed at the cost of non-reproducibility).
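
A rough sketch of what that step-based loop could look like; compute_loss, run_validation, and save_checkpoint are hypothetical helpers, and model, optimizer, scheduler, train_dl, and valid_dl are assumed to already exist. This is not the actual zipformer/train.py structure:

max_steps = 200_000       # stop by step count instead of epochs
valid_every = 2_000       # run validation every N steps
save_every = 10_000       # save a checkpoint every N steps

step = 0
for batch in train_dl:    # never finishes, because the cut sets are infinite
    loss = compute_loss(model, batch)        # hypothetical helper
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    step += 1

    if step % valid_every == 0:
        run_validation(model, valid_dl)      # hypothetical helper
    if step % save_every == 0:
        save_checkpoint(model, optimizer, scheduler, step)  # hypothetical helper
    if step >= max_steps:
        break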

ok, thanks

@pzelasko
Hi,
I have a question about .repeat()

[screenshot: zipformer_map]

When I use .repeat() on train_cuts, it seems to call the map function in every epoch:

epoch 1:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PÈRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PũĩRE LACHAISE CEMETERY IN PARIS
epoch 2:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PũĩRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŎŪŎRE LACHAISE CEMETERY IN PARIS
epoch 3:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PūŎŪŎRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŐūįūŐūįRE LACHAISE CEMETERY IN PARIS
epoch 4:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PūŐūįūŐūįRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŐūıūŐŪŔūŐūıūŐŪŔRE LACHAISE CEMETERY IN PARIS

It looks very weird: the original character (i.e. È) gets converted into other characters, and the text grows longer and longer every epoch.

If I don't use .repeat(), this doesn't happen.

Is this a bug with .repeat()?

You may want to make a copy of both the cut and the supervision inside the map function to avoid repeated application of this function, e.g.:

from lhotse.utils import fastcopy
return fastcopy(cut, supervisions=[fastcopy(cut.supervisions[0], text=new_text)])
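
For context, a minimal sketch of a full map function built around that pattern; normalize_text stands in for whatever text transformation is being applied (hypothetical name):

from lhotse.utils import fastcopy

def encode_cut(cut):
    # Build a new supervision and a new cut instead of mutating the
    # originals, so re-applying the map over epochs cannot stack changes.
    new_text = normalize_text(cut.supervisions[0].text)   # hypothetical helper
    new_sup = fastcopy(cut.supervisions[0], text=new_text)
    return fastcopy(cut, supervisions=[new_sup])

# train_cuts = train_cuts.map(encode_cut)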

Thanks for your reply.
It's working for me.

But why does the map function get executed and modify the original data in every epoch when using .repeat(), while valid_cuts does not get modified?
In my test, I use .repeat(1) on train_cuts but not on valid_cuts.

How can I modify the code (zipformer/train.py) to set the maximum number of steps instead of epochs? Currently, the parameter can only specify the number of epochs.

Something else to watch out for: in case you do this, you should use the Eden2 scheduler, not the Eden scheduler.
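
A hedged sketch of that swap in zipformer/train.py; the import path and the Eden2 constructor arguments shown here are assumptions, so check the Eden2 definition in zipformer/optim.py for the actual signature:

from optim import Eden2   # defined alongside Eden in zipformer/optim.py

# Eden2 depends only on the batch count, not on the epoch count, which fits
# step-based training. NOTE: the argument below is an assumption; verify it
# against zipformer/optim.py before using.
scheduler = Eden2(optimizer, lr_batches=7500)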

But why does the map function get executed and modify the original data in every epoch when using .repeat(), while valid_cuts does not get modified? In my test, I use .repeat(1) on train_cuts but not on valid_cuts.

It might be related to eager vs. lazy cut sets. A lazy cut set is read from the file on each iteration, so the mutating changes are not persistent. With eager cut sets they are persistent and stack on top of each other on each iteration.
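
A small sketch of the difference, with a hypothetical manifest path; the first map function mutates the supervision in place (the pattern that stacks on an eager cut set), the second copies as suggested above:

from lhotse import load_manifest, load_manifest_lazy
from lhotse.utils import fastcopy

def mutate_in_place(cut):
    # Unsafe with eager cut sets: the same in-memory supervision object is
    # changed again on every epoch, so the edits accumulate.
    cut.supervisions[0].text = cut.supervisions[0].text + " <eos>"
    return cut

def copy_then_change(cut):
    # Safe either way: new objects are returned, nothing accumulates.
    sup = fastcopy(cut.supervisions[0], text=cut.supervisions[0].text + " <eos>")
    return fastcopy(cut, supervisions=[sup])

# Hypothetical path. A lazy set is re-read from disk every epoch, so even
# the in-place version looks "fresh" each time; an eager set is not.
lazy_cuts = load_manifest_lazy("data/train_cuts.jsonl.gz").map(mutate_in_place)
eager_cuts = load_manifest("data/train_cuts.jsonl.gz").map(copy_then_change)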

Thanks for your help