facebookresearch / encodec

State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio.

Preparing Train Dataset (mixing strategy)

kthworks opened this issue

Thank you for your excellent work.

I am in the process of training the EnCodec model and have some questions regarding the mixing strategy.

I would like to understand how the full training dataset is composed. The paper describes the training/validation sampling as four strategies:
(s1) Sampling a single source from Jamendo with a probability of 0.32;
(s2) Sampling a single source from other datasets with the same probability;
(s3) Mixing two sources from all datasets with a probability of 0.24;
(s4) Mixing three sources from all datasets except music with a probability of 0.12.
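
To make my reading concrete, here is a minimal sketch of how I currently interpret these probabilities: one strategy is drawn independently for every training segment. Only the four probabilities come from the paper; the names `STRATEGY_PROBS`, `sample_strategy`, and `strategy_counts` are my own placeholders, and I do not know whether the authors implemented it this way.

```python
import random

# Probabilities quoted from the paper for the four strategies (they sum to 1.0).
STRATEGY_PROBS = {"s1": 0.32, "s2": 0.32, "s3": 0.24, "s4": 0.12}


def sample_strategy(rng):
    """Draw one strategy name for a single training segment (hypothetical helper)."""
    names = list(STRATEGY_PROBS)
    weights = list(STRATEGY_PROBS.values())
    return rng.choices(names, weights=weights, k=1)[0]


def strategy_counts(num_segments, seed=0):
    """Count how often each strategy is drawn over num_segments independent draws."""
    rng = random.Random(seed)
    counts = dict.fromkeys(STRATEGY_PROBS, 0)
    for _ in range(num_segments):
        counts[sample_strategy(rng)] += 1
    return counts


if __name__ == "__main__":
    # Empirical fractions should approach 0.32 / 0.32 / 0.24 / 0.12.
    total = 20000
    for name, count in strategy_counts(total).items():
        print(name, round(count / total, 3))
```

Under this interpretation the ratio applies to the segments drawn during training, not to the total hours stored in each dataset, so a larger corpus would not automatically contribute more segments.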

Does this mean that the training/validation dataset is composed of segments in the ratio s1/s2/s3/s4 = 32%/32%/24%/12%? In the appendix, Table 1 indicates that the Jamendo dataset is 919 hours long, whereas Common Voice is 9,096 hours long. Does that mean you did not use all the samples from Common Voice?

I would also like to know more about the process of applying reverberation. Apart from the samples available in DNS, how do you apply reverberation to samples from other datasets? Is there a way to calculate the room impulse response? I would appreciate it if you could let me know where I can refer to any related implementations.
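
For reference, this is the kind of approach I have been considering as a fallback: convolving the dry signal with a room impulse response, here a synthetic one built from exponentially decaying noise. It is only a sketch under my own assumptions; the functions `synthetic_rir` and `apply_reverb`, the RT60 parameter, and the normalization choices are hypothetical and may not match what was done for the paper.

```python
import numpy as np


def synthetic_rir(sample_rate, rt60, rng):
    """Build a toy room impulse response: Gaussian noise with an exponential decay.

    rt60 is the time in seconds for the envelope to fall by 60 dB
    (a factor of 1000 in amplitude). This is an assumed stand-in for a
    measured impulse response.
    """
    num_samples = int(round(sample_rate * rt60))
    t = np.arange(num_samples) / sample_rate
    envelope = np.exp(-np.log(1000.0) * t / rt60)
    rir = rng.standard_normal(num_samples) * envelope
    rir[0] = 1.0  # keep a clear direct path
    return rir / np.sqrt(np.sum(rir ** 2))  # unit energy


def apply_reverb(dry, rir):
    """Convolve a mono signal with an impulse response, keeping the original length."""
    wet = np.convolve(dry, rir)[: len(dry)]
    peak = np.max(np.abs(wet))
    if peak > 1.0:
        wet = wet / peak  # rescale only when needed to avoid clipping
    return wet


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample_rate = 8000
    dry = 0.1 * rng.standard_normal(sample_rate)  # one second of placeholder audio
    wet = apply_reverb(dry, synthetic_rir(sample_rate, 0.25, rng))
    print(wet.shape, float(np.max(np.abs(wet))))
```

A measured impulse response loaded from a file could be passed to `apply_reverb` in place of the synthetic one.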

If anyone can help with this, please leave a comment. Thank you :)