jefflai108 / Contrastive-Predictive-Coding-PyTorch

Contrastive Predictive Coding for Automatic Speaker Verification

How to combine MFCC and CPC features

cyl250 opened this issue · comments

Thank you for sharing your code. I have run into a problem.
When I use CPC, the features are [128, 256], but the MFCCs are [frame, 39].
Following your results, I wonder how to combine them into [frame, 39 + 256] dims.
Thanks again

hi @cyl250
It is common to combine features by simply concatenating them (along the feature dimension).

The CPC features are [num_frames, 256] and the MFCCs are [num_frames, 39]. Concatenating them gives [num_frames, 256 + 39].
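In case it helps, here is a minimal sketch of that concatenation (the variable names and random data are placeholders, not from this repo). Since CPC and MFCC framing can differ by a frame or two on the same utterance, a common workaround is to trim both to the shorter length first:

```python
import numpy as np

# Placeholder features for one utterance, frame-aligned as in the thread:
# CPC output is [num_frames, 256], MFCCs are [num_frames, 39].
num_frames = 128
cpc = np.random.randn(num_frames, 256)       # e.g. output of model.predict()
mfcc = np.random.randn(num_frames + 1, 39)   # framing may differ slightly

# Trim to the shorter stream, then concatenate along the feature axis.
n = min(cpc.shape[0], mfcc.shape[0])
combined = np.concatenate([cpc[:n], mfcc[:n]], axis=1)
print(combined.shape)  # (128, 295)
```

The same pattern works with `torch.cat([cpc, mfcc], dim=1)` if you keep everything as tensors.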

Sorry, I am having some trouble understanding.
After model.predict(), the CPC features have shape [128, 256].
Do I need to change the number of nodes in the network so that model.predict() returns [num_frames, 256] vectors?

128 is the number of frames during TRAINING. In CPC training, random chunks of the raw waveform are selected and fed to the encoder. For example, a random chunk of 20480 samples corresponds to 1.28 seconds, or 128 frames, for 16 kHz audio.
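The arithmetic behind those numbers can be sketched as follows, assuming the encoder's total downsampling factor is 160 samples per frame (my assumption; it is what makes 20480 samples come out to 128 frames at 16 kHz):

```python
# Frame arithmetic for the CPC training chunks.
sample_rate = 16000     # Hz, as stated above
chunk_samples = 20480   # samples per random training chunk
hop = 160               # assumed encoder downsampling factor (samples/frame)

duration_s = chunk_samples / sample_rate  # 1.28 seconds
num_frames = chunk_samples // hop         # 128 frames
print(duration_s, num_frames)  # 1.28 128
```

At inference time the same arithmetic applies to the full utterance length, so the frame count varies per utterance instead of being fixed at 128.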

During inference, you should input the entire utterance instead of fixed-size chunks. This will give you the correct number of frames for that utterance rather than a fixed 128.