modelscope / 3D-Speaker

A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization

How to compute the ERes2Net model param?

JiJiJiang opened this issue · comments

Hello, I use the same model parameters as your config at https://github.com/alibaba-damo-academy/3D-Speaker/blob/6f6ed3189a4d1db040586a518c8e5d80f4fc0665/egs/3dspeaker/sv-eres2net/conf/eres2net.yaml, but I get 9.88M (yours is 4.6M).

Here is the way I compute the model params:
[screenshot: parameter-counting code appended to the end of ResNet.py]
I'm wondering where the difference is?
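
Roughly, the check looks like this (a minimal sketch, not the exact screenshot code; the constructor keyword arguments and feat_dim=80 are assumptions based on the eres2net.yaml config):

```python
# Minimal sketch of the parameter count. Assumes ERes2Net from ResNet.py
# accepts feat_dim / embedding_size keyword arguments as in eres2net.yaml;
# feat_dim=80 assumes the usual 80-dim Fbank setup. The screenshot code may differ.
from ResNet import ERes2Net

model = ERes2Net(feat_dim=80, embedding_size=512)
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e6:.2f}M")  # comes out around 9.88M this way
```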

Even when I set embedding_size=192, I still got 6.61M.

The key difference arises from how we compute the model parameters. Since the classifier isn't used during inference, it is not counted in the reported model parameters.

Thank you for your answer!
But which part of the ERes2Net model do you mean by the classifier? Is it the output linear layer that maps the embedding to the speaker labels? If so, it is not defined in the model.

Yes, the classifier refers to the output linear layer that maps the embedding to the speaker labels. Since these parameters are discarded during inference, they are not included in the model parameter count.
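
To make the distinction concrete, here is a hedged illustration (the num_speakers value is a placeholder, not taken from the 3D-Speaker recipe): during training, an extra linear head of size embedding_size × num_speakers sits on top of the embedding, and it is dropped at inference, so it contributes nothing to the reported parameter count.

```python
# Illustration only: num_speakers is a placeholder, not the recipe's value.
# The point is that the classification head's parameters scale with the number
# of training speakers and are discarded at inference time.
import torch.nn as nn

embedding_size, num_speakers = 512, 7000  # placeholder speaker count
classifier = nn.Linear(embedding_size, num_speakers)
head_params = sum(p.numel() for p in classifier.parameters())
print(f"Classifier head: {head_params / 1e6:.2f}M parameters (not counted)")
```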

Thank you for your answer. I directly initialize the ERes2Net model as defined in ResNet.py, which does not contain the classifier you mention above. The code lines in the screenshot are appended directly to the end of your ResNet.py, and I run python ResNet.py. So I think my calculation result should be consistent with yours. What is wrong with my code?

It would be nice if you could share the code you use to calculate the model parameters. Thanks so much!

Apologies for my oversight; I overlooked the parameters following the statistics pooling layer. With an embedding size of 192, the model parameters total 6.61M; with an embedding size of 512, they amount to 9.88M. I'll update this in the arXiv paper and on GitHub soon. Thank you very much for the reminder.
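
A quick back-of-envelope check using only the numbers quoted in this thread (the implied dimension below is inferred from those totals, not read from ResNet.py): the linear layer after statistics pooling contributes roughly D × embedding_size parameters, which is why the total moves with the embedding size.

```python
# Back-of-envelope check using only the figures quoted in this thread.
# D (the pooled-statistics dimension feeding the embedding layer) is inferred
# from the difference between the two totals, not read from the code.
params_512 = 9.88e6   # total with embedding_size=512
params_192 = 6.61e6   # total with embedding_size=192
D = (params_512 - params_192) / (512 - 192)
print(f"Implied pooled-statistics dimension: ~{D:,.0f}")  # roughly 10,000
```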