iariav / End-to-End-VAD

Audio-Visual Voice Activity Detection using Deep Learning

Difference between published and reproduced results

KumudTripathi opened this issue

Hello Team,

Thanks for providing the repo.

I replicated this repo step by step, following the details given in the paper and in this repo.
First I trained both streams separately, then used their pretrained weights to train the multimodal architecture.

In my experiments, there is a mismatch between the result I obtained (accuracy ~82%) and the published result (accuracy ~91%).
Could the team offer some guidance on how to reach the published results?

Thanks in advance.