BUTSpeechFIT / CALLHOME_sublists

"Official" train/dev/test?

hbredin opened this issue

I have never reported results on CALLHOME because of the (apparent) lack of an official train/validation/test split (or at least validation/test split).

What experimental protocol does BUT use for reporting results?
Validation on part1, test on part2?
Validation on part2, test on part1?
Both?

cc @fnlandini

Hi @hbredin
Thanks for bringing this up.
It is true that even our setup has evolved through time.
Following the setup that we inherited from JSALT 2016, in our original works with VBHMM clustering-based methods (i.e. 1 and 2) we reported results on the whole set, excluding the file iaeu because it had labeling errors.
Later on, following the partition from Kaldi, we used part1 as validation and part2 as test, and vice versa, for cross-validation when tuning VBx hyperparameters. Still, we reported results on the whole set, using oracle VAD.
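
For concreteness, that two-way tuning protocol could be sketched as below. The `tune` and `score` callables are hypothetical placeholders, not BUT's actual code; the sketch only illustrates the direction of tuning and how a whole-set DER falls out:

```python
def two_way_eval(part1, part2, tune, score):
    """Tune hyperparameters on one CALLHOME partition and evaluate on the
    other, in both directions, then report DER pooled over the whole set.

    `tune(files)` returns hyperparameters; `score(files, params)` returns
    (error_seconds, scored_seconds) for those files. Both are placeholders.
    """
    err2, tot2 = score(part2, tune(part1))  # tune on part1, test on part2
    err1, tot1 = score(part1, tune(part2))  # tune on part2, test on part1
    # DER over the whole set = total error time / total scored time
    return (err1 + err2) / (tot1 + tot2)
```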

However, because Hitachi folks started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models, we did the same in our latest end-to-end work and continue with this setup. That said, this is mainly because the community seems to have adopted it and we wanted to be able to compare against existing results.

Thanks. That's very helpful.

So all papers by Hitachi use part1 for fine-tuning and part2 for testing?

What about updating the README with your answer? This would definitely help the community (in the same way AMI-diarization-setup does for AMI).

> However, because Hitachi folks started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models

Yes, we used this setup.

Thanks for sharing. FYI, in our previous work we did 5-fold evaluation.

We randomly partition the dataset into five subsets and, each time, leave one subset out for evaluation while training UIS-RNN on the other four. We then combine the evaluations on the five subsets and report the averaged DER.
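
The 5-fold protocol described above could be sketched as follows. Here `train_and_score` is a hypothetical callable standing in for UIS-RNN training plus DER scoring; only the partitioning and averaging logic is illustrated:

```python
import random

def five_fold_der(file_ids, train_and_score, seed=0):
    """Randomly partition file IDs into five subsets; for each fold, train on
    the other four subsets, score the held-out one, and average the DERs.

    `train_and_score(train_ids, eval_ids)` is a placeholder returning the
    DER obtained on `eval_ids` after training on `train_ids`.
    """
    ids = list(file_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for reproducibility
    folds = [ids[i::5] for i in range(5)]
    ders = []
    for k, held_out in enumerate(folds):
        train = [f for j, fold in enumerate(folds) if j != k for f in fold]
        ders.append(train_and_score(train, held_out))
    return sum(ders) / len(ders)
```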


> However, because Hitachi folks started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models, we did the same in our latest end-to-end work and continue with this setup. However, this is mainly because the community seems to have adopted this setup and we wanted to be able to compare against existing results.

Yes, we used the same setup recently (cc @popcornell) where part1 was used for adaptation.

Thanks everyone for the comments.
@hbredin I've added a pointer to this issue in the README, and we can keep it open for future reference.

Thanks everyone for your feedback!
Let's make our (future) results comparable :)

There's one more thing that needs to be checked before our results really are comparable: the reference labels. Would it be possible to share them here as well?

> There's one more thing that needs to be checked before our results really are comparable: the reference labels. Would it be possible to share them here as well?

The ones I used are shared here: https://github.com/google/speaker-id/tree/master/publications/LstmDiarization/evaluation/NIST_SRE2000

Disk 8 is CALLHOME, and Disk 6 is SwitchBoard.
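
Since those reference labels are RTTM files, a minimal reader for their SPEAKER records might look like this (a sketch assuming the standard RTTM column layout, with the speaker name in the eighth field):

```python
def load_rttm(path):
    """Parse SPEAKER lines of an RTTM file into a list of
    (file_id, onset_seconds, duration_seconds, speaker) tuples."""
    segments = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            # RTTM columns: type file chan onset dur ortho stype name conf [slat]
            if fields and fields[0] == "SPEAKER":
                segments.append(
                    (fields[1], float(fields[3]), float(fields[4]), fields[7])
                )
    return segments
```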

Thanks @wq2012. That is what I started using as well.
Can anyone else confirm that this is the only version circulating in our community?

Hi Herve,

CALLHOME is proprietary LDC data that can only be obtained after purchase, and we believe we might violate copyright if we published the reference files from it.
But given that @wq2012 publicly shared his, yes, they are the same we use. With the exception that, as mentioned above, we do not use the file iaeu.

We will consult with LDC on whether we can directly share our RTTM files here. It would be good to have it all together in the repository, but we prefer to be on the safe side and get approval first.

Hmm, are you sure?

Is that the same version as the LDC CALLHOME?

IIRC we simply searched Google and downloaded them from other publicly available sites, and assumed they had already been publicly circulated.

> We will consult with LDC if we can directly share our rttm files here, it would be good to have it all together in the repository, but we prefer to be on the safer side and get an approval first.

Totally makes sense. Thanks!

@wq2012, there are several CALLHOME LDC datasets. That is why CALLHOME can refer to so many different sets in publications.
This specific CALLHOME data is not that easy to find, unless you know the origin. It is part of the 2000 NIST Speaker Recognition Evaluation, which can be found under LDC Catalog No. LDC2001S97.
The references were released as part of the NIST keys after the evaluation.

We are waiting for a response from LDC, we will write an update after we hear from them.

Thanks! But I don't think the references are included in any of the LDC Catalogs.

For future reference, the RTTMs are also here: http://www.openslr.org/resources/10/sre2000-key.tar.gz