BUTSpeechFIT / CALLHOME_sublists

"Official" train/dev/test?

hbredin opened this issue

I have never reported results on CALLHOME because of the (apparent) lack of an official train/validation/test split (or at least validation/test split).

What experimental protocol does BUT use for reporting results?
Validation on part1, test on part2?
Validation on part2, test on part1?
Both?

cc @fnlandini

Hi @hbredin
Thanks for bringing this up.
It is true that even our setup has evolved through time.
Following the setup that we inherited from JSALT 2016, in our original works with VBHMM clustering-based methods (i.e. 1 and 2) we reported results on the whole set, excluding the file iaeu because it had labeling errors.
Later on, following the partition from Kaldi, we used part1 as validation and part2 as test, and vice versa, for cross-validation when tuning VBx hyperparameters. Still, we reported results on the whole set, using oracle VAD.
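
For concreteness, that two-way tuning protocol could be sketched as below. The `tune` and `score` callables are hypothetical placeholders, not BUT's actual code; the sketch only illustrates the direction of tuning and how a whole-set DER falls out:

```python
def two_way_eval(part1, part2, tune, score):
    """Tune hyperparameters on one CALLHOME partition and evaluate on the
    other, in both directions, then report DER pooled over the whole set.

    `tune(files)` returns hyperparameters; `score(files, params)` returns
    (error_seconds, scored_seconds) for those files. Both are placeholders.
    """
    err2, tot2 = score(part2, tune(part1))  # tune on part1, test on part2
    err1, tot1 = score(part1, tune(part2))  # tune on part2, test on part1
    # DER over the whole set = total error time / total scored time
    return (err1 + err2) / (tot1 + tot2)
```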

However, because Hitachi folks started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models, we did the same in our latest end-to-end work and continue with this setup. That said, this is mainly because the community seems to have adopted it and we wanted to be able to compare against existing results.

Thanks. That's very helpful.

So all papers by Hitachi use part1 for fine-tuning and part2 for testing?

What about updating the README with your answer? This would definitely help the community (in the same way AMI-diarization-setup does for AMI).

> However, because Hitachi folks started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models

Yes, we used this setup.

Thanks for sharing. FYI, in our previous work we did 5-fold evaluation.

We randomly partition the dataset into five subsets and, each time, leave one subset out for evaluation while training UIS-RNN on the other four. We then combine the evaluations on the five subsets and report the averaged DER.
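
The 5-fold protocol described above could be sketched as follows. Here `train_and_score` is a hypothetical callable standing in for UIS-RNN training plus DER scoring; only the partitioning and averaging logic is illustrated:

```python
import random

def five_fold_der(file_ids, train_and_score, seed=0):
    """Randomly partition file IDs into five subsets; for each fold, train on
    the other four subsets, score the held-out one, and average the DERs.

    `train_and_score(train_ids, eval_ids)` is a placeholder returning the
    DER obtained on `eval_ids` after training on `train_ids`.
    """
    ids = list(file_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for reproducibility
    folds = [ids[i::5] for i in range(5)]
    ders = []
    for k, held_out in enumerate(folds):
        train = [f for j, fold in enumerate(folds) if j != k for f in fold]
        ders.append(train_and_score(train, held_out))
    return sum(ders) / len(ders)
```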


> However, because Hitachi folks started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models, we did the same in our latest end-to-end work and continue with this setup. However, this is mainly because the community seems to have adopted this setup and we wanted to be able to compare against existing results.

Yes, we used the same setup recently (cc @popcornell) where part1 was used for adaptation.

Thanks everyone for the comments.
@hbredin I've added a pointer to this issue in the README, and we can keep it open for future reference.

Thanks everyone for your feedback!
Let's make our (future) results comparable :)

There's one more thing that needs to be checked before our results really are comparable: the reference labels. Would it be possible to share them here as well?

> There's one more thing that needs to be checked before our results really are comparable: the reference labels. Would it be possible to share them here as well?

The ones I used are shared here: https://github.com/google/speaker-id/tree/master/publications/LstmDiarization/evaluation/NIST_SRE2000

Disk 8 is CALLHOME, and Disk 6 is SwitchBoard.
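
Since those reference labels are RTTM files, a minimal reader for their SPEAKER records might look like this (a sketch assuming the standard RTTM column layout, with the speaker name in the eighth field):

```python
def load_rttm(path):
    """Parse SPEAKER lines of an RTTM file into a list of
    (file_id, onset_seconds, duration_seconds, speaker) tuples."""
    segments = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            # RTTM columns: type file chan onset dur ortho stype name conf [slat]
            if fields and fields[0] == "SPEAKER":
                segments.append(
                    (fields[1], float(fields[3]), float(fields[4]), fields[7])
                )
    return segments
```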

Thanks @wq2012. That is what I started using as well.
Can anyone else confirm that this is the only version circulating in our community?

Hi Herve,

CALLHOME is proprietary LDC data that can only be obtained after purchase, and we believe we might violate copyright if we published the reference files from it.
But given that @wq2012 publicly shared his, yes, they are the same we use. With the exception that, as mentioned above, we do not use the file iaeu.

We will consult with LDC on whether we can directly share our RTTM files here. It would be good to have it all together in the repository, but we prefer to be on the safe side and get approval first.

Hmm, are you sure?

Is that the same version as the LDC CALLHOME?

IIRC we simply searched Google and downloaded them from other publicly available sites, and assumed they had already been publicly circulated.

> We will consult with LDC if we can directly share our rttm files here, it would be good to have it all together in the repository, but we prefer to be on the safer side and get an approval first.

Totally makes sense. Thanks!

@wq2012, there are several CALLHOME LDC datasets. That is why CALLHOME can refer to so many different sets in publications.
This specific CALLHOME data is not that easy to find, unless you know the origin. It is part of the 2000 NIST Speaker Recognition Evaluation, which can be found under LDC Catalog No. LDC2001S97.
The references were released as part of the NIST keys after the evaluation.

We are waiting for a response from LDC, we will write an update after we hear from them.

Thanks! But I don't think the references are included in any of the LDC Catalogs.

For future reference, the RTTMs are also here: http://www.openslr.org/resources/10/sre2000-key.tar.gz