minzwon / sota-music-tagging-models

Data splitting of MTAT dataset

suncerock opened this issue

Hi, thank you for your great work and for building benchmark results for all the representative auto-tagging models. I have a further question on the data splitting of the MTAT dataset.

In the SMC paper, you mentioned that you did not discard the tracks with no associated labels (which might lead to performance decay). However, both the split npy files in this repo and the split files in the repo you referred to in the SMC paper discard those tracks. Could I kindly ask whether the results are based on the cleaned version of the dataset, which discards those tracks?

For your reference, the original version should have 18706 tracks for training, 1825 for validation, and 5329 for testing (25860 in total). The clean version should have 15247 for training, 1529 for validation, and 4332 for testing (21108 in total).

Hi,

If I remember correctly, the results in the table use the cleaned version (21k clips in total). Due to the data split discrepancy described in the paper, we reproduced all the results using this new split.
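
For anyone who wants to check which version of a split they have, here is a minimal sketch of the comparison discussed above. It assumes the labels are available as a binary matrix with one row per track and that each split is an array of row indices. The names `binary`, `train_idx`, `test_idx` and the helper `count_labeled` are hypothetical and only for illustration; they are not the repo's actual file layout. The toy data stands in for the real arrays.

```python
import numpy as np

def count_labeled(binary, split_indices):
    """Return (total, labeled) track counts for one split.

    binary: 2D array of 0/1 tags, one row per track (assumed layout).
    split_indices: 1D array of row indices belonging to the split.
    """
    labels = binary[split_indices]
    # A track counts as labeled if at least one tag is set.
    has_label = labels.sum(axis=1) > 0
    return len(split_indices), int(has_label.sum())

# Toy stand-in data: 5 tracks, 3 tags; tracks 1 and 3 have no tags.
binary = np.array([
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
])
train_idx = np.array([0, 1, 2])  # hypothetical split
test_idx = np.array([3, 4])      # hypothetical split

for name, idx in [("train", train_idx), ("test", test_idx)]:
    total, labeled = count_labeled(binary, idx)
    print(f"{name}: {total} tracks, {labeled} with at least one label")

# Totals quoted in this thread, checked for consistency.
assert 18706 + 1825 + 5329 == 25860  # original version
assert 15247 + 1529 + 4332 == 21108  # cleaned version
```

If the total and labeled counts match for every split, the split already excludes the unlabeled tracks, i.e. it is the cleaned version.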