Training code for our work VOCANO: A note transcription framework for singing voice in polyphonic music. For inference only, please see the VOCANO repository or Omnizart, which also offers inference for pitched instruments, vocals, chords, drum events, and beats.
Our training scripts run under Python 3 and CUDA 11.1 with the PyTorch framework.
$ git clone https://github.com/B05901022/Note-Segmentation-SSL.git
$ cd Note-Segmentation-SSL
$ pip install -r requirements.txt
To run the full pipeline, the datasets in each category should include the files below:
- Training Datasets (TONAS, DALI)
  - wav: includes raw waveforms in .wav format
  - sdt: includes the binary form of silent/duration/onset_not/onset/offset_not/offset numpy arrays, named in <data_name>_sdt.npy format
- Unlabeled Datasets (MIR-1K, Medley-DB, DALI, DALI_demucs)
  - wav: includes raw waveforms in .wav format
- Testing Datasets (ISMIR2014, DALI, DALI_demucs, CMedia_demucs)
  - wav: includes raw waveforms in .wav format
  - sdt: includes the binary form of silent/duration/onset_not/onset/offset_not/offset numpy arrays, named in <data_name>_sdt.npy format
  - pitch, pitch_intervals: pitch contour collected by the Patch-CNN pipeline
  - onoffset_intervals: includes the onset time and offset time of every note
The sdt, pitch, pitch_intervals, and onoffset_intervals files can be downloaded from here. We also provide a simple script that downloads them and constructs the default folder hierarchy used by our training script.
$ python file_prepare.py
Due to copyright issues, we cannot provide the audio files used in our training procedure. However, all the datasets used are publicly available, including TONAS, MIR-1K, Medley-DB, and DALI. Please follow the scripts provided by the original repositories to download the datasets manually, then either place the .wav files under the ../data/<dataset_name>/wav/ folder or change the data folder in the training/inference scripts (see next section). Note that for the DALI dataset, we only pick the ground truth data selected by the DALI paper, so it is not necessary to download the whole dataset. All the data picked/downloaded in our work are listed in the ./meta/ folder. Some of the data we used (CMedia and DALI) may no longer be available online, so please delete the labels in the ./meta/ folder for any files that cannot be downloaded so that the training/inference scripts run correctly.
To perform the vocal separation pipeline, please follow the instructions in the DEMUCS repo and put the generated .wav files in the ../data/<dataset_name>_demucs/wav/ folder. Other vocal separation toolkits can also be used; just ensure that the output files have the same names as the original .wav files and are placed under the same ../data/<dataset_name>_demucs/wav/ folder.
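The naming requirement above can be checked before training. The following is a minimal sketch, not part of the repository: the function name find_name_mismatches and the example folder paths are assumptions made for illustration.

```python
from pathlib import Path


def find_name_mismatches(original_wav_dir, separated_wav_dir):
    """Compare .wav file names in two folders.

    Returns (missing, extra): names found only in the original folder,
    and names found only in the separated folder.
    """
    original = {p.name for p in Path(original_wav_dir).glob("*.wav")}
    separated = {p.name for p in Path(separated_wav_dir).glob("*.wav")}
    return sorted(original - separated), sorted(separated - original)


if __name__ == "__main__":
    # Example paths only (assumed); adjust them to your own data folder.
    missing, extra = find_name_mismatches(
        "../data/DALI_test/wav", "../data/DALI_demucs_test/wav"
    )
    print("missing from separated folder:", missing)
    print("unexpected in separated folder:", extra)
```

Both lists should be empty when every separated file matches an original file name.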
To speed up unlabeled data loading for semi-supervised learning, it is also recommended to manually split the data into 8-second segments and name the files <data_name>_<segment_number>.wav.
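One possible way to produce such segments is sketched below using only the Python standard library. This is not the script used in our work: the function name split_wav is hypothetical, starting the segment numbers at 0 and keeping a shorter final segment are assumptions, and the wave module only handles PCM-encoded .wav files.

```python
import wave
from pathlib import Path


def split_wav(src_path, out_dir, segment_seconds=8):
    """Split one PCM .wav file into fixed-length segments.

    Segments are written as <data_name>_<segment_number>.wav.
    Assumptions of this sketch: numbering starts at 0, and the last
    segment is kept even if it is shorter than segment_seconds.
    """
    src_path = Path(src_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        frames_per_segment = segment_seconds * src.getframerate()
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break
            out_path = out_dir / f"{src_path.stem}_{index}.wav"
            with wave.open(str(out_path), "wb") as dst:
                # Copy channel count, sample width and rate from the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(out_path)
            index += 1
    return written
```

For example, split_wav("../data/MIR_1K/wav/<data_name>.wav", "<output_folder>") would write the segments of one file into the chosen output folder; both paths here are placeholders.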
The dataset hierarchy should look like the tree below. Training datasets should include the wav and sdt folders; unlabeled datasets (for semi-supervised learning) should include the wav folder; testing datasets should include the wav, sdt, pitch, pitch_intervals, and onoffset_intervals folders.
data/
└──TONAS/
    └──wav/
    └──sdt/
└──DALI_train/
    └──wav/
    └──sdt/
    └──pitch/
    └──pitch_intervals/
    └──onoffset_intervals/
└──MIR_1K/
    └──wav/
└──Medley_DB/
    └──wav/
└──ISMIR2014/
    └──wav/
    └──sdt/
    └──pitch/
    └──pitch_intervals/
    └──onoffset_intervals/
└──DALI_test/
    └──wav/
    └──sdt/
    └──pitch/
    └──pitch_intervals/
    └──onoffset_intervals/
└──CMedia/
    └──wav/
    └──sdt/
    └──pitch/
    └──pitch_intervals/
    └──onoffset_intervals/
└──DALI_demucs_test/ # Optional
    └──wav/
    └──sdt/
    └──pitch/
    └──pitch_intervals/
    └──onoffset_intervals/
└──CMedia_demucs/ # Optional
    └──wav/
    └──sdt/
    └──pitch/
    └──pitch_intervals/
    └──onoffset_intervals/
...
Note-Segmentation-SSL/
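Before training, it may help to confirm that each dataset folder contains the subfolders listed above. The sketch below is only an illustration and is not shipped with the repository: the category labels "train", "unlabeled", and "test" and the function name missing_subfolders are assumptions.

```python
from pathlib import Path

# Required subfolders per dataset category, as described above.
# The category labels themselves are assumed names for this sketch.
REQUIRED = {
    "train": ["wav", "sdt"],
    "unlabeled": ["wav"],
    "test": ["wav", "sdt", "pitch", "pitch_intervals", "onoffset_intervals"],
}


def missing_subfolders(data_root, dataset_name, category):
    """Return the required subfolders that do not exist for one dataset."""
    dataset_dir = Path(data_root) / dataset_name
    return [name for name in REQUIRED[category] if not (dataset_dir / name).is_dir()]


if __name__ == "__main__":
    # Example selection; list the datasets you actually use.
    datasets = {"TONAS": "train", "MIR_1K": "unlabeled", "ISMIR2014": "test"}
    for name, category in datasets.items():
        missing = missing_subfolders("../data", name, category)
        status = "OK" if not missing else "missing " + ", ".join(missing)
        print(f"{name}: {status}")
```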
To train from scratch, please modify ./script/train.sh to fit your needs. Logging through WandB is also supported, so training can be tracked in real time from your WandB account.
- Parameters
  - --model_type: Which model to use. Options: "PyramidNet_ShakeDrop"(default), "Resnet_18"
  - --loss_type: Pure supervised learning or semi-supervised learning with VAT. Options: "VAT"(default), "None"
  - --dataset1: Training dataset. Options: "TONAS"(default), "DALI_train", "DALI_orig_train", "DALI_demucs_train"
  - --dataset2: Unlabeled dataset for semi-supervised learning. Options: "MIR_1K"(default), "DALI_train", "DALI_demucs_train", "DALI_demucs_train_segment", "MedleyDB", "MedleyDB_segment", "None"
  - --dataset4: Validation dataset (only available for DALI). Options: "None"(default), "DALI_valid", "DALI_orig_valid", "DALI_demucs_valid"
  - --dataset5: Testing dataset. Options: "ISMIR2014"(default), "DALI_test", "DALI_orig_test", "DALI_demucs_test", "CMedia", "CMedia_demucs"
  - --data_path: The path where the datasets are installed. Default: "../data/"
  - --exp_name: The experiment name that will be tracked on WandB.
  - --project_name: The experiment series name that will be tracked on WandB.
  - --entity: Your WandB account name.
  - --amp_level: Mixed-precision level. Default: "O1"
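Since several of these arguments only accept a fixed set of strings, a small check can catch typos before a long run starts. The sketch below is a hypothetical helper, not the repository's argument parser: the option values are copied from the list above, while the names DOCUMENTED_OPTIONS and invalid_options are assumptions.

```python
# Allowed values copied from the parameter list above.
DOCUMENTED_OPTIONS = {
    "model_type": ["PyramidNet_ShakeDrop", "Resnet_18"],
    "loss_type": ["VAT", "None"],
    "dataset1": ["TONAS", "DALI_train", "DALI_orig_train", "DALI_demucs_train"],
    "dataset2": [
        "MIR_1K", "DALI_train", "DALI_demucs_train",
        "DALI_demucs_train_segment", "MedleyDB", "MedleyDB_segment", "None",
    ],
    "dataset4": ["None", "DALI_valid", "DALI_orig_valid", "DALI_demucs_valid"],
    "dataset5": [
        "ISMIR2014", "DALI_test", "DALI_orig_test",
        "DALI_demucs_test", "CMedia", "CMedia_demucs",
    ],
}


def invalid_options(chosen):
    """Return {name: value} for every chosen value that is not documented.

    Arguments without a fixed option list (for example data_path) are ignored.
    """
    return {
        name: value
        for name, value in chosen.items()
        if name in DOCUMENTED_OPTIONS and value not in DOCUMENTED_OPTIONS[name]
    }


if __name__ == "__main__":
    # "Resnet18" is a deliberate typo to show what the check reports.
    print(invalid_options({"model_type": "Resnet18", "dataset5": "CMedia"}))
```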
After modifying the training script, run the following command to start training.
$ bash script/train.sh
After training, you can also modify the --dataset5 and --checkpoint_name arguments in ./script/test.sh to test on other datasets. Run the following command to start testing.
$ bash script/test.sh
If you find our work useful, please consider citing our paper.
- VOCANO
@inproceedings{vocano,
title={{VOCANO}: A Note Transcription Framework For Singing Voice In Polyphonic Music},
author={Hsu, Jui-Yang and Su, Li},
booktitle={Proc. International Society of Music Information Retrieval Conference (ISMIR)},
year={2021}
}
- Omnizart
@article{wu2021omnizart,
title={Omnizart: A General Toolbox for Automatic Music Transcription},
author={Wu, Yu-Te and Luo, Yin-Jyun and Chen, Tsung-Ping and Wei, I-Chieh and Hsu, Jui-Yang and Chuang, Yi-Chin and Su, Li},
journal={arXiv preprint arXiv:2106.00497},
year={2021}
}