transcription
dutchsing009 opened this issue
Can you please give an example of a transcriptions.csv file with `name`, `ph_seq`, `ph_dur` and `ph_num` in it? I want to see a reference file.
If you have ever made DiffSinger datasets, you should be familiar with transcriptions.csv. See https://github.com/openvpi/MakeDiffSinger if you haven't done that before and want to learn more details. There is also a link to this SOME repository in https://github.com/openvpi/MakeDiffSinger/tree/main/variance-temp-solution, and you can understand everything once you reach that step.
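For reference, here is a minimal sketch of what a transcriptions.csv row might look like with those columns (the file name, phonemes and durations below are made up for illustration; the real phoneme inventory and any additional columns come from your own dataset and dictionary):

```csv
name,ph_seq,ph_dur,ph_num
sample_001,SP sh a y u SP,0.20 0.08 0.35 0.10 0.42 0.15,1 2 2 1
```

Note how the columns line up: `ph_dur` has one duration (in seconds) per phoneme in `ph_seq`, and the entries of `ph_num` group the phonemes into word/note units, so they must sum to the total phoneme count (here 1 + 2 + 2 + 1 = 6).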
1- Does this variance-temp-solution link work for English or French datasets?
OK, thanks. So if I understand this correctly: if I have `ph_seq`, `ph_dur` and `ph_num`, I can use SOME to get the MIDI sequence and MIDI duration sequence? If yes, I have 2 questions:
1- How can I obtain those 3: `ph_seq`, `ph_dur`, `ph_num`? I saw 2 tools, but I'm not sure if they will produce all 3!
https://github.com/wolfgitpr/LyricFA
https://github.com/Anjiurine/fast-phasr-next
Is there any other tool that will automatically generate the phoneme sequence, phoneme duration sequence and phoneme num for me?
2- How accurate are the generated MIDI sequence and MIDI duration sequence going to be? Like 100%? (I'm asking because if it isn't 100%, I think it will make the model hallucinate during SVS inference.)
- `ph_seq` and `ph_dur` should already be available once you have finished making your DiffSinger acoustic dataset; many tools and pipelines can produce them. But as far as I know, `ph_num` can only be obtained by the method described in the MakeDiffSinger repository, and unfortunately there is no proper method of automatic `ph_num` inference for polysyllabic languages like English and French yet. However, I already have an idea for this, as described in openvpi/MakeDiffSinger#11. If you have some suggestions, you can comment on that issue.
- The pretrained model of SOME is trained on pure Chinese datasets. Though SOME is language-irrelevant, it may not produce results as good as on its "native" language. But we do benefit from it in reducing the time cost of manual MIDI labeling, because of its ability to recognize slur notes and generate cent-level MIDI values.
Does this help?
https://github.com/colstone/ENG_dur_num
Yes, this can help to some degree. But I doubt that simply specifying all vowels is enough and proper for polysyllabic languages. A more detailed discussion was raised here: openvpi/MakeDiffSinger#12
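For context, the "specify all vowels" approach roughly means deriving `ph_num` by starting a new phoneme group at each vowel, with preceding consonants attached to that vowel's group and silences standing alone. A rough sketch of that heuristic (the vowel and breath sets below are made up for illustration; this is exactly the kind of rule whose adequacy for polysyllabic languages the linked discussion questions):

```python
VOWELS = {"a", "e", "i", "o", "u"}   # illustrative vowel set, not a real dictionary
BREATHS = {"SP", "AP"}               # silence / breath marks stand alone

def derive_ph_num(ph_seq: str) -> list[int]:
    """Derive ph_num groups from a space-separated phoneme sequence."""
    groups: list[int] = []
    pending = 0  # consonants waiting to attach to the next vowel
    for ph in ph_seq.split():
        if ph in BREATHS:
            if pending:           # flush stranded consonants first
                groups.append(pending)
                pending = 0
            groups.append(1)      # silence is its own group
        elif ph in VOWELS:
            groups.append(pending + 1)  # consonants attach to this vowel
            pending = 0
        else:
            pending += 1
    if pending:                   # trailing consonants form their own group
        groups.append(pending)
    return groups

print(derive_ph_num("SP sh a y u SP"))  # [1, 2, 2, 1]
```

The failure mode for English or French is that syllable boundaries do not always fall right before a vowel (consonant clusters can belong to either syllable), which is why a vowel list alone may not be sufficient.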