transcription
dutchsing009 opened this issue
Can you please give an example of a transcriptions.csv file with `name`, `ph_seq`, `ph_dur` and `ph_num` in it? I want to see a reference file.
If you have ever made DiffSinger datasets, you should be familiar with transcriptions.csv. See https://github.com/openvpi/MakeDiffSinger if you haven't done that before and want to learn more details. There is also a link to this SOME repository in https://github.com/openvpi/MakeDiffSinger/tree/main/variance-temp-solution, and you can understand everything once you reach that step.
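For reference, here is a minimal sketch of what a transcriptions.csv row might look like with those columns (the file name, phonemes and durations below are made up for illustration; the real phoneme inventory and any additional columns come from your own dataset and dictionary):

```csv
name,ph_seq,ph_dur,ph_num
sample_001,SP sh a y u SP,0.20 0.08 0.35 0.10 0.42 0.15,1 2 2 1
```

Note how the columns line up: `ph_dur` has one duration (in seconds) per phoneme in `ph_seq`, and the entries of `ph_num` group the phonemes into word/note units, so they must sum to the total phoneme count (here 1 + 2 + 2 + 1 = 6).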
1- Does this variance-temp-solution link work for English or French datasets?
OK, thanks. So if I understand this correctly: if I have `ph_seq`, `ph_dur` and `ph_num`, I can use SOME to get the MIDI sequence and MIDI duration sequence? If yes, I have 2 questions:
1- How can I obtain those 3: `ph_seq`, `ph_dur`, `ph_num`? I saw 2 tools, but I'm not sure if they will produce all 3!
https://github.com/wolfgitpr/LyricFA
https://github.com/Anjiurine/fast-phasr-next
Is there any other tool that will automatically generate the phoneme sequence, phoneme duration sequence and phoneme num for me?
2- How accurate are the generated MIDI sequence and MIDI duration sequence going to be? Like 100%? (I'm asking because if it isn't 100%, I think it will make the model hallucinate during SVS inference.)
- `ph_seq` and `ph_dur` should already be available once you have finished making your DiffSinger acoustic dataset; many tools and pipelines can produce them. But as far as I know, `ph_num` can only be obtained by the method described in the MakeDiffSinger repository, and unfortunately there is no proper method of automatic `ph_num` inference for polysyllabic languages like English and French yet. However, I already have an idea for this, as described in openvpi/MakeDiffSinger#11. If you have some suggestions, you can comment on that issue.
- The pretrained model of SOME is trained on pure Chinese datasets. Though SOME is language-irrelevant, it may not produce results as good as on its "native" language. But we do benefit from it in reducing the time cost of manual MIDI labeling, because of its ability to recognize slur notes and generate cent-level MIDI values.
Does this help?
https://github.com/colstone/ENG_dur_num
Yes, this can help to some degree. But I doubt that simply specifying all vowels is enough and proper for polysyllabic languages. A more detailed discussion was raised here: openvpi/MakeDiffSinger#12
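For context, the "specify all vowels" approach roughly means deriving `ph_num` by starting a new phoneme group at each vowel, with preceding consonants attached to that vowel's group and silences standing alone. A rough sketch of that heuristic (the vowel and breath sets below are made up for illustration; this is exactly the kind of rule whose adequacy for polysyllabic languages the linked discussion questions):

```python
VOWELS = {"a", "e", "i", "o", "u"}   # illustrative vowel set, not a real dictionary
BREATHS = {"SP", "AP"}               # silence / breath marks stand alone

def derive_ph_num(ph_seq: str) -> list[int]:
    """Derive ph_num groups from a space-separated phoneme sequence."""
    groups: list[int] = []
    pending = 0  # consonants waiting to attach to the next vowel
    for ph in ph_seq.split():
        if ph in BREATHS:
            if pending:           # flush stranded consonants first
                groups.append(pending)
                pending = 0
            groups.append(1)      # silence is its own group
        elif ph in VOWELS:
            groups.append(pending + 1)  # consonants attach to this vowel
            pending = 0
        else:
            pending += 1
    if pending:                   # trailing consonants form their own group
        groups.append(pending)
    return groups

print(derive_ph_num("SP sh a y u SP"))  # [1, 2, 2, 1]
```

The failure mode for English or French is that syllable boundaries do not always fall right before a vowel (consonant clusters can belong to either syllable), which is why a vowel list alone may not be sufficient.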