Can this model be applied to other data modalities?
ZDstandup opened this issue
freeloop_zhang commented
Hi, authors. Can this model be applied to other data modalities, such as audio or text?
Have you tried it?
Hope that you can give me some suggestions. Thanks in advance!
Jooyoung Choi commented
Directly applying our method to those modalities might be difficult. Low-frequency audio or masked text may not carry semantics the way low-resolution images do. Meanwhile, Guided-TTS uses an additional classifier to guide text-to-speech models.
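
To make the low-resolution versus low-frequency comparison concrete, here is a rough toy sketch (not taken from this repository). It assumes "low-pass" means average pooling followed by nearest-neighbour upsampling; the function names `low_pass_1d` and `low_pass_2d` are hypothetical, and NumPy is the only dependency.

```python
import numpy as np


def low_pass_1d(signal, factor):
    """Average-pool a 1-D signal by `factor`, then repeat values to restore its length."""
    signal = np.asarray(signal, dtype=float)
    if signal.size % factor != 0:
        raise ValueError("signal length must be divisible by factor")
    pooled = signal.reshape(-1, factor).mean(axis=1)
    return np.repeat(pooled, factor)


def low_pass_2d(image, factor):
    """Average-pool a 2-D image by `factor` per axis, then upsample back by repetition."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    if h % factor != 0 or w % factor != 0:
        raise ValueError("image sides must be divisible by factor")
    pooled = image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)


# Toy image: left half bright, right half dark. The coarse layout survives pooling.
image = np.zeros((8, 8))
image[:, :4] = 1.0
coarse_image = low_pass_2d(image, 4)

# Toy waveform: a tone with a period of 4 samples. Pooling by 4 averages it away.
tone = np.sin(2 * np.pi * np.arange(16) / 4)
coarse_tone = low_pass_1d(tone, 4)
```

In this toy case the pooled image still shows which half is bright, while the pooled tone is flat, which is one way to picture why a low-frequency reference may be less informative for audio than a low-resolution reference is for images.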