yl4579 / StyleTTS

Official Implementation of StyleTTS

Mandarin support?

lucasjinreal opened this issue

Mandarin support?

I did try training for other languages including Mandarin, Japanese, Hindi etc., though it requires a few changes:

  1. You need to phonemize Chinese into IPA. You can use either https://github.com/bootphon/phonemizer or a look-up table to replace Chinese characters with IPA symbols. The pre-trained text aligner already includes AiShell (a Mandarin dataset), using the following Pinyin-to-IPA conversion table. It may differ slightly from phonemizer, which didn't work for me for Chinese.
ba pˈa
bo pˈwɔ
bai pˈaɪ
bei pˈeɪ
bao pˈaʊ
ban pˈan
ben pˈən
bang pˈɑŋ
beng pˈəŋ
bi pˈi
biao pˈjaʊ
bie pˈjɛ
bian pˈjɛn
bin pˈin
bing pˈiŋ
bu pˈu
pa pʰˈa
po pʰˈwɔ
pai pʰˈaɪ
pei pʰˈeɪ
pao pʰˈaʊ
pou pʰˈoʊ
pan pʰˈan
pen pʰˈən
pang pʰˈɑŋ
peng pʰˈəŋ
pi pʰˈi
piao pʰˈjaʊ
pie pʰˈjɛ
pian pʰˈjɛn
pin pʰˈin
ping pʰˈiŋ
pu pʰˈu
ma mˈa
me mˈɤ
mo mˈwɔ
mai mˈaɪ
mei mˈeɪ
mao mˈaʊ
mou mˈoʊ
man mˈan
men mˈən
mang mˈɑŋ
meng mˈəŋ
mi mˈi
miao mˈjaʊ
mie mˈjɛ
miu mˈju
mian mˈjɛn
min mˈin
ming mˈiŋ
mu mˈu
fa fˈa
fo fˈwɔ
fei fˈeɪ
fou fˈoʊ
fan fˈan
fen fˈən
fang fˈɑŋ
feng fˈəŋ
fu fˈu
da tˈa
de tˈɤ
dai tˈaɪ
dei tˈeɪ
dao tˈaʊ
dou tˈoʊ
dan tˈan
dang tˈɑŋ
deng tˈəŋ
dong tˈʊŋ
di tˈi
diao tˈjaʊ
die tˈjɛ
diu tˈjoʊ
dian tˈjɛn
ding tˈiŋ
du tˈu
duo tˈwɔ
dui tˈweɪ
duan tˈwan
dun tˈwən
ta tʰˈa
te tʰˈɤ
tai tʰˈaɪ
tao tʰˈaʊ
tou tʰˈoʊ
tan tʰˈan
tang tʰˈɑŋ
teng tʰˈəŋ
tong tʰˈʊŋ
ti tʰˈi
tiao tʰˈjaʊ
tie tʰˈjɛ
tian tʰˈjɛn
ting tʰˈiŋ
tu tʰˈu
tuo tʰˈwɔ
tui tʰˈweɪ
tuan tʰˈwan
tun tʰˈwən
na nˈa
ne nˈɤ
nai nˈaɪ
nei nˈeɪ
nao nˈaʊ
nou nˈoʊ
nan nˈan
nen nˈən
nang nˈɑŋ
neng nˈəŋ
nong nˈʊŋ
ni nˈi
niao nˈjaʊ
nie nˈjɛ
niu nˈjoʊ
nian nˈjɛn
nin nˈin
niang nˈiɑŋ
ning nˈiŋ
nu nˈu
nuo nˈwɔ
nuan nˈwan
nü nˈy
nüe nˈyɛ
la lˈa
le lˈɤ
lai lˈaɪ
lei lˈeɪ
lao lˈaʊ
lou lˈoʊ
lan lˈan
lang lˈɑŋ
leng lˈəŋ
long lˈʊŋ
li lˈi
lia lˈja
liao lˈjaʊ
lie lˈjɛ
liu lˈjoʊ
lian lˈjɛn
lin lˈin
liang lˈiɑŋ
ling lˈiŋ
lu lˈu
luo lˈwɔ
luan lˈwan
lun lˈwən
lü lˈy
lüe lˈyɛ
za tsˈa
ze tsˈɤ
zi tsˈɹ
zai tsˈaɪ
zei tsˈeɪ
zao tsˈaʊ
zou tsˈoʊ
zan tsˈan
zen tsˈən
zang tsˈɑŋ
zeng tsˈəŋ
zong tsˈʊŋ
zu tsˈu
zuo tsˈwɔ
zui tsˈweɪ
zuan tsˈwan
zun tsˈwən
ca tsʰˈa
ce tsʰˈɤ
ci tsʰˈɹ
cai tsʰˈaɪ
cao tsʰˈaʊ
cou tsʰˈoʊ
can tsʰˈan
cen tsʰˈən
cang tsʰˈɑŋ
ceng tsʰˈəŋ
cong tsʰˈʊŋ
cu tsʰˈu
cuo tsʰˈwɔ
cui tsʰˈweɪ
cuan tsʰˈwan
cun tsʰˈwən
sa sˈa
se sˈɤ
si sˈɹ
sai sˈaɪ
sao sˈaʊ
sou sˈoʊ
san sˈan
sen sˈən
sang sˈɑŋ
seng sˈeŋ
song sˈʊŋ
su sˈu
suo sˈwɔ
sui sˈweɪ
suan sˈwan
sun sˈwən
zha ʈʂˈa
zhe ʈʂˈɤ
zhi ʈʂˈʐ
zhai ʈʂˈaɪ
zhei ʈʂˈeɪ
zhao ʈʂˈaʊ
zhou ʈʂˈoʊ
zhan ʈʂˈan
zhen ʈʂˈən
zhang ʈʂˈɑŋ
zheng ʈʂˈəŋ
zhong ʈʂˈʊŋ
zhu ʈʂˈu
zhua ʈʂˈwa
zhuo ʈʂˈwɔ
zhuai ʈʂˈwaɪ
zhui ʈʂˈweɪ
zhuan ʈʂˈwan
zhun ʈʂˈwən
zhuang ʈʂˈwɑŋ
cha ʈʂʰˈa
che ʈʂʰˈɤ
chi ʈʂʰˈʐ
chai ʈʂʰˈaɪ
chao ʈʂʰˈaʊ
chou ʈʂʰˈoʊ
chan ʈʂʰˈan
chen ʈʂʰˈən
chang ʈʂʰˈɑŋ
cheng ʈʂʰˈəŋ
chong ʈʂʰˈʊŋ
chu ʈʂʰˈu
chua ʈʂʰˈwa
chuo ʈʂʰˈwɔ
chuai ʈʂʰˈwaɪ
chui ʈʂʰˈweɪ
chuan ʈʂʰˈwan
chun ʈʂʰˈwən
chuang ʈʂʰˈwɑŋ
sha ʂˈa
she ʂˈɤ
shi ʂˈʐ
shai ʂˈaɪ
shei ʂˈeɪ
shao ʂˈaʊ
shou ʂˈoʊ
shan ʂˈan
shen ʂˈən
shang ʂˈɑŋ
sheng ʂˈəŋ
shu ʂˈu
shua ʂˈwa
shuo ʂˈwɔ
shuai ʂˈwaɪ
shui ʂˈweɪ
shuan ʂˈwan
shun ʂˈwən
shuang ʂˈwɑŋ
re ɹˈɤ
ri ɹˈʐ
rao ɹˈaʊ
rou ɹˈoʊ
ran ɹˈan
ren ɹˈən
rang ɹˈɑŋ
reng ɹˈəŋ
rong ɹˈʊŋ
ru ɹˈu
ruo ɹˈwɔ
rui ɹˈweɪ
ruan ɹˈwan
run ɹˈwən
ji tɕˈi
jia tɕˈja
jiao tɕˈjaʊ
jie tɕˈjɛ
jiu tɕˈjoʊ
jian tɕˈjɛn
jin tɕˈin
jiang tɕˈiɑŋ
jing tɕˈiŋ
jiong tɕˈjʊŋ
ju tɕˈy
jue tɕˈyɛ
juan tɕˈyɛn
jun tɕˈyn
qi tɕʰˈi
qia tɕʰˈja
qiao tɕʰˈjaʊ
qie tɕʰˈjɛ
qiu tɕʰˈjoʊ
qian tɕʰˈjɛn
qin tɕʰˈin
qiang tɕʰˈjɑŋ
qing tɕʰˈiŋ
qiong tɕʰˈjʊŋ
qu tɕʰˈy
que tɕʰˈyɛ
quan tɕʰˈyɛn
qun tɕʰˈyn
xi ɕˈi
xia ɕˈja
xiao ɕˈjaʊ
xie ɕˈjɛ
xiu ɕˈjoʊ
xian ɕˈjɛn
xin ɕˈin
xiang ɕˈiɑŋ
xing ɕˈiŋ
xiong ɕˈjʊŋ
xu ɕˈy
xue ɕˈyɛ
xuan ɕˈyɛn
xun ɕˈyn
ga kˈa
ge kˈɤ
gai kˈaɪ
gei kˈeɪ
gao kˈaʊ
gou kˈoʊ
gan kˈan
gen kˈən
gang kˈɑŋ
geng kˈəŋ
gong kˈʊŋ
gu kˈu
gua kˈwa
guo kˈwɔ
guai kˈwaɪ
gui kˈweɪ
guan kˈwan
gun kˈwən
guang kˈwɑŋ
ka kʰˈa
ke kʰˈɤ
kai kʰˈaɪ
kei kʰˈeɪ
kao kʰˈaʊ
kou kʰˈoʊ
kan kʰˈan
ken kʰˈən
kang kʰˈɑŋ
keng kʰˈəŋ
kong kʰˈʊŋ
ku kʰˈu
kua kʰˈwa
kuo kʰˈwɔ
kuai kʰˈwaɪ
kui kʰˈweɪ
kuan kʰˈwan
kun kʰˈwən
kuang kʰˈwɑŋ
ha xˈa
he xˈɤ
hai xˈaɪ
hei xˈeɪ
hao xˈaʊ
hou xˈoʊ
han xˈan
hen xˈən
hang xˈɑŋ
heng xˈəŋ
hong xˈʊŋ
hu xˈu
hua xˈwa
huo xˈwɔ
huai xˈwaɪ
hui xˈweɪ
huan xˈwan
hun xˈwən
huang xˈwɑŋ
a ˈa
o ˈo
e ˈɤ
er ˈɚ
ai ˈaɪ
ei ˈeɪ
ao ˈaʊ
ou ˈoʊ
an ˈan
en ˈən
ang ˈɑŋ
eng ˈəŋ
yi ˈi
ya jˈa
yao jˈaʊ
ye jˈɛ
you jˈoʊ
yan jˈɛn
yin ˈin
yang jˈɑŋ
ying ˈiŋ
yong ˈjʊŋ
wu ˈu
wa wˈa
wo wˈɔ
wai wˈaɪ
wei wˈeɪ
wan wˈan
wen wˈən
wang wˈɑŋ
weng wˈəŋ
yu ˈy
yue ɥˈɛ
yuan ɥˈɛn
yun ɥˈn
hair xˈaɹ
dianr tˈjaɹ
wanr wˈaɹ
nar nˈaɹ
yanr jˈaɹ
huor xˈwɔɹ
duanr tˈwaɹ
lir lˈjɚ
huir xˈwjɚ
zher ʈʂˈɚ
dour xˈɔɹ
weir wˈɚ
kuair kʰˈwaɹ
guanr gˈwɐʴ
shir ʂˈɚ
yuanr ɥˈɚ
jianr tɕˈjɚ
her xˈɚ
jiar tɕˈjaɹ

bor pˈwɔɹ
xir ɕˈɚ
bianr pˈjɚ
fenr fˈɚ
wenr wˈɚ
der tˈɚ
por pʰˈwɔɹ
yuer ɥˈɚ
mingr mˈjɚ
char ʈʂʰˈaɹ
xingr ɕˈjɚ
zhour ʈʂˈoʊɹ
shour ʂˈoʊɹ
ter tʰˈɚ
yingr ˈjɚ
paor pʰˈaɹ
fangr fˈɑɹ
jingr tɕˈjɚ
shur ʂˈuɹ
qunr tɕʰˈyɹ
hur xˈuɹ
miaor mˈjaʊɹ
biaor pˈjaʊɹ
zhengr ʈʂˈɚ
gour kˈoʊɹ
pair pʰˈaɹ
renr ɹˈɚ
gaor kˈaʊɹ
lo lˈoʊ
tuir tʰˈwɚ
huanr xˈwaɹ
genr kˈɚ
nvr nˈyɹ
qianr tɕʰˈjɚ
hangr xˈɑɹ
chenr ʈʂʰˈɚ
den tˈɚ
lar lˈaɹ
niur nˈjoʊɹ
liur lˈjoʊɹ
tunr tʰˈwɚ
lunr lˈwɚ
tour tʰˈoʊɹ
hour xˈoʊɹ
tianr tʰˈjɚ
mianr mˈjɚ
mar mˈaɹ
pianr pʰˈjɚ
maor mˈaʊɹ
cair tsʰˈɚ
far fˈaɹ
shuor ʂˈwɔɹ
kanr kʰˈaɹ
banr pˈaɹ
ger kˈɚ
sher ʂˈɚ
gunr kˈwɚ
beir pˈɚ
chuanr ʈʂʰˈwɚ
bar pˈaɹ
cunr tsʰˈwɚ
tiaor tʰˈjaʊɹ
shuar ʂˈwaɹ
tur tʰˈuɹ
zhaor ʈʂˈaʊɹ
cher ʈʂʰˈɚ
menr mˈɚ
qingr tɕʰˈjɚ
shanr ʂˈaɹ
mor mˈwɔɹ
zhur ʈʂˈuɹ
wangr wˈɑɹ
zhunr ʈʂˈwɚ
zhir ʈʂˈɚ
haor xˈaʊɹ
shuir ʂˈwɚ
guor kˈwɔɹ
zaor tsˈaʊɹ
juanr tɕˈyɚ
jiar tɕˈjaɹ
xiaor ɕˈjaʊɹ
suor sˈwɔɹ
shaor ʂˈaʊɹ
yir ˈɚ
dir tˈɚ
ganr kˈaɹ
duir tˈwɚ
taor tʰˈaʊɹ
lianr lˈjɚ
benr pˈɚ
fanr fˈaɹ
xuer ɕˈyɚ
pur pʰˈuɹ
jinr tɕˈɚ
kour kʰˈoʊɹ
ker kʰˈɚ
mur mˈuɹ
liaor lˈjaʊɹ
juer tɕˈyɚ
your jˈoʊɹ
xianr ɕˈjɚ
quanr tɕʰˈyɚ
yo jˈoʊ
sanr sˈaɹ
zhuor ʈʂˈwɔɹ
tuor tʰˈwɔɹ
naor nˈaʊɹ
dar tˈaɹ
fur fˈuɹ
dunr tˈwɚ
langr lˈɑɹ
dair tˈaɹ
huar xˈwaɹ
yangr jˈɑɹ
  2. You need to add a tone embedding for languages like Chinese and Japanese. For example, replacing the ProsodyPredictor with the following code (i.e. concatenating the prosody embedding with the text embedding):
class ProsodyPredictor(nn.Module):

    def __init__(self, n_prods, prod_embd, style_dim, d_hid, nlayers, dropout=0.1):
        super().__init__() 
        self.embedding = nn.Embedding(n_prods, prod_embd * 2)
        self.text_encoder = DurationEncoder(sty_dim=style_dim, 
                                            d_model=d_hid,
                                            nlayers=nlayers, 
                                            dropout=dropout)

        # the duration LSTM input now includes the concatenated tone (prosody) embedding
        self.lstm = nn.LSTM(d_hid + prod_embd * 2 + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
        self.duration_proj = LinearNorm(d_hid, 1)
        
        self.shared = nn.LSTM(d_hid + prod_embd * 2 + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
        self.F0 = nn.ModuleList()
        self.F0.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
        self.F0.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
        self.F0.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))

        self.N = nn.ModuleList()
        self.N.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
        self.N.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
        self.N.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))
        
        self.F0_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)
        self.N_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)


    def forward(self, texts, prosody, style, text_lengths, alignment, m):
        prosody = self.embedding(prosody)
        texts = torch.cat([texts, prosody], axis=1)
        d = self.text_encoder(texts, style, text_lengths, m)
        
        batch_size = d.shape[0]
        text_size = d.shape[1]
        
        # predict duration
        input_lengths = text_lengths.cpu().numpy()
        x = nn.utils.rnn.pack_padded_sequence(
            d, input_lengths, batch_first=True, enforce_sorted=False)
        
        m = m.to(text_lengths.device).unsqueeze(1)
        
        self.lstm.flatten_parameters()
        x, _ = self.lstm(x)
        x, _ = nn.utils.rnn.pad_packed_sequence(
            x, batch_first=True)
        
        x_pad = torch.zeros([x.shape[0], m.shape[-1], x.shape[-1]])

        x_pad[:, :x.shape[1], :] = x
        x = x_pad.to(x.device)
                
        duration = self.duration_proj(nn.functional.dropout(x, 0.5, training=self.training))
        
        en = (d.transpose(-1, -2) @ alignment)

        return duration.squeeze(-1), en
    
    def F0Ntrain(self, x, s):
        x, _ = self.shared(x.transpose(-1, -2))
        
        F0 = x.transpose(-1, -2)
        for block in self.F0:
            F0 = block(F0, s)
        F0 = self.F0_proj(F0)

        N = x.transpose(-1, -2)
        for block in self.N:
            N = block(N, s)
        N = self.N_proj(N)
        
        return F0.squeeze(1), N.squeeze(1)
    
    def length_to_mask(self, lengths):
        mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
        mask = torch.gt(mask+1, lengths.unsqueeze(1))
        return mask
  3. Modify meldataset.py to return the tones for each IPA symbol and change your train_list.txt to the following format:
data/aishell/train/wav/SSB1100/SSB11000297.wav|$ʈʂˈɑŋxˈweɪˈi ʈʂʰˈujˈɛntˈɤ tˈjɛnˈin jˈoʊʂˈənmˈɤ$|X111114444422 111113333555 44444333 33332222555X|382
data/aishell/train/wav/SSB1567/SSB15670392.wav|$ʂˈʐ fˈuʂˈʐ ʂˈʐlˈɤ ʈʂˈəŋtʰˈi jˈɛˈu$fˈaʈʂˈantˈɤ xˈɤɕˈin tɕʰˈytˈʊŋlˈi$|X444 444444 111444 222223333 44444 11133333555 2221111 111114444444X|274
data/aishell/train/wav/SSB0603/SSB06030228.wav|$xˈwɔtˈjɛn tˈəŋ tˈwɔxˈɑŋjˈɛ tɕˈiɑŋʂˈoʊ pˈwɔtɕˈi$|X333344444 3333 11112222444 1111114444 11112222X|223
data/aishell/train/wav/SSB0588/SSB05880296.wav|$ˈinɥˈɛ ˈiʂˈəŋ sˈwɔˈaɪ$|X111444 441111 3333444X|378
data/aishell/train/wav/SSB0315/SSB03150316.wav|$ʈʂʰˈuɕˈyɛʈʂˈɤ kʰˈɤ ʂˈʐˈjʊŋ tɕˈjaʊʈʂʰˈɑŋtˈɤ ˈiɕˈjɛ tʰˈjaʊʂˈəŋ$|X1111122223333 2222 3334444 444444222222555 441111 4444442222X|241
data/aishell/train/wav/SSB0631/SSB06310452.wav|$ɕˈjɛntsˈaɪ tɕˈitɕʰˈi ɕˈyɛxˈweɪ kˈənɹˈən kˈoʊtʰˈʊŋ$|X4444444444 111144444 222244444 11112222 111111111X|229
data/aishell/train/wav/SSB1935/SSB19350402.wav|$xˈwansˈwɔtˈɤ ʂˈʐ tɕʰˈyʂˈʐ lˈaʊpˈaɹtˈɤsˈwən tsˈɹ$|X444443333555 444 44444444 3333444455511111 5555X|345
data/aishell/train/wav/SSB1203/SSB12030292.wav|$pˈiɹˈu tsˈweɪtɕˈin sˈannˈjɛn tɕˈiŋˈiŋ ʈʂˈwɑŋkʰˈwɑŋ lˈiɑŋxˈaʊtˈəŋ$|X333222 44444444444 111122222 11111222 444444444444 2222222223333X|377
data/aishell/train/wav/SSB1024/SSB10240312.wav|$xˈaˈɚ pˈinʂˈʐ tˈiˈu sˈɹʈʂˈʊŋɕˈyɛtˈɤ ʈʂˈaʊpʰˈaɪ ˈy pʰˈɑŋpˈjɛn ʂˈɑŋxˈu ɕˈiɑŋpˈi$|X11133 1111444 44433 444111112222555 1111155555 33 2222211111 1111444 11111333X|231
data/jvs_ver1/jvs088/parallel100/wav24kHz16bit/VOICEACTRESS100_037.wav|$kˈomˈʲɯːɴ ɯˈa $ sˈeːnˈɯ gˈaɯˈa tˈo $ esˈo ɴ nˈɯ kˈaɯˈa nˈo $ gˈoːɽˈʲɯː tɕˈitˈeɴ tˈo nˈaʔ tˈe iɽˈɯ$|XLLLHHHHLL LLL X LLLHHHH LLLLLL LLL X LHHH H HHH LLLLLL LLL X LLLHHHHHH HHHHLLLL LLL HHHL LLL LHHHX|88
data/aishell/train/wav/SSB0671/SSB06710188.wav|$tɕˈiɑŋɕˈjɛn nˈanfˈan ˈjʊŋxˈʊŋ fˈɑŋɕˈin lˈiɑŋjˈoʊ ʈʂˈʊŋɕˈintˈjɛn$|X44444444444 22222222 33332222 44441111 222222222 11111111144444X|363
data/aishell/train/wav/SSB0380/SSB03800184.wav|$kʰˈɤ ɥˈɛxˈan tɕˈjoʊʂˈʐ tʰˈiŋpˈu tɕˈintɕʰˈy$|X3333 1114444 444444444 11111222 4444444444X|323
data/aishell/train/wav/SSB0760/SSB07600247.wav|$tˈɑŋɹˈan wˈɔ ɕˈjɛntsˈaɪ ˈitɕˈiŋ mˈeɪjˈoʊ ʈʂˈɤkˈɤ tsˈɹkˈɤ tsˈaɪkˈən nˈiʂˈwɔ ʈʂˈɤkˈɤ xˈwaɹ$|X11112222 333 4444444444 3311111 22223333 4444444 1111222 444441111 3331111 4444444 44444X|237
data/aishell/train/wav/SSB0016/SSB00160083.wav|$pˈaʂˈʐˈutˈjɛn lˈjoʊlˈiŋtɕʰˈi$|X1112222233333 44444222211111X|245

where $ and X mark the SOS/EOS tokens of the text and tone sequences, respectively (a minimal parsing sketch follows).
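
For reference, here is a minimal sketch (not from the repo) of how a line in this format could be parsed; the field order is taken from the examples above, and the per-character alignment of the text and tone strings is an assumption based on those examples.

def parse_train_line(line):
    # assumed format: path | $IPA text$ | Xtone stringX | speaker id
    path, text, tones, speaker_id = line.strip().split("|")
    # in the examples above, each IPA character carries exactly one tone label
    assert len(text) == len(tones)
    return path, text, tones, int(speaker_id)

line = ("data/aishell/train/wav/SSB0588/SSB05880296.wav|"
        "$ˈinɥˈɛ ˈiʂˈəŋ sˈwɔˈaɪ$|X111444 441111 3333444X|378")
path, text, tones, spk = parse_train_line(line)
print(len(text), len(tones), spk)  # 23 23 378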

I'll leave this issue open for someone to fork the repo and modify it for Mandarin and Japanese support. I'm unfortunately too busy to work on it now.

For Japanese, you can do the same thing:

The conversion table from kana to IPA is the following (again, phonemizer didn't work for me):

from collections import OrderedDict

kana_mapper = OrderedDict([
    ("ゔぁ","bˈa"),
    ("ゔぃ","bˈi"),
    ("ゔぇ","bˈe"),
    ("ゔぉ","bˈo"),
    ("ゔゃ","bˈʲa"),
    ("ゔゅ","bˈʲɯ"),
    ("ゔゃ","bˈʲa"),
    ("ゔょ","bˈʲo"),

    ("ゔ","bˈɯ"),

    ("あぁ","aː"),
    ("いぃ","iː"),
    ("いぇ","je"),
    ("いゃ","ja"),
    ("うぅ","ɯː"),
    ("えぇ","eː"),
    ("おぉ","oː"),
    ("かぁ","kˈaː"),
    ("きぃ","kˈiː"),
    ("くぅ","kˈɯː"),
    ("くゃ","kˈa"),
    ("くゅ","kˈʲɯ"),
    ("くょ","kˈʲo"),
    ("けぇ","kˈeː"),
    ("こぉ","kˈoː"),
    ("がぁ","gˈaː"),
    ("ぎぃ","gˈiː"),
    ("ぐぅ","gˈɯː"),
    ("ぐゃ","gˈʲa"),
    ("ぐゅ","gˈʲɯ"),
    ("ぐょ","gˈʲo"),
    ("げぇ","gˈeː"),
    ("ごぉ","gˈoː"),
    ("さぁ","sˈaː"),
    ("しぃ","ɕˈiː"),
    ("すぅ","sˈɯː"),
    ("すゃ","sˈʲa"),
    ("すゅ","sˈʲɯ"),
    ("すょ","sˈʲo"),
    ("せぇ","sˈeː"),
    ("そぉ","sˈoː"),
    ("ざぁ","zˈaː"),
    ("じぃ","dʑˈiː"),
    ("ずぅ","zˈɯː"),
    ("ずゃ","zˈʲa"),
    ("ずゅ","zˈʲɯ"),
    ("ずょ","zˈʲo"),
    ("ぜぇ","zˈeː"),
    ("ぞぉ","zˈeː"),
    ("たぁ","tˈaː"),
    ("ちぃ","tɕˈiː"),
    ("つぁ","tsˈa"),
    ("つぃ","tsˈi"),
    ("つぅ","tsˈɯː"),
    ("つゃ","tɕˈa"),
    ("つゅ","tɕˈɯ"),
    ("つょ","tɕˈo"),
    ("つぇ","tsˈe"),
    ("つぉ","tsˈo"),
    ("てぇ","tˈeː"),
    ("とぉ","tˈoː"),
    ("だぁ","dˈaː"),
    ("ぢぃ","dʑˈiː"),
    ("づぅ","dˈɯː"),
    ("づゃ","zˈʲa"),
    ("づゅ","zˈʲɯ"),
    ("づょ","zˈʲo"),
    ("でぇ","dˈeː"),
    ("どぉ","dˈoː"),
    ("なぁ","nˈaː"),
    ("にぃ","nˈiː"),
    ("ぬぅ","nˈɯː"),
    ("ぬゃ","nˈʲa"),
    ("ぬゅ","nˈʲɯ"),
    ("ぬょ","nˈʲo"),
    ("ねぇ","nˈeː"),
    ("のぉ","nˈoː"),
    ("はぁ","hˈaː"),
    ("ひぃ","çˈiː"),
    ("ふぅ","ɸˈɯː"),
    ("ふゃ","ɸˈʲa"),
    ("ふゅ","ɸˈʲɯ"),
    ("ふょ","ɸˈʲo"),
    ("へぇ","hˈeː"),
    ("ほぉ","hˈoː"),
    ("ばぁ","bˈaː"),
    ("びぃ","bˈiː"),
    ("ぶぅ","bˈɯː"),
    ("ふゃ","ɸˈʲa"),
    ("ぶゅ","bˈʲɯ"),
    ("ふょ","ɸˈʲo"),
    ("べぇ","bˈeː"),
    ("ぼぉ","bˈoː"),
    ("ぱぁ","pˈaː"),
    ("ぴぃ","pˈiː"),
    ("ぷぅ","pˈɯː"),
    ("ぷゃ","pˈʲa"),
    ("ぷゅ","pˈʲɯ"),
    ("ぷょ","pˈʲo"),
    ("ぺぇ","pˈeː"),
    ("ぽぉ","pˈoː"),
    ("まぁ","mˈaː"),
    ("みぃ","mˈiː"),
    ("むぅ","mˈɯː"),
    ("むゃ","mˈʲa"),
    ("むゅ","mˈʲɯ"),
    ("むょ","mˈʲo"),
    ("めぇ","mˈeː"),
    ("もぉ","mˈoː"),
    ("やぁ","jˈaː"),
    ("ゆぅ","jˈɯː"),
    ("ゆゃ","jˈaː"),
    ("ゆゅ","jˈɯː"),
    ("ゆょ","jˈoː"),
    ("よぉ","jˈoː"),
    ("らぁ","ɽˈaː"),
    ("りぃ","ɽˈiː"),
    ("るぅ","ɽˈɯː"),
    ("るゃ","ɽˈʲa"),
    ("るゅ","ɽˈʲɯ"),
    ("るょ","ɽˈʲo"),
    ("れぇ","ɽˈeː"),
    ("ろぉ","ɽˈoː"),
    ("わぁ","ɯˈaː"),
    ("をぉ","oː"),

    ("う゛","bˈɯ"),
    ("でぃ","dˈi"),
    ("でぇ","dˈeː"),
    ("でゃ","dˈʲa"),
    ("でゅ","dˈʲɯ"),
    ("でょ","dˈʲo"),
    ("てぃ","tˈi"),
    ("てぇ","tˈeː"),
    ("てゃ","tˈʲa"),
    ("てゅ","tˈʲɯ"),
    ("てょ","tˈʲo"),
    ("すぃ","sˈi"),
    ("ずぁ","zˈɯa"),
    ("ずぃ","zˈi"),
    ("ずぅ","zˈɯ"),
    ("ずゃ","zˈʲa"),
    ("ずゅ","zˈʲɯ"),
    ("ずょ","zˈʲo"),
    ("ずぇ","zˈe"),
    ("ずぉ","zˈo"),
    ("きゃ","kˈʲa"),
    ("きゅ","kˈʲɯ"),
    ("きょ","kˈʲo"),
    ("しゃ","ɕˈʲa"),
    ("しゅ","ɕˈʲɯ"),
    ("しぇ","ɕˈʲe"),
    ("しょ","ɕˈʲo"),
    ("ちゃ","tɕˈa"),
    ("ちゅ","tɕˈɯ"),
    ("ちぇ","tɕˈe"),
    ("ちょ","tɕˈo"),
    ("とぅ","tˈɯ"),
    ("とゃ","tˈʲa"),
    ("とゅ","tˈʲɯ"),
    ("とょ","tˈʲo"),
    ("どぁ","dˈoa"),
    ("どぅ","dˈɯ"),
    ("どゃ","dˈʲa"),
    ("どゅ","dˈʲɯ"),
    ("どょ","dˈʲo"),
    ("どぉ","dˈoː"),
    ("にゃ","nˈʲa"),
    ("にゅ","nˈʲɯ"),
    ("にょ","nˈʲo"),
    ("ひゃ","çˈʲa"),
    ("ひゅ","çˈʲɯ"),
    ("ひょ","çˈʲo"),
    ("みゃ","mˈʲa"),
    ("みゅ","mˈʲɯ"),
    ("みょ","mˈʲo"),
    ("りゃ","ɽˈʲa"),
    ("りぇ","ɽˈʲe"),
    ("りゅ","ɽˈʲɯ"),
    ("りょ","ɽˈʲo"),
    ("ぎゃ","gˈʲa"),
    ("ぎゅ","gˈʲɯ"),
    ("ぎょ","gˈʲo"),
    ("ぢぇ","dʑˈe"),
    ("ぢゃ","dʑˈa"),
    ("ぢゅ","dʑˈɯ"),
    ("ぢょ","dʑˈo"),
    ("じぇ","dʑˈe"),
    ("じゃ","dʑˈa"),
    ("じゅ","dʑˈɯ"),
    ("じょ","dʑˈo"),
    ("びゃ","bˈʲa"),
    ("びゅ","bˈʲɯ"),
    ("びょ","bˈʲo"),
    ("ぴゃ","pˈʲa"),
    ("ぴゅ","pˈʲɯ"),
    ("ぴょ","pˈʲo"),
    ("うぁ","ɯˈa"),
    ("うぃ","ɯˈi"),
    ("うぇ","ɯˈe"),
    ("うぉ","ɯˈo"),
    ("うゃ","ɯˈʲa"),
    ("うゅ","ɯˈʲɯ"),
    ("うょ","ɯˈʲo"),
    ("ふぁ","ɸˈa"),
    ("ふぃ","ɸˈi"),
    ("ふぅ","ɸˈɯ"),
    ("ふゃ","ɸˈʲa"),
    ("ふゅ","ɸˈʲɯ"),
    ("ふょ","ɸˈʲo"),
    ("ふぇ","ɸˈe"),
    ("ふぉ","ɸˈo"),

    ("あ","a"),
    ("い","i"),
    ("う","ɯ"),
    ("え","e"),
    ("お","o"),
    ("か","kˈa"),
    ("き","kˈi"),
    ("く","kˈɯ"),
    ("け","kˈe"),
    ("こ","kˈo"),
    ("さ","sˈa"),
    ("し","ɕˈi"),
    ("す","sˈɯ"),
    ("せ","sˈe"),
    ("そ","sˈo"),
    ("た","tˈa"),
    ("ち","tɕˈi"),
    ("つ","tsˈɯ"),
    ("て","tˈe"),
    ("と","tˈo"),
    ("な","nˈa"),
    ("に","nˈi"),
    ("ぬ","nˈɯ"),
    ("ね","nˈe"),
    ("の","nˈo"),
    ("は","hˈa"),
    ("ひ","çˈi"),
    ("ふ","ɸˈɯ"),
    ("へ","hˈe"),
    ("ほ","hˈo"),
    ("ま","mˈa"),
    ("み","mˈi"),
    ("む","mˈɯ"),
    ("め","mˈe"),
    ("も","mˈo"),
    ("ら","ɽˈa"),
    ("り","ɽˈi"),
    ("る","ɽˈɯ"),
    ("れ","ɽˈe"),
    ("ろ","ɽˈo"),
    ("が","gˈa"),
    ("ぎ","gˈi"),
    ("ぐ","gˈɯ"),
    ("げ","gˈe"),
    ("ご","gˈo"),
    ("ざ","zˈa"),
    ("じ","dʑˈi"),
    ("ず","zˈɯ"),
    ("ぜ","zˈe"),
    ("ぞ","zˈo"),
    ("だ","dˈa"),
    ("ぢ","dʑˈi"),
    ("づ","zˈɯ"),
    ("で","dˈe"),
    ("ど","dˈo"),
    ("ば","bˈa"),
    ("び","bˈi"),
    ("ぶ","bˈɯ"),
    ("べ","bˈe"),
    ("ぼ","bˈo"),
    ("ぱ","pˈa"),
    ("ぴ","pˈi"),
    ("ぷ","pˈɯ"),
    ("ぺ","pˈe"),
    ("ぽ","pˈo"),
    ("や","jˈa"),
    ("ゆ","jˈɯ"),
    ("よ","jˈo"),
    ("わ","ɯˈa"),
    ("ゐ","i"),
    ("ゑ","e"),
    ("ん","ɴ"),
    ("っ","ʔ"),
    ("ー","ː"),

    ("ぁ","a"),
    ("ぃ","i"),
    ("ぅ","ɯ"),
    ("ぇ","e"),
    ("ぉ","o"),
    ("ゎ","ɯˈa"),
    ("ぉ","o"),

    ("を","o")
])

nasal_sound = OrderedDict([
    # before m, p, b
    ("ɴm","mm"),
    ("ɴb", "mb"),
    ("ɴp", "mp"),
    
    # before k, g
    ("ɴk","ŋk"),
    ("ɴg", "ŋg"),
    
    # before t, d, n, s, z, ɽ
    ("ɴt","nt"),
    ("ɴd", "nd"),
    ("ɴn","nn"),
    ("ɴs", "ns"),
    ("ɴz","nz"),
    ("ɴɽ", "nɽ"),
    
    ("ɴɲ", "ɲɲ"),
    
])

def hiragana2IPA(text):
    # multi-kana entries come first in kana_mapper, so digraphs are replaced before single kana
    for k, v in kana_mapper.items():
        text = text.replace(k, v)

    for k, v in nasal_sound.items():
        text = text.replace(k, v)
        
    return text
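
A quick usage check of the function above (my own example; the expected output assumes the kana_mapper and nasal_sound tables exactly as listed, and the conversion is purely kana-level):

print(hiragana2IPA("しんぶん"))  # expected: ɕˈimbˈɯɴ (the nasal_sound rule turns ɴb into mb)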

You also need to add the intonations for each word with Open JTalk.

data/jvs_ver1/jvs020/falset10/wav24kHz16bit/VOICEACTRESS100_005.wav|$ɕˈiɽˈɯbˈaː sˈaː ɸˈaː ɕˈʲɯːgˈekˈi dʑˈikˈeɴ mˈadˈe nˈi $ ɽˈitɕˈaːzˈɯ ɯˈa $ tɕˈiːmˈɯ mˈeː tˈo tˈomˈonˈi $ kˈokˈɯsˈai tˈekˈi nˈi sˈɯːpˈaː çˈiːɽˈoː$ ojˈobˈi $ jˈɯːmˈeːdʑˈiɴ tˈo ɕˈi tˈe $ nˈiɴtɕˈi sˈa ɽˈe tˈe iɽˈɯ$|XHHHLLLLLLL LLLH HHHH HHHHHHHHHHH HHHHLLLL LLLLLL LLL X HHHLLLLLLLL LLL X LLLLHHHH LLLL LLL LLLHHHHHH X LLLHHHHHHH LLLLLL LLL LLLHHHHH HHHLLLLLX HLLLLLL X LLLHHHHLLLLLL LLL LLL HHH X HHHLLLLL LLL HHH HHH LHHHX
data/jvs_ver1/jvs081/parallel100/wav24kHz16bit/VOICEACTRESS100_078.wav|$ɸˈʲoːgˈeɴ gˈʲoːɽˈetsˈɯ nˈo ɕˈiɸˈʲoː ɸˈʲoː o$ bˈɯɴɕˈi nˈo tˈaiɕˈʲoː sˈeː o aɽˈaɯˈasˈɯ $ tˈeɴ gˈɯɴ nˈo ɕˈiɸˈʲoː ɸˈʲoː o mˈotɕˈiː tˈe $ sˈɯɴdˈe jˈakˈɯ ɸˈʲoːgˈeɴ e bˈɯɴkˈai sˈɯɽˈɯ$|XLLLLHHHHH HHHHLLLLLLLL LLL LLLHHHHH HHHHH HX HHHLLLL LLL LLLHHHHHH HHHH H LHHHHHHLLL X LLLH LLLL LLL LLLHHHHH HHHHH H LLLHHHHH LLL X LLLHHHH LLLHHH HHHHHHHHL L LLLHHHHH LLLHHHX

where L and H represent low tone and high tone, respectively.

data/VCTK-Corpus/VCTK-Corpus/wav24/p275/p275_380.wav|$ɪts ɐ ɹˈiːəl pɹˈɑːbləm$$|XXXX X XXXXXX XXXXXXXXXXX|155
Hello, I want to know what "XXXX X XXXXXX XXXXXXXXXXX" and 155 mean.
Thanks!

@c9412600 That was a typo that should not have been included; I have fixed it. 155 is the speaker ID (never used during training, just for clarification), and X means no intonation (in contrast to 1, 2, 3, 4, 5, which represent the actual tones in Mandarin).

@yl4579 Thank you for sharing so many ideas! Using the AiShell-3 dataset, I can synthesize normal audio, and it sounds good.

But when generating speech for an unseen speaker, the timbre doesn't sound like the original. Is there any way to improve the timbre similarity for unseen speakers?

@yl4579 I would like to ask whether any changes are needed for Vietnamese.

@CONGLUONG12 I don't think there is any change needed for Vietnamese. You only need to find a conversion table between chu quoc ngu and IPA (maybe phonemizer works for this case?) and label the tones (there should be six of them, so n_prods = 6) as in Mandarin.

I have some questions about how to do inference in Mandarin.
First, I am not sure if this is right for Mandarin:

_pad = "$"
_punctuation = ';:,.!?¡¿—…"«»“” '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = "ɹʂʴʰɛɤɔʈɚˈɕɥɐɑɪŋʐʊə"

Second:
ps = global_phonemizer.phonemize([text])
Do I need to add tones in ps, like
'$pˈu ʈʂˈʐ tˈaʊ nˈi ʂˈwɔ tˈɤ ʂˈʐ pˈu ʂˈʐ wˈɔ ɕˈiɑŋ tˈɤ$|X444 1111 4444 333 1111 555 444 444 444 333 33333 555X'

If my ASR model was trained with pinyin (like 'wo3 shi4 shui2'), not IPA, is it OK for inference?
THANK YOU for the great work!
@yl4579

I use pinyin for the ASR model and StyleTTS, and can generate normal, good-sounding results.

Could you share some details, like:
how to set up the inference file

_pad = "$"
_punctuation = ';:,.!?¡¿—…"«»“” '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = "1234"
Is this right?
And did you need to replace the ProsodyPredictor class as the author described?
@liuhuang31

For Mandarin, I didn't use IPA phonemes; I used pinyin initials and finals as the phonemes.

  1. You can use pypinyin to generate pinyin.
  2. Use the _initials and _finals from pypinyin; the symbol set is as below (a rough usage sketch follows the snippet):

_pause = ["sil", "eos", "sp", ...]
_initials = ["b", "c","ch", ...]
_finals = ["a", "ai", ...]
_tones = ["1", "2", "3", "4", "5"]
symbols = _pause + _initials + [i + j for i in _finals for j in _tones]
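
As a hedged sketch of how this could look in practice (my own code, not liuhuang31's pipeline; it assumes a recent pypinyin with Style.TONE3 and a hand-written initials list):

from pypinyin import lazy_pinyin, Style

# standard pinyin initials, with the two-letter ones first so they match before "z", "c", "s"
_initials = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
             "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

def split_syllable(syl):
    # split a TONE3-style syllable, e.g. "xue2" -> ("x", "ue", "2"); neutral tone is "5"
    tone = syl[-1] if syl[-1].isdigit() else "5"
    body = syl[:-1] if syl[-1].isdigit() else syl
    for ini in _initials:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone  # zero-initial syllable, e.g. "an"

syllables = lazy_pinyin("去上学校", style=Style.TONE3, neutral_tone_with_five=True)
print([split_syllable(s) for s in syllables])
# expected: [('q', 'u', '4'), ('sh', 'ang', '4'), ('x', 'ue', '2'), ('x', 'iao', '4')]
# (citation-form tones from pypinyin; tone values elsewhere in this thread may reflect extra post-processing)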

Thank you very much!
Did you change class ProsodyPredictor(nn.Module) in the model code?
@liuhuang31

Sorry, I forgot to reply. I didn't change class ProsodyPredictor(nn.Module) in the model code.

Hi liuhuang31,
How did you train the Chinese pinyin PL-BERT model? Did you treat the ShengMu, YunMu, and YinDiao as separate phonemes, or the whole pinyin syllable as a single phoneme?
Also, how did you get so much annotated Chinese text corpus? As far as I know, pypinyin-generated pinyin is error-prone, so I don't think it is a good way to build the PL-BERT corpus.

Hi JohnHerry,
(1) I didn't train a phoneme-level BERT model. In the snippet below, ShengMu corresponds to _initials, YunMu to _finals, and YinDiao to _tones.
The text features are: phoneme, prosody, and tone.
The phoneme features treat the ShengMu and YunMu as separate phonemes.

For example, given the text “去上学校”:
First, generate its prosody: “去上学校” -> “去#1上#1学校#4.”
Second, use pypinyin to generate the Chinese pinyin: “去#1上#1学校#4.” -> “去#1上#1学校#4.|qu5 shang5 xue3 xiao3”
Third, generate its text features (phoneme, prosody, tone): “去#1上#1学校#4.|qu5 shang5 xue3 xiao3” -> "q u sh ang x ue x iao|#1 #1 #1 #1 #0 #0 #4 #4|5 5 5 5 3 3 3 3". Of course, you should convert the phonemes, prosody, and tones to IDs.

_pause = ["sil", "eos", "sp", ...]
_initials = ["b", "c","ch", ...]
_finals = ["a", "ai", ...]
_tones = ["1", "2", "3", "4", "5"]
symbols = _pause + _initials + [i + j for i in _finals for j in _tones]

(2) As for me, I just used the open AiShell-3 dataset (the zhvoice dataset can also be used, but its quality is very poor).

(3) Yes, pypinyin-generated pinyin is error-prone, but in my view, if the dataset is big enough, the errors will average out. Also, in my experiment with the AiShell-3 dataset, I can generate normal audio that sounds not bad.

Thanks for the detailed information. It helps me a lot.

Hello, pypinyin does not perform well in some cases, so I use another phoneme set, different from pypinyin. In that case, how should I prepare the file lists, and how do I train or fine-tune?

Hi zdj97,

Whether it is pypinyin or any other phoneme set, its role is to convert text into phonemes, so just use the new phoneme set. And remember to re-train the ASR model with the new phoneme set.

I did try training for other languages including Mandarin, Japanese, Hindi etc., though it requires a few changes:

Hello, what tools did you use to convert the LJSpeech and LibriTTS databases to IPA?

Hi, I did not convert LJSpeech or VCTK to IPA, so I did not use the pretrained models in these scripts.
I trained the ASR and pitch models from scratch using my own phoneme set, and the results are not done yet.
When the models are done, I will comment here.

Hi @yl4579 ,

A stupid question: how can I convert

SSB11000297|zhang1 hui4 yi2 % chu1 yan3 de5 % dian4 yin3 % you3 shen2 me5 $|

to

SSB11000297.wav|$ʈʂˈɑŋxˈweɪˈi ʈʂʰˈujˈɛntˈɤ tˈjɛnˈin jˈoʊʂˈənmˈɤ$|X111114444422 111113333555 44444333 33332222555X|

Is the conversion done by meldataset.py during training or do I need to write a preprocessor to convert it before training?

Thanks

@yihuitang You need to code it yourself because meldataset.py was written for English support only. I have provided the conversion table, so it should not be difficult to convert the data to the desired format. Unfortunately I couldn't find the exact code I used to generate the dataset, but all you need to do is split the text by spaces, get the number (tone), convert the pinyin to IPA using the table I provided, and repeat the number (tone) N times, where N is the number of IPA characters for that syllable.
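
A minimal sketch of that recipe (my own; PINYIN_TO_IPA stands in for the full Pinyin-to-IPA table posted at the top of this issue, of which only the entries needed for this example are shown):

# a few entries copied from the table above; the real dict would contain all of them
PINYIN_TO_IPA = {"zhang": "ʈʂˈɑŋ", "hui": "xˈweɪ", "yi": "ˈi", "chu": "ʈʂʰˈu", "yan": "jˈɛn", "de": "tˈɤ"}

def aishell_to_ipa(transcript):
    # "zhang1 hui4 yi2 % chu1 yan3 de5" -> ("$...$", "X...X"); "%" separates words in AiShell
    ipa_words, tone_words = [], []
    for word in transcript.split("%"):
        ipa, tones = "", ""
        for syl in word.split():
            pinyin, tone = syl[:-1], syl[-1]     # the trailing digit is the tone
            phones = PINYIN_TO_IPA[pinyin]
            ipa += phones
            tones += tone * len(phones)          # repeat the tone once per IPA character
        ipa_words.append(ipa)
        tone_words.append(tones)
    return "$" + " ".join(ipa_words) + "$", "X" + " ".join(tone_words) + "X"

print(aishell_to_ipa("zhang1 hui4 yi2 % chu1 yan3 de5"))
# ('$ʈʂˈɑŋxˈweɪˈi ʈʂʰˈujˈɛntˈɤ$', 'X111114444422 111113333555X')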

@yl4579 thanks for your prompt reply. I'll start with the code for converting the format.
Should there be a space between IPA symbols? Taking zhang1 hui4 as an example, which of the following IPA representations is the correct or best one?

1. ʈʂˈɑŋxˈweɪˈ (no space)
2. ʈʂˈɑŋ xˈweɪˈ (space between words)
3. ʈʂˈ ɑŋ xˈ weɪˈ (space between words and space between ShengMu and YunMu)
4. ʈ ʂˈ ɑ ŋ xˈ w e ɪˈ (space between each IPA)

@yihuitang In my case, I separated words because I used a PL-BERT trained jointly on Chinese, Japanese, and English, and word boundaries were used when pre-training the PL-BERT, but you may not need to do that. If you do not plan to use any language model, or if your language model is at the character level (for example, your grapheme in PL-BERT is the character instead of the word), I don't think there is any difference.

Note that words were separated by "%" in the AiShell dataset, so "zhang1 hui4 yi2" is one word, and "chu1 yan3 de5" is another word. This is why they were converted to "ʈʂˈɑŋxˈweɪˈi ʈʂʰˈujˈɛntˈɤ" in my case, where the only space is between these two words, not syllables.

@yl4579 , Thanks for your guidance. I do plan to use your PL-BERT later if I can successfully implement Mandarin in StyleTTS.

I would also like to train StyleTTS with customized data, which has no "%" in the dataset. So for the customized dataset with PL-BERT, I should use option 1 (no space). Am I right?

1. ʈʂˈɑŋxˈweɪˈ (no space)

Hi @yl4579 , a quick update:

I've created a script to convert pinyins to IPAs and get filelists in the desired format for Mandarin. Here are train and val lists for aishell3 dataset:
train_list_aishell3.txt
val_list_aishell3.txt

Class ProsodyPredictor is also updated with your code above. And then I tried to update meldataset.py but got stuck.

  1. What should n_prods and prod_embd be? Should they be stored in config.yml?
  2. I can get the tone for each IPA symbol, but where and how should I use the tone?
class FilePathDataset(torch.utils.data.Dataset):
    def __init__(self,
                 data_list,
                 sr=24000,
                 data_augmentation=False,
                 validation=False,
                 ):

        spect_params = SPECT_PARAMS
        mel_params = MEL_PARAMS

        #_data_list = [l[:-1].split('|') for l in data_list]
        _data_list = [l.split('|') for l in data_list]
        self.data_list = [data if len(data) == 4 else (*data, 0) for data in _data_list]
        self.text_cleaner = TextCleaner()
        self.sr = sr

        self.to_melspec = torchaudio.transforms.MelSpectrogram(**MEL_PARAMS)

        self.mean, self.std = -4, 4
        self.data_augmentation = data_augmentation and (not validation)
        self.max_mel_length = 192

#         self.global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True,  with_stress=True)

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, idx):
        data = self.data_list[idx]
        path = data[0]

        wave, text_tensor, tone_tensor, speaker_id = self._load_tensor(data)

        mel_tensor = preprocess(wave).squeeze()

        acoustic_feature = mel_tensor.squeeze()
        length_feature = acoustic_feature.size(1)
        acoustic_feature = acoustic_feature[:, :(length_feature - length_feature % 2)]

        return speaker_id, acoustic_feature, text_tensor, path

    def _load_tensor(self, data):
        wave_path, text, tone, speaker_id = data
        speaker_id = int(speaker_id)
        wave, sr = sf.read(wave_path)
        if wave.shape[-1] == 2:
            wave = wave[:, 0].squeeze()
        if sr != 24000:
            wave = librosa.resample(wave, sr, 24000)
            print(wave_path, sr)

        wave = np.concatenate([np.zeros([5000]), wave, np.zeros([5000])], axis=0)

        text = self.text_cleaner(text)
        tone = self.text_cleaner(tone)

        text.insert(0, 0)
        text.append(0)

        tone.insert(0, 0)
        tone.append(0)

        text = torch.LongTensor(text)
        tone = torch.LongTensor(tone)

        return wave, text, tone, speaker_id

    def _load_data(self, data):
        wave, text_tensor, tone, speaker_id = self._load_tensor(data)
        mel_tensor = preprocess(wave).squeeze()

        mel_length = mel_tensor.size(1)
        if mel_length > self.max_mel_length:
            random_start = np.random.randint(0, mel_length - self.max_mel_length)
            mel_tensor = mel_tensor[:, random_start:random_start + self.max_mel_length]

        return mel_tensor, speaker_id

@yl4579 I have the same question as @yihuitang. What is a reasonable prod_embd? And how should it be used in training?

@yihuitang n_prods should be the number of tones (e.g., for Mandarin Chinese it should be 5, for Japanese it should be 2, for Cantonese it should be 6).
The tones are represented as indices, one-hot encoded, and converted to the prosody embedding by self.embedding in the modified ProsodyPredictor.
@yihuitang I used 128. It shouldn't matter that much, to be honest.
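
To make the tone-to-embedding step concrete, here is a small self-contained sketch (my own; the tone-to-index mapping and the choice of 7 indices, i.e. the five Mandarin tones plus the space and X markers that appear in the tone strings above, are assumptions, while prod_embd = 128 comes from the comment above):

import torch
import torch.nn as nn

# hypothetical index map for a Chinese-only setup: 5 Mandarin tones + space + X, i.e. n_prods = 7
tone_to_id = {"1": 0, "2": 1, "3": 2, "4": 3, "5": 4, " ": 5, "X": 6}

n_prods, prod_embd = 7, 128
embedding = nn.Embedding(n_prods, prod_embd * 2)   # same shape as self.embedding above

tone_str = "X111444 441111 3333444X"               # one tone label per IPA character
tone_ids = torch.LongTensor([[tone_to_id[c] for c in tone_str]])
prosody = embedding(tone_ids)                      # shape (1, len(tone_str), prod_embd * 2)
print(prosody.shape)                               # torch.Size([1, 23, 256])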

This may be somewhat outside the scope of this project, but is there any good text-to-prosody solution for Mandarin?

Most text-prosody models are BERT-based, e.g. BERT+Linear, BERT+BiLSTM+CRF, etc.
But in our experiments, those models are not very good for Mandarin; in particular the #2 label (prosodic phrase boundary) is very hard to predict. We have also tried other methods, and none of them is good enough. I think it is because the #2 prosody label is sparse in the training data, which leads to slower convergence. We tried weighting its loss but got no improvement. We also tried a cascade model structure like in "A Mandarin Prosodic Boundary Prediction Model Based on Multi-Task Learning", but it is not very effective either.

@JohnHerry I'm a little confused about what the #2 "prosody segment" is. I guess the issue (#2) only involves multilingual support for phonemization, so I'm not sure why it is related to prosody prediction. Do you mean the 2nd tone in Mandarin? From what I found online: "With regard to lexical tone, the falling Tone 4 is the most frequent (34.9%), followed by the stable high Tone 1 (24.8%) and rising Tone 2 (23.9%). The low-dipping Tone 3 is the least frequent tone in our corpus (16.4%)." So I guess they are fairly evenly distributed.

@yl4579 Where should the one-hot encoding happen? In the modified ProsodyPredictor or in the modified meldataset.py?

No, they are not tones in the pinyin phonemes; they are text prosody labels.
#1, #2, and #3 are prosody tags on the text, where:
#1 is Prosodic Word (PW),
#2 is Prosodic Phrase (PPH),
and #3 is Intonational Phrase (IPH),
e.g. 玄奘#1为保存#2由#1天竺#1经#1丝绸之路#2带回#1长安的#1经卷#1佛像#3主持#1修建了#1大雁塔#4
They can roughly be seen as speech pause levels, where #3 gets a longer pause than #2. Text prosody labels can help generate better acoustic prosody in the synthesized speech.
I have read your first answer in this issue; in the code of the ProsodyPredictor class, I think the prosody argument of the forward function is somewhat like that text prosody.

Are there any samples from the Mandarin corpus?

Did you insert a blank index between the ShengMu and YunMu, such as "n i h ao" --> [0 x 0 x 0 x 0 x 0], when you trained the ASR model?
@liuhuang31

@hdmjdp Hi, I didn't insert blanks between the ShengMu and YunMu.

Thanks. So when you trained the ASR model, you did not insert any blanks?

@liuhuang31 As you said, how do you set the CTC loss config?
blank_index = train_dataloader.dataset.text_cleaner.word_index_dictionary[" "] # get blank index
criterion = build_criterion(critic_params={
'ctc': {'blank': blank_index},
})

@hdmjdp The training data is as below:
"aishell3/train/wav/SSB0018/audio/00180007.wav|25|sil r an4 #1 d a4 #0 j ia1 #1 k uai4 #0 d ian3 #1 x ia4 #0 l ai2 #4 。 eos"

After processing, the data becomes "blank_ r an blank_ d a blank_ j ia blank_ k uai blank_ d ian blank_ x ia blank_ l ai blank_ 。 blank_"

@yihuitang
I see. So all prosody labels are changed to blanks. When you train StyleTTS, do you also insert blank tokens in the phoneme sequence?

@hdmjdp Yes, StyleTTS is processed the same way as the ASR model.

@liuhuang31, if you process all prosody labels into blanks, then how do you control the pauses in TTS?

@GuangChen2016 Hello,
(1) For #1, add a blank, e.g.: "i#1love" -> "i blank_ love".
(2) For #2, use #2 as a phoneme and add blanks around it, e.g.: "i#2love" -> "i blank_ #2 blank_ love".
(3) #3 must be followed by punctuation, e.g.: "i#1love#3,you" -> "i blank_ love blank_ , blank_ you".
(4) #4 is the same as #3 and must be followed by punctuation, e.g.: "i#1love#3,you#4." -> "i blank_ love blank_ , blank_ you blank_ . blank_".
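
A rough sketch of these four rules (my own interpretation of the examples above; the tag format and the punctuation handling are assumptions):

import re

def insert_blanks(text):
    # (1) #1 -> a single blank token
    text = text.replace("#1", " blank_ ")
    # (2) #2 -> keep #2 as a phoneme, surrounded by blanks
    text = text.replace("#2", " blank_ #2 blank_ ")
    # (3)/(4) #3 and #4 are dropped; the punctuation that follows them is surrounded by blanks
    text = re.sub(r"#[34]\s*([,.!?;:，。！？])", r" blank_ \1 blank_ ", text)
    return " ".join(text.split())

print(insert_blanks("i#1love#3,you#4."))
# i blank_ love blank_ , blank_ you blank_ . blank_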

@liuhuang31 I see, thank you. Could you share some synthesized samples?

@GuangChen2016 You can give me several audio clips as speaker references (for which you hold the copyright, or from an open dataset), and give me some Chinese text to generate.

@liuhuang31
Synthesized Text: 杭州亚运会即将在9月开幕,这是继北京冬奥会之后,我国再次承办的一项国际大型体育赛事。然而,在这场盛会上,我们将看不到来自俄罗斯和白俄罗斯的运动员的身影。他们被国际奥委会以“技术原因”为由拒之门外,无缘参加杭州亚运会。
这一决定引起了我国的不满和反对。我国一直主张欢迎符合条件的俄罗斯和白俄罗斯运动员参加杭州亚运会,而不是对他们进行歧视和限制。我国认为,运动员是否参赛应该由他们自己的体育表现决定,而不是其他因素,包括战争等。我国还表示,愿意为他们搭建一个良好的参赛平台,让他们以中立身份参赛,并且不会影响奖牌的分配。
Reference audios are below:
ref.zip

@GuangChen2016 For certain reasons, the reference audio needs to be longer than 9 seconds, so the provided audio was repeated to reach 9 seconds.
The generated waveforms are below:
ref_gen.zip

@GuangChen2016 In addition, the reference audio can normally be any length. But when I convert the StyleTTS model to an ONNX model, the reference length is fixed to about 9 seconds, so the reference audio needs to be longer than 9 seconds.

@liuhuang31 It sounds good. Is this the result of a model trained on open-source datasets?

@sunnnnnnnny Yes, the datasets are AiShell-3, zhvoice, and VCTK.

Hi Liu,
As far as I know, VCTK is an English dataset. Why do you use it?

Hi Zhang,
VCTK is used just to support English pronunciation.

Thanks for your reply :)
The performance of your model is pretty good; I am going to reproduce it!

fighting😊

Do you concatenate these three datasets into the same train and valid lists? And does that mean your meldataset.py can process Chinese and English data at the same time?

These three datasets are in the same train and valid lists! meldataset.py can process Chinese, English, and mixed (zh and en in one sentence) data at the same time!

Okay, can you show part of your list and your modification of meldataset.py? It is hard for me :)

Sorry, it is part of a company project, so sharing is not very convenient, but some things can be shared: Chinese and English use a unified phoneme list; as for the tone and prosody processing, you can scroll up in this issue to see my answers.

Alright, thank you so much.

@yihuitang n_prods should be the number of tones (e.g., for Mandarin Chinese it should be 5, for Japanese it should be 2, for Cantonese it should be 6). The tones are represented as indices and encoded with one-hot encoding and converted to the prosody embedding self.embedding in the modified ProsodyPredictor. @yihuitang I used 128. It shouldn't matter that much to be honest.

If n_prods is set to 5 (tones 1, 2, 3, 4, 5), but there are 'X' and ' ' in the tone string, I think n_prods should be 7, because the prosody embedding cannot embed those indices with n_prods = 5, right?

@skysbird I think you are right, it should be 7 instead of 5 because we also have the space and X. I have never trained a Chinese-only model, only the one with Chinese, Japanese, and English, so in my setting n_prods = 9: five tones for Mandarin, 2 tones for Japanese, 0 tones for English, and 2 special tokens.

        d = self.text_encoder(texts, style, text_lengths, m)

I think DurationEncoder should be modified as well, because texts is now the concatenation of the text features and the prosody embedding, right?

Something like the code below?

class DurationEncoder(nn.Module):

    def __init__(self, sty_dim, d_model, nlayers, dropout=0.1):
        super().__init__()
        self.lstms = nn.ModuleList()
        for _ in range(nlayers):
            self.lstms.append(nn.LSTM(d_model + sty_dim + prod_embed * 2,  # <- is this the change needed here?
                                      d_model // 2,
                                      num_layers=1,
                                      batch_first=True,
                                      bidirectional=True,
                                      dropout=dropout))
            self.lstms.append(AdaLayerNorm(sty_dim, d_model))

@skysbird I think you are right. I have checked the code I have and it is indeed the case, though I didn't change the code for DurationEncoder.

class ProsodyPredictor(nn.Module):

    def __init__(self, n_prods, prod_embd, style_dim, d_hid, nlayers, dropout=0.1):
        super().__init__() 
        self.embedding = nn.Embedding(n_prods, prod_embd * 2)
        self.text_encoder = DurationEncoder(sty_dim=style_dim, 
                                            d_model=d_hid + prod_embd * 2,
                                            nlayers=nlayers, 
                                            dropout=dropout)

        self.lstm = nn.LSTM(d_hid + prod_embd * 2 + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
        self.duration_proj = LinearNorm(d_hid, 1)
        
        self.shared = nn.LSTM(d_hid + prod_embd * 2 + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
        self.F0 = nn.ModuleList()
        self.F0.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
        self.F0.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
        self.F0.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))

        self.N = nn.ModuleList()
        self.N.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
        self.N.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
        self.N.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))
        
        self.F0_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)
        self.N_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)

    def forward(self, texts, prosody, style, text_lengths, alignment, mel_lengths):
        prosody = self.embedding(prosody).transpose(-1, -2)

        texts = torch.cat([texts, prosody], axis=1).transpose(-1, -2)
        # mask = self.length_to_mask(text_lengths).to('cuda')
        
        d = self.text_encoder(texts, style, text_lengths)
        batch_size = d.shape[0]
        text_size = d.shape[1]
        # predict duration
        input_lengths = text_lengths.cpu().numpy()
        x = nn.utils.rnn.pack_padded_sequence(
            d, input_lengths, batch_first=True, enforce_sorted=False)
        m = self.length_to_mask(text_lengths).to(texts.device).unsqueeze(1)
        self.lstm.flatten_parameters()
        x, _ = self.lstm(x)
        x, _ = nn.utils.rnn.pad_packed_sequence(
            x, batch_first=True)
        duration = self.duration_proj(nn.functional.dropout(x, 0.5, training=self.training))
        
        en = (d.transpose(-1, -2) @ alignment)
        
        return duration.squeeze(-1), en
    
    def F0Ntrain(self, x, s):
        x, _ = self.shared(x.transpose(-1, -2))
        
        F0 = x.transpose(-1, -2)
        for block in self.F0:
            F0 = block(F0, s)
        F0 = self.F0_proj(F0)

        N = x.transpose(-1, -2)
        for block in self.N:
            N = block(N, s)
        N = self.N_proj(N)
        
        return F0.squeeze(1), N.squeeze(1)
    
    def length_to_mask(self, lengths):
        mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
        mask = torch.gt(mask+1, lengths.unsqueeze(1))
        return mask
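
As a quick sanity check of why d_model becomes d_hid + prod_embd * 2 here, a small shape sketch (my own, with hypothetical sizes):

import torch
import torch.nn as nn

d_hid, prod_embd, n_prods, B, T = 512, 128, 9, 2, 40   # hypothetical sizes; n_prods = 9 as above
texts = torch.randn(B, d_hid, T)                        # text encoder output, channels first
tones = torch.randint(0, n_prods, (B, T))               # one tone index per phoneme position
prosody = nn.Embedding(n_prods, prod_embd * 2)(tones).transpose(-1, -2)
x = torch.cat([texts, prosody], dim=1)                  # (B, d_hid + prod_embd * 2, T)
print(x.shape)                                          # torch.Size([2, 768, 40])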

So, is Mandarin supported now?

@skysbird I think you are right. I have checked the code I have and it is indeed the case, though I didn't change the code for DurationEncoder.

Sorry, can I have your newest DurationEncoder code? I think the code in the repository is old.

@skysbird The repo is fine as it is now because it's a demo repo for English only. For Mandarin I think a separate repo is needed. Maybe someone can clone it and make the necessary changes for that.

@yl4579 Hi, I have trained multi-speaker models (both stage 1 and stage 2) and the synthesized results sound good. I wonder how to do adaptation training using the pretrained multi-speaker models. My idea is as follows: continue first-stage training on the target speaker's corpus, starting from the multi-speaker first-stage model, and then, based on that first-stage model of the target speaker, perform second-stage training with the target speaker's corpus.
Is this process correct? Or do you have any other ideas for adaptation training with pretrained multi-speaker models? Thanks again.

@GuangChen2016 Yes, this is generally correct for StyleTTS, but for StyleTTS2 fine-tuning is a little easier, as you only need to run the joint training phase if the corpus is close enough to the distribution of the pre-trained models.

liur lˈjoʊɹ
tunr tʰˈwɚ
lunr lˈwɚ
tour tʰˈoʊɹ
hour xˈoʊɹ
tianr tʰˈjɚ
mianr mˈjɚ
mar mˈaɹ
pianr pʰˈjɚ
maor mˈaʊɹ
cair tsʰˈɚ
far fˈaɹ
shuor ʂˈwɔɹ
kanr kʰˈaɹ
banr pˈaɹ
ger kˈɚ
sher ʂˈɚ
gunr kˈwɚ
beir pˈɚ
chuanr ʈʂʰˈwɚ
bar pˈaɹ
cunr tsʰˈwɚ
tiaor tʰˈjaʊɹ
shuar ʂˈwaɹ
tur tʰˈuɹ
zhaor ʈʂˈaʊɹ
cher ʈʂʰˈɚ
menr mˈɚ
qingr tɕʰˈjɚ
shanr ʂˈaɹ
mor mˈwɔɹ
zhur ʈʂˈuɹ
wangr wˈɑɹ
zhunr ʈʂˈwɚ
zhir ʈʂˈɚ
haor xˈaʊɹ
shuir ʂˈwɚ
guor kˈwɔɹ
zaor tsˈaʊɹ
juanr tɕˈyɚ
jiar tɕˈjaɹ
xiaor ɕˈjaʊɹ
suor sˈwɔɹ
shaor ʂˈaʊɹ
yir ˈɚ
dir tˈɚ
ganr kˈaɹ
duir tˈwɚ
taor tʰˈaʊɹ
lianr lˈjɚ
benr pˈɚ
fanr fˈaɹ
xuer ɕˈyɚ
pur pʰˈuɹ
jinr tɕˈɚ
kour kʰˈoʊɹ
ker kʰˈɚ
mur mˈuɹ
liaor lˈjaʊɹ
juer tɕˈyɚ
your jˈoʊɹ
xianr ɕˈjɚ
quanr tɕʰˈyɚ
yo jˈoʊ
sanr sˈaɹ
zhuor ʈʂˈwɔɹ
tuor tʰˈwɔɹ
naor nˈaʊɹ
dar tˈaɹ
fur fˈuɹ
dunr tˈwɚ
langr lˈɑɹ
dair tˈaɹ
huar xˈwaɹ
yangr jˈɑɹ
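
Here is a minimal conversion sketch (an editor's assumption, not code from the repo): it presumes a dict named PINYIN_TO_IPA built from the table above and uses pypinyin's lazy_pinyin for the character-to-pinyin step.

from pypinyin import lazy_pinyin

PINYIN_TO_IPA = {
    "ba": "pˈa",
    "shang": "ʂˈɑŋ",
    "xue": "ɕˈyɛ",
    "xiao": "ɕˈjaʊ",
    # ... fill in the remaining rows of the table above
}

def hanzi_to_ipa(text):
    # lazy_pinyin gives toneless pinyin syllables, e.g. "学校" -> ["xue", "xiao"]
    syllables = lazy_pinyin(text)
    return "".join(PINYIN_TO_IPA[s] for s in syllables)

Note that tones are not handled here; as described in the next point, they go into a separate tone sequence.
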
  2. You need to add a tone embedding for languages like Chinese and Japanese. For example, replace the ProsodyPredictor with the following code (i.e., concatenate the prosody embedding with the text embedding):
class ProsodyPredictor(nn.Module):

    def __init__(self, n_prods, prod_embd, style_dim, d_hid, nlayers, dropout=0.1):
        super().__init__() 
        self.embedding = nn.Embedding(n_prods, prod_embd * 2)
        self.text_encoder = DurationEncoder(sty_dim=style_dim, 
                                            d_model=d_hid,
                                            nlayers=nlayers, 
                                            dropout=dropout)

        # the duration LSTM input now also carries the tone/prosody embedding (prod_embd * 2)
        self.lstm = nn.LSTM(d_hid + prod_embd * 2 + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
        self.duration_proj = LinearNorm(d_hid, 1)
        
        self.shared = nn.LSTM(d_hid + prod_embd * 2 + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
        self.F0 = nn.ModuleList()
        self.F0.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
        self.F0.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
        self.F0.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))

        self.N = nn.ModuleList()
        self.N.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
        self.N.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
        self.N.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))
        
        self.F0_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)
        self.N_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)


    def forward(self, texts, prosody, style, text_lengths, alignment, m):
        # embed the tone/prosody ids and concatenate them with the text embeddings
        prosody = self.embedding(prosody)
        texts = torch.cat([texts, prosody], axis=1)
        d = self.text_encoder(texts, style, text_lengths, m)
        
        batch_size = d.shape[0]
        text_size = d.shape[1]
        
        # predict duration
        input_lengths = text_lengths.cpu().numpy()
        x = nn.utils.rnn.pack_padded_sequence(
            d, input_lengths, batch_first=True, enforce_sorted=False)
        
        m = m.to(text_lengths.device).unsqueeze(1)
        
        self.lstm.flatten_parameters()
        x, _ = self.lstm(x)
        x, _ = nn.utils.rnn.pad_packed_sequence(
            x, batch_first=True)
        
        x_pad = torch.zeros([x.shape[0], m.shape[-1], x.shape[-1]])

        x_pad[:, :x.shape[1], :] = x
        x = x_pad.to(x.device)
                
        duration = self.duration_proj(nn.functional.dropout(x, 0.5, training=self.training))
        
        en = (d.transpose(-1, -2) @ alignment)

        return duration.squeeze(-1), en
    
    def F0Ntrain(self, x, s):
        x, _ = self.shared(x.transpose(-1, -2))
        
        F0 = x.transpose(-1, -2)
        for block in self.F0:
            F0 = block(F0, s)
        F0 = self.F0_proj(F0)

        N = x.transpose(-1, -2)
        for block in self.N:
            N = block(N, s)
        N = self.N_proj(N)
        
        return F0.squeeze(1), N.squeeze(1)
    
    def length_to_mask(self, lengths):
        mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
        mask = torch.gt(mask+1, lengths.unsqueeze(1))
        return mask
  3. Modify meldataset.py to return the tones for each IPA and change your train_list.txt in the following format:
data/aishell/train/wav/SSB1100/SSB11000297.wav|$ʈʂˈɑŋxˈweɪˈi ʈʂʰˈujˈɛntˈɤ tˈjɛnˈin jˈoʊʂˈənmˈɤ$|X111114444422 111113333555 44444333 33332222555X|382
data/aishell/train/wav/SSB1567/SSB15670392.wav|$ʂˈʐ fˈuʂˈʐ ʂˈʐlˈɤ ʈʂˈəŋtʰˈi jˈɛˈu$fˈaʈʂˈantˈɤ xˈɤɕˈin tɕʰˈytˈʊŋlˈi$|X444 444444 111444 222223333 44444 11133333555 2221111 111114444444X|274
data/aishell/train/wav/SSB0603/SSB06030228.wav|$xˈwɔtˈjɛn tˈəŋ tˈwɔxˈɑŋjˈɛ tɕˈiɑŋʂˈoʊ pˈwɔtɕˈi$|X333344444 3333 11112222444 1111114444 11112222X|223
data/aishell/train/wav/SSB0588/SSB05880296.wav|$ˈinɥˈɛ ˈiʂˈəŋ sˈwɔˈaɪ$|X111444 441111 3333444X|378
data/aishell/train/wav/SSB0315/SSB03150316.wav|$ʈʂʰˈuɕˈyɛʈʂˈɤ kʰˈɤ ʂˈʐˈjʊŋ tɕˈjaʊʈʂʰˈɑŋtˈɤ ˈiɕˈjɛ tʰˈjaʊʂˈəŋ$|X1111122223333 2222 3334444 444444222222555 441111 4444442222X|241
data/aishell/train/wav/SSB0631/SSB06310452.wav|$ɕˈjɛntsˈaɪ tɕˈitɕʰˈi ɕˈyɛxˈweɪ kˈənɹˈən kˈoʊtʰˈʊŋ$|X4444444444 111144444 222244444 11112222 111111111X|229
data/aishell/train/wav/SSB1935/SSB19350402.wav|$xˈwansˈwɔtˈɤ ʂˈʐ tɕʰˈyʂˈʐ lˈaʊpˈaɹtˈɤsˈwən tsˈɹ$|X444443333555 444 44444444 3333444455511111 5555X|345
data/aishell/train/wav/SSB1203/SSB12030292.wav|$pˈiɹˈu tsˈweɪtɕˈin sˈannˈjɛn tɕˈiŋˈiŋ ʈʂˈwɑŋkʰˈwɑŋ lˈiɑŋxˈaʊtˈəŋ$|X333222 44444444444 111122222 11111222 444444444444 2222222223333X|377
data/aishell/train/wav/SSB1024/SSB10240312.wav|$xˈaˈɚ pˈinʂˈʐ tˈiˈu sˈɹʈʂˈʊŋɕˈyɛtˈɤ ʈʂˈaʊpʰˈaɪ ˈy pʰˈɑŋpˈjɛn ʂˈɑŋxˈu ɕˈiɑŋpˈi$|X11133 1111444 44433 444111112222555 1111155555 33 2222211111 1111444 11111333X|231
data/jvs_ver1/jvs088/parallel100/wav24kHz16bit/VOICEACTRESS100_037.wav|$kˈomˈʲɯːɴ ɯˈa $ sˈeːnˈɯ gˈaɯˈa tˈo $ esˈo ɴ nˈɯ kˈaɯˈa nˈo $ gˈoːɽˈʲɯː tɕˈitˈeɴ tˈo nˈaʔ tˈe iɽˈɯ$|XLLLHHHHLL LLL X LLLHHHH LLLLLL LLL X LHHH H HHH LLLLLL LLL X LLLHHHHHH HHHHLLLL LLL HHHL LLL LHHHX|88
data/aishell/train/wav/SSB0671/SSB06710188.wav|$tɕˈiɑŋɕˈjɛn nˈanfˈan ˈjʊŋxˈʊŋ fˈɑŋɕˈin lˈiɑŋjˈoʊ ʈʂˈʊŋɕˈintˈjɛn$|X44444444444 22222222 33332222 44441111 222222222 11111111144444X|363
data/aishell/train/wav/SSB0380/SSB03800184.wav|$kʰˈɤ ɥˈɛxˈan tɕˈjoʊʂˈʐ tʰˈiŋpˈu tɕˈintɕʰˈy$|X3333 1114444 444444444 11111222 4444444444X|323
data/aishell/train/wav/SSB0760/SSB07600247.wav|$tˈɑŋɹˈan wˈɔ ɕˈjɛntsˈaɪ ˈitɕˈiŋ mˈeɪjˈoʊ ʈʂˈɤkˈɤ tsˈɹkˈɤ tsˈaɪkˈən nˈiʂˈwɔ ʈʂˈɤkˈɤ xˈwaɹ$|X11112222 333 4444444444 3311111 22223333 4444444 1111222 444441111 3331111 4444444 44444X|237
data/aishell/train/wav/SSB0016/SSB00160083.wav|$pˈaʂˈʐˈutˈjɛn lˈjoʊlˈiŋtɕʰˈi$|X1112222233333 44444222211111X|245

where X and $ represent the SOS and EOS.
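
For illustration, a minimal parsing sketch (an assumption, not taken from meldataset.py) for one line of this tone-augmented train_list.txt:

def parse_train_line(line):
    # fields: wav path | IPA string (with $ markers) | tone string (with X markers) | speaker id
    wave_path, phonemes, tones, speaker_id = line.strip().split("|")
    # the tone string appears to be aligned character-by-character with the IPA string
    return wave_path, phonemes, tones, int(speaker_id)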

I'll leave this issue open for someone to fork the repo and modify it for Mandarin and Japanese support. I'm unfortunately too busy to work on it now.

Is there any such mapping between CMUdict phonemes and IPA? We need to try a mixed Mandarin-English TTS instance, so both Mandarin characters and English words should be turned into the same IPA inventory for each example.

Hi yl4579, where is the look-up table from, please? The IPA sequence seems weird because its length may be even longer than the length of the Pinyin string.

@JohnHerry This is exactly the same IPA. The whole point of IPA is that it is international, meaning it can represent the sounds of all human languages. I got this table from Wikipedia: https://en.wikipedia.org/wiki/Help:IPA/Mandarin

Got it, thanks a lot.

Since symbols = _pause + _initials + [i + j for i in _finals for j in _tones], the YunMu (finals) should end with a tone.
But in the example below, the YunMu do not seem to end with a tone.

@hdmjdp The train data is as below: "aishell3/train/wav/SSB0018/audio/00180007.wav|25|sil r an4 #1 d a4 #0 j ia1 #1 k uai4 #0 d ian3 #1 x ia4 #0 l ai2 #4 。 eos"

after processing, its data is "blank_ r an blank_ d a blank_ j ia blank_ k uai blank_ d ian blank_ x ia blank_ l ai blank_ 。 blank_"

yl4579 says the tone should be handled by the F0 model, not by the ASR model (yl4579/AuxiliaryASR#2 (comment)), so I guess that YunMu with tones are used to train StyleTTS and YunMu without tones are used to train the ASR model.

@liuhuang31 Your samples sound great!
I'm a little confused: which phoneme types did you use to train StyleTTS and the ASR model? Is my guess correct?

@hermanseu Hi hermanseu: for the ASR and StyleTTS models, I use the same phoneme types. Since I treat tone as a separate feature, my text feature is phoneme_feature + tone_feature.
[screenshot of the text feature layout]

Therefore, my symbols are as below (the same set is used for the ASR and StyleTTS models):

_pause = ["sil","eos","sp","","#0","#1","#2","#3","#4"]
_initials = ["b","c","ch","d","f","g","h","j","k","l","m","n","p","q","r","s","sh","t","w","x","y","z","zh",]
_finals = ["a","ai","an","ang","ao","e","ei","en","eng","er","i","ia","ian","iang","iao","ie","ii","iii","in","ing","iong","iou","o","ong","ou","u","ua","uai","uan","uang","uei","uen","ueng","uo","v","van","ve","vn","xr"]
_cmu = ["AA","AE","AH","AO","AW","AY","EH","ER","EY","IH","IY","OW","OY","UH","UW","P","B","CH","D","DH","F","G","HH","JH","K","L","M","N","NG","R","S","SH","T","TH","V","W","Y","Z","ZH",]
_punc = ["?","!",",",".",";",":","?","!",",","。",";",":","、",]

symbols = _pause + _initials + _finals + _cmu + _punc
tone_symbols = ['~','0','1','2','3','4','5']
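
As a small illustrative sketch (an assumption, not from the thread), mapping these symbols and tones to integer ids could look like:

symbol_to_id = {s: i for i, s in enumerate(symbols)}
tone_to_id = {t: i for i, t in enumerate(tone_symbols)}

def encode(phonemes, tones):
    # e.g. phonemes = ["sh", "ang", "x", "iao"], tones = ["5", "5", "3", "3"]
    return [symbol_to_id[p] for p in phonemes], [tone_to_id[t] for t in tones]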

@hermanseu Hi hermanseu: if you do not want to use tone as a separate feature,

the below is a usual way to process the symbols, which can be used for training both the ASR and StyleTTS models:

symbols = _pause + _initials + [i + j for i in _finals for j in _tones]

@liuhuang31 Thank you for your reply. I currently use the symbols as

symbols = _pause + _initials + [i + j for i in _finals for j in _tones]

I trained the ASR for 80 epochs, but the WER is still 0.4xx. Can you show some ASR training logs?
Another question: when there are silence fragments in the audio, how should the silence be labeled, just ignored or given a sil label?

@hermanseu Hi hermanseu: my symbols do not include tones, so they may not be suitable for comparison.
[screenshots of ASR training logs]

As for silence, I will give it a sil label.

@liuhuang31 Got it, thanks a lot.
So the front and end silence is just labeled as blank, and the middle silence is labeled as sil?
What happens after 60 epochs?

@hermanseu Hi hermanseu: #2 and #3 are treated as phonemes; the front and end silence is labeled as sil, and other silence can be labeled as blank.
I don't remember exactly; after 60 epochs, maybe I changed the learning rate or added a new dataset to fine-tune.

@liuhuang31 Hi liuhuang31, thank you very much. It is helpful.

Hi @yl4579 , a quick update:

I've created a script to convert pinyins to IPAs and get filelists in the desired format for Mandarin. Here are train and val lists for aishell3 dataset: train_list_aishell3.txt val_list_aishell3.txt

Class ProsodyPredictor is also updated with your code above. And then I tried to update meldataset.py but got stuck.

  1. What should n_prods and prod_embd be? Should they be stored in config.yml?
  2. I can get the tone for each IPA symbol, but where and how should I use the tone? (See the sketch after the code below.)
class FilePathDataset(torch.utils.data.Dataset):
    def __init__(self,
                 data_list,
                 sr=24000,
                 data_augmentation=False,
                 validation=False,
                 ):

        spect_params = SPECT_PARAMS
        mel_params = MEL_PARAMS

        #_data_list = [l[:-1].split('|') for l in data_list]
        _data_list = [l.split('|') for l in data_list]
        self.data_list = [data if len(data) == 4 else (*data, 0) for data in _data_list]
        self.text_cleaner = TextCleaner()
        self.sr = sr

        self.to_melspec = torchaudio.transforms.MelSpectrogram(**MEL_PARAMS)

        self.mean, self.std = -4, 4
        self.data_augmentation = data_augmentation and (not validation)
        self.max_mel_length = 192

#         self.global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True,  with_stress=True)

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, idx):
        data = self.data_list[idx]
        path = data[0]

        wave, text_tensor, tone_tensor, speaker_id = self._load_tensor(data)

        mel_tensor = preprocess(wave).squeeze()

        acoustic_feature = mel_tensor.squeeze()
        length_feature = acoustic_feature.size(1)
        acoustic_feature = acoustic_feature[:, :(length_feature - length_feature % 2)]

        # note: tone_tensor is loaded above but not yet returned or used here
        return speaker_id, acoustic_feature, text_tensor, path

    def _load_tensor(self, data):
        wave_path, text, tone, speaker_id = data
        speaker_id = int(speaker_id)
        wave, sr = sf.read(wave_path)
        if wave.shape[-1] == 2:
            wave = wave[:, 0].squeeze()
        if sr != 24000:
            wave = librosa.resample(wave, sr, 24000)
            print(wave_path, sr)

        wave = np.concatenate([np.zeros([5000]), wave, np.zeros([5000])], axis=0)

        text = self.text_cleaner(text)
        tone = self.text_cleaner(tone)

        text.insert(0, 0)
        text.append(0)

        tone.insert(0, 0)
        tone.append(0)

        text = torch.LongTensor(text)
        tone = torch.LongTensor(tone)

        return wave, text, tone, speaker_id

    def _load_data(self, data):
        wave, text_tensor, tone, speaker_id = self._load_tensor(data)
        mel_tensor = preprocess(wave).squeeze()

        mel_length = mel_tensor.size(1)
        if mel_length > self.max_mel_length:
            random_start = np.random.randint(0, mel_length - self.max_mel_length)
            mel_tensor = mel_tensor[:, random_start:random_start + self.max_mel_length]

        return mel_tensor, speaker_id
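
One possible direction, purely as an editor's sketch under assumptions (not an official answer): return the tone tensor from __getitem__ as well, pad it alongside the text in the collater, and pass it to ProsodyPredictor.forward as the prosody argument. For example:

def getitem_with_tone(dataset, idx):
    # hypothetical variant of __getitem__ that also surfaces the tone tensor
    data = dataset.data_list[idx]
    wave, text_tensor, tone_tensor, speaker_id = dataset._load_tensor(data)
    mel_tensor = preprocess(wave).squeeze()   # preprocess as defined in meldataset.py
    length = mel_tensor.size(1)
    mel_tensor = mel_tensor[:, :(length - length % 2)]
    return speaker_id, mel_tensor, text_tensor, tone_tensor, data[0]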

Hi, yihui
I saw that all your Mandarin IPA phonemes are compact, with no space between characters. I want to make a ShengYunMu-based TTS model; is there any symbol alphabet of IPA syllables for that? I cannot work out the single units from the mapping in #10 (comment).

@JohnHerry did you try to add a space in the mapping? For example:
'ba': 'pˈ a', <---- there's a space between pˈ and a
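
A minimal sketch of that idea (an assumption, not from the thread): since the stress mark in this table sits between the initial and the final, a spaced variant of the mapping can be derived mechanically.

PINYIN_TO_IPA = {"ba": "pˈa", "xue": "ɕˈyɛ", "a": "ˈa"}   # hypothetical subset of the table above

def split_initial_final(ipa):
    # "pˈa" -> "pˈ a": split ShengMu and YunMu at the stress mark
    head, sep, tail = ipa.partition("ˈ")
    return f"{head}ˈ {tail}" if tail else ipa

spaced_table = {py: split_initial_final(ipa) for py, ipa in PINYIN_TO_IPA.items()}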

hello @liuhuang31,
there is no punctuation in the aishell-3 dataset, so how can I train a model that can use punctuation to control pauses? Thanks for the reply.

@skysbird Hi skysbird,
For punctuation, you can use a frontend to predict prosody, then use MFA to change the prosody.

If I want to use punctuation to control the pause, I must have a dataset that has punctuation, am I right? And yes, I'm Chinese; maybe we are in the same WeChat group :)

@skysbird Hi, yes, your dataset must have punctuation or other pause symbols to control the pause.

@JohnHerry did you try to add a space in the mapping? For example: 'ba': 'pˈ a', <---- there's a space between pˈ and a

Thanks for the advice. I have noticed that there is a stress symbol in the IPA mapped from Pinyin. I think Chinese utterances do not need the stress symbol, but it can be viewed as a splitter between the ShengMu and YunMu.

@skysbird Hi skysbird, For punctuation, you can use a frontend to predict prosody, then use MFA to change the prosody.

We had tried a BERT-based text-prosody predictor, but it was not good enough, especially for PP (#2) and IP (#3), which got low precision and recall. What is more, the predicted text prosody [or pause] is averaged, which I think is not good for multi-speaker models.

For Mandarin, I didn't use IPA phonemes; I use pinyin initials and finals as phonemes.

  1. You can use pypinyin to generate the pinyin (see the sketch after the symbol definitions below).
  2. The _initials and _finals follow pypinyin; the symbols are then as below:

_pause = ["sil", "eos", "sp", ...]
_initials = ["b", "c", "ch", ...]
_finals = ["a", "ai", ...]
_tones = ["1", "2", "3", "4", "5"]
symbols = _pause + _initials + [i + j for i in _finals for j in _tones]
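
A rough sketch of that pipeline (an editor's assumption; the function name and the exact pypinyin styles are not from the thread):

from pypinyin import pinyin, Style

def to_shengyunmu(text):
    # ShengMu (initials) and YunMu+tone (finals with a trailing tone digit) per character
    initials = [s[0] for s in pinyin(text, style=Style.INITIALS, strict=False)]
    finals_t = [s[0] for s in pinyin(text, style=Style.FINALS_TONE3, strict=False)]
    phones, tones = [], []
    for ini, fin in zip(initials, finals_t):
        tone = fin[-1] if fin and fin[-1].isdigit() else "5"   # neutral tone -> 5
        final = fin[:-1] if fin and fin[-1].isdigit() else fin
        if ini:                        # zero-initial syllables give an empty string here
            phones.append(ini)
            tones.append(tone)
        phones.append(final)
        tones.append(tone)
    return phones, tones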

@liuhuang31 Excuse me, I see the symbols include _pause; does that mean the ASR text labels come from the MFA results?

@sunnnnnnnny Hi sunnnnnnnny. First use the frontend to predict the text's prosody (pauses), then use the MFA results to adjust that prosody. For ASR, #1 is not used and is removed; #2 and #3 are treated as phonemes.

Thank you for the quick reply; I see it.

@liuhuang31 Excuse me, can you share some first-stage training loss curves?

@sunnnnnnnny [screenshots of first-stage training loss curves]

thanks a lot!

@sunnnnnnnny Hi sunnnnnnnny. First use the frontend to predict the text's prosody (pauses), then use the MFA results to adjust that prosody. For ASR, #1 is not used and is removed; #2 and #3 are treated as phonemes.

Hi liuhuang31, in the StyleTTS paper there is a "style vector" that helps predict duration, speech speed, and emotion. Does that mean there is no need for the frontend prosody anymore? Have you tried a training instance without those prosody symbols? How did that go?

@JohnHerry Hi, the 'style vector' really helps predict duration and speech speed, so I think not using prosody is OK. The prosody marks '#2' and '#3' used in StyleTTS are treated as phonemes (StyleTTS does not use '#1'), so the pause time can be controlled manually.
I haven't trained another experiment without the prosody phonemes (#2 and #3 as phonemes).

Understood, thank you very much.

Sorry, I forgot to reply. I didn't change class ProsodyPredictor(nn.Module) in models.py.

Hi liuhuang31, how did you train the Chinese pinyin PL-BERT model? Did you treat the ShengMu, YunMu, and YinDiao as separate phonemes, or the whole pinyin syllable as a single phoneme? And how did you get so much annotated Chinese text corpus? As far as I know, pypinyin-generated pinyin is error-prone, so I don't think it is a good way to build the PL-BERT corpus.

Hi JohnHerry, (1) I didn't train a phoneme-level BERT model. In the following, ShengMu is _initials, YunMu is _finals, and YinDiao is _tones. The text features are phoneme, prosody, and tone; the phoneme features treat the ShengMu and YunMu as separate phonemes.

For example, given the text “去上学校”:
First, we generate its prosody: “去上学校” -> “去#1上#1学校#4.”
Second, we use pypinyin to generate the Chinese pinyin: “去#1上#1学校#4.” -> “去#1上#1学校#4.|qu5 shang5 xue3 xiao3”
Third, we generate its text features (phoneme, prosody, tone): “去#1上#1学校#4.|qu5 shang5 xue3 xiao3” -> "q u sh ang x ue x iao|#1 #1 #1 #1 #0 #0 #4 #4|5 5 5 5 3 3 3 3".
Of course, you should then convert the phoneme, prosody, and tone symbols to ids.

_pause = ["sil", "eos", "sp", ...]
_initials = ["b", "c","ch", ...]
_finals = ["a", "ai", ...]
_tones = ["1", "2", "3", "4", "5"]
symbols = _pause + _initials + [i + j for i in _finals for j in _tones]

(2) As for me, I just use the open dataset aishell3 (the zhvoice dataset can also be used, but its quality is very poor).

(3) Yes, "pypinyin-generated pinyin is error-prone", but in my view, if the dataset is big enough, the errors will average out. Also, in my experiment with the aishell3 dataset, I can generate normal-sounding audio that is not bad.

"q u sh ang x ue x iao|#1 #1 #1 #1 #0 #0 #4 #4|5 5 5 5 3 3 3 3" -> the symbols you use are YunMu without tones (tones are placed in the last column)
"symbols = _pause + _initials + [i + j for i in _finals for j in _tones]" -> the symbols are YunMu with tones
So, I'm confused. Which one is it?

@zhouyong64 You can use:
(1) "q u sh ang x ue x iao|#1 #1 #1 #1 #0 #0 #4 #4|5 5 5 5 3 3 3 3".
(2) You can also use "q5 u5 sh5 ang5 x5 ue3 x3 iao3|#1 #1 #1 #1 #0 #0 #4 #4".
(3) You can even use "q u5 sh ang5 x ue3 x iao3|#1 #1 #1 #1 #0 #0 #4 #4".
(4) Or prosody as a phoneme (if there is #2/#3 level prosody): "q5 u5 #2 sh5 ang5 x5 ue3 x3 iao3".
(5) Or prosody as a phoneme (if there is #2/#3 level prosody): "q u #2 sh ang x ue x iao|5 5 0 5 5 3 3 3 3".
....

Finally, I use (5) to train the model.
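
As a tiny formatting check (grounded in option (5) above), building that string from parallel lists would look like:

phones = ["q", "u", "#2", "sh", "ang", "x", "ue", "x", "iao"]
tones = ["5", "5", "0", "5", "5", "3", "3", "3", "3"]
line = " ".join(phones) + "|" + " ".join(tones)
# -> "q u #2 sh ang x ue x iao|5 5 0 5 5 3 3 3 3"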

Hi, thanks for your helpful information. Could you also provide an inference code sample for other languages like Chinese for StyleTTS? Many thanks in advance.