KevinEloff/learning-to-speak

Learning to Speak and Hear Through Multi-Agent Communication over a Continuous Acoustic Channel

eSpeak example audio

Here we give samples of eSpeak-generated audio, synthesised from eSpeak's internal phonetic descriptions. For each sample we give the text, the phonetic description, and the audio output. Our agents use eSpeak's phone set, which we convert to IPA for display (using lexconvert).

"Hello World": hələʊ wəːld

Tacotron 2 + HiFi-GAN example audio

Here we give samples of audio generated with Tacotron 2 + HiFi-GAN. For each sample we give the text, the phonetic description, and the audio output.

"Hello World": HH AH0 L OW1 W ER1 L D
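
The phonetic description above appears to use ARPAbet-style symbols, where vowels carry a trailing stress digit (0, 1 or 2) and consonants carry none. As a minimal sketch of how such a string could be separated into phones and stress levels, assuming that convention holds: the helper name `split_stress` is hypothetical and is not taken from this repository.

```python
import re

# Hypothetical helper (not part of the repository): split an ARPAbet-style
# phone string into (phone, stress) pairs. Assumes upper-case symbols with
# an optional trailing stress digit 0-2; consonants get stress None.
_PHONE = re.compile(r"([A-Z]+)([0-2])?")


def split_stress(description):
    pairs = []
    for token in description.split():
        match = _PHONE.fullmatch(token)
        if match is None:
            raise ValueError(f"not an ARPAbet-style token: {token!r}")
        phone, stress = match.groups()
        pairs.append((phone, int(stress) if stress is not None else None))
    return pairs


pairs = split_stress("HH AH0 L OW1 W ER1 L D")
for phone, stress in pairs:
    print(phone, stress)
```

Keeping the stress digit separate from the phone symbol makes it easier to compare this phone set with the stress-free IPA strings shown for eSpeak.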

Tacotron samples

Here we vary the first attribute (s1) and hold the other attributes constant.

s1 = 0: ɡoikiksss

s1 = 1: sikiksss

s1 = 2: iikiksss

s1 = 3: iiksssss

s1 = 4: aikkssss

Grounded one-word audio samples

Target word	Ground truth	Predicted phones
Up	`ʌp`	`ʌvb`
Down	`daʊn`	`daʊ`
Left	`lɛft`	`lɛ`
Right	`ɹaɪt`	`ɹaɪʃjəːn`
