Text2Video

This is the code for the ICASSP 2022 paper "Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary". Project Page

Introduction

With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesize video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) It only needs a fraction of the training data used by an audio-driven approach; 2) It is more flexible and not subject to vulnerability due to speaker variation; 3) It significantly reduces the preprocessing, training and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
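The idea of generating frames from interpolated phoneme poses can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the names `lerp`, `poses_for_phonemes`, and `toy_dictionary` are hypothetical, poses are reduced to short lists of floats, and plain linear interpolation stands in for whatever interpolation scheme the paper actually uses.

```python
from typing import Dict, List, Sequence, Tuple

Pose = List[float]  # flattened landmark coordinates for one frame (toy representation)


def lerp(a: Sequence[float], b: Sequence[float], t: float) -> Pose:
    """Linearly blend two poses; t=0 gives a, t=1 gives b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def poses_for_phonemes(
    timed_phonemes: Sequence[Tuple[str, float]],
    dictionary: Dict[str, Pose],
    fps: float,
) -> List[Pose]:
    """Place each phoneme's key pose at its timestamp and fill the frames in between."""
    keys = [(t, dictionary[p]) for p, t in timed_phonemes]
    if not keys:
        return []
    frames: List[Pose] = []
    for (t0, p0), (t1, p1) in zip(keys, keys[1:]):
        n = max(1, round((t1 - t0) * fps))  # number of frames in this transition
        frames.extend(lerp(p0, p1, i / n) for i in range(n))
    frames.append(list(keys[-1][1]))  # finish on the last key pose
    return frames


# Hypothetical two-value poses; a real dictionary would map phonemes to full key poses.
toy_dictionary = {"SH": [0.0, 0.0], "IY": [1.0, 2.0]}
frames = poses_for_phonemes([("SH", 0.0), ("IY", 0.2)], toy_dictionary, fps=10)
```

In the approach described above, a sequence like `frames` would be the input to the trained GAN, which renders the final video; that rendering step is not sketched here.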

Data / Preprocessing

Set up

Git clone repo

git clone git@github.com:sibozhang/Text2Video.git

Download and install the modified vid2vid repo: vid2vid

Arrange the data and folders in the following layout

Text2Video
├── *phoneme_data
├── model
├── ...
vid2vid
├── ...
venv
├── vid2vid
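As a quick sanity check, the layout above can be verified before running anything. This is a minimal sketch, not part of the repository: the helper name `missing_layout_entries` is hypothetical, and it assumes that `*phoneme_data` stands for any folder whose name ends in `phoneme_data`.

```python
import glob
import os
from typing import List


def missing_layout_entries(root: str) -> List[str]:
    """Return the expected entries that are absent under root."""
    missing = []
    # Directories named explicitly in the layout ("model" is the unpacked vosk model).
    for rel in ("Text2Video", "Text2Video/model", "vid2vid", "venv/vid2vid"):
        if not os.path.isdir(os.path.join(root, rel)):
            missing.append(rel)
    # "*phoneme_data" is read here as a glob pattern; that reading is an assumption.
    if not glob.glob(os.path.join(root, "Text2Video", "*phoneme_data")):
        missing.append("Text2Video/*phoneme_data")
    return missing
```

Called as `missing_layout_entries("..")` from inside the Text2Video folder, an empty list means every expected folder was found.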

Setup env

sudo apt-get install sox libsox-fmt-mp3
pip install zhon
pip install moviepy
pip install ffmpeg
pip install dominate
pip install pydub

For Chinese, we use vosk to get the timestamp of each word. Please install vosk following https://alphacephei.com/vosk/install and unpack the model as 'model' in the current folder, or install the Python packages with pip:

pip install vosk
pip install cn2an
pip install pypinyin
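For orientation, word timestamps can be read from vosk roughly as follows. This is a sketch under assumptions, not the repository's own code: the helper names `words_from_result` and `word_timestamps` are hypothetical, the input is assumed to be a mono 16-bit PCM WAV file, and the `sample` string is a hand-written illustration of the result shape rather than real recognizer output.

```python
import json
import wave
from typing import Dict, List


def words_from_result(result_json: str) -> List[Dict]:
    """Extract the per-word entries (word, start, end, conf) from one vosk result string."""
    return json.loads(result_json).get("result", [])


def word_timestamps(wav_path: str, model_dir: str = "model") -> List[Dict]:
    """Run vosk over a WAV file and collect per-word timings."""
    from vosk import KaldiRecognizer, Model  # lazy import; needs `pip install vosk`

    words: List[Dict] = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        rec.SetWords(True)  # ask vosk to include word-level start/end times
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            if rec.AcceptWaveform(data):
                words.extend(words_from_result(rec.Result()))
        words.extend(words_from_result(rec.FinalResult()))
    return words


# Hand-written example of the result shape (illustrative values only).
sample = '{"result": [{"conf": 1.0, "end": 0.42, "start": 0.09, "word": "今天"}], "text": "今天"}'
parsed = words_from_result(sample)
```

The default `model_dir="model"` matches the folder name mentioned above; the JSON parsing is kept in its own function so it can be checked without a model installed.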

Testing

Activate the virtual environment vid2vid

source ../venv/vid2vid/bin/activate

Generate video with real audio in English

sh text2video_audio.sh $1 $2

Generate video with TTS audio in English

sh text2video_tts.sh $1 $2 $3

Generate video with TTS audio in Chinese

sh text2video_tts_chinese.sh $1 $2 $3

$1: "input text"; $2: person; $3: gender (f for female, m for male)

Example 1. Test on VidTIMIT data with real audio.

sh text2video_audio.sh "She had your dark suit in greasy wash water all year." fadg0 f

Example 2. Test on VidTIMIT data with TTS audio.

sh text2video_tts.sh "She had your dark suit in greasy wash water all year." fadg0 f

Example 3. Test with Chinese female TTS audio.

sh text2video_tts_chinese.sh "正在为您查询合肥的天气情况。今天是2020年2月24日，合肥市今天多云，最低温度9摄氏度，最高温度15摄氏度，微风。" henan f
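The example invocations above can also be driven from Python, for instance when generating several videos in a batch. The wrapper below is a minimal sketch: `build_command` and `run` are hypothetical helpers, not part of the repository, and the argument order simply mirrors the three-argument examples above.

```python
import shlex
import subprocess
from typing import List


def build_command(script: str, text: str, person: str, gender: str) -> List[str]:
    """Assemble the argument list: $1 = input text, $2 = person, $3 = gender."""
    if gender not in ("f", "m"):
        raise ValueError("gender must be 'f' (female) or 'm' (male)")
    return ["sh", script, text, person, gender]


def run(cmd: List[str]) -> None:
    """Execute the command; call this from inside the Text2Video folder."""
    subprocess.run(cmd, check=True)


cmd = build_command(
    "text2video_tts.sh",
    "She had your dark suit in greasy wash water all year.",
    "fadg0",
    "f",
)
print(shlex.join(cmd))  # shows the equivalent, safely quoted shell command
```

Passing the arguments as a list avoids shell quoting problems with input text that contains spaces or punctuation.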

Training with your own data

Citation

Please cite our paper in your publications.

Sibo Zhang, Jiahong Yuan, Miao Liao, Liangjun Zhang. PDF Result Video

@article{zhang2021text2video,
  title={Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary},
  author={Zhang, Sibo and Yuan, Jiahong and Liao, Miao and Zhang, Liangjun},
  journal={arXiv preprint arXiv:2104.14631},
  year={2021}
}

Appendices

ARPABET

Acknowledgements

This code is based on the vid2vid framework.
