Speech-Driven Expression Blendshape Based on Single-Layer Self-attention Network

HelloWorld: 9th

Our final results are as follows.

result.mp4

For more contest details, please refer to the official website.

1. Data Process

The data provided here is incomplete. Due to copyright issues, please request the full dataset on the official website.

Distribution of speech frame lengths:

Run the following to install Montreal Forced Aligner (MFA) and download the Mandarin models:

conda create -n aligner -c conda-forge montreal-forced-aligner python=3.8
conda activate aligner
conda config --add channels conda-forge
conda install montreal-forced-aligner
mfa model download acoustic mandarin_mfa
mfa model download dictionary mandarin_mfa
mfa model inspect acoustic mandarin_mfa      # View the acoustic model

Then install the dependencies and run the processing script:

pip install -r requirement1.txt
python data_process/process.py

2. Train model

2.1 Dependencies

Python 3.7

conda create -n AIWIN python=3.7
conda activate AIWIN
cd <path to your project>
pip install -r requirements.txt

2.2 Make lmdb data

python My/scripts/aiwin_dataset_to_lmdb.py ./data

Copy the printed mean and standard deviation into Tri/config/multimodal_context.yml, for example:

data_mean: [0.07876, 0.00280, 0.01174, 0.18354, 0.10486, 0.16363, 0.10860, 0.00205, 0.01784, 0.22835, 0.22417, 0.00615, 0.00558, 0.06443, 0.06593, 0.18330, 0.17782, 0.06199, 0.04290, 0.04572, 0.19684, 0.03967, 0.03928, 0.29169, 0.29800, 0.05240, 0.04886, 0.17750, 0.17757, 0.09945, 0.00002, 0.00002, 0.01264, 0.12944, 0.12708, 0.08526, 0.08594]
data_std: [0.04059, 0.00566, 0.01210, 0.11373, 0.09498, 0.11489, 0.10505, 0.01894, 0.02042, 0.14744, 0.14685, 0.01871, 0.01970, 0.02078, 0.02137, 0.05933, 0.05742, 0.04199, 0.04282, 0.02453, 0.08287, 0.00617, 0.00626, 0.16426, 0.16778, 0.02844, 0.02643, 0.04753, 0.04750, 0.04343, 0.00047, 0.00047, 0.00456, 0.04094, 0.04016, 0.01510, 0.01562]
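Statistics like these are typically used to standardize each blendshape channel before training and to undo the scaling at inference. The sketch below shows that idea in plain Python. It is an assumption about how the values are used, not the repository's code: the names channel_stats, normalize, and denormalize are hypothetical, and so is the small epsilon that guards against near-zero standard deviations (such as the 0.00047 entries above).

```python
import math

# Hypothetical helpers for illustration; not taken from the repository.

def channel_stats(frames):
    """Per-channel mean and population standard deviation of equal-length frames."""
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return mean, std

def normalize(frame, mean, std, eps=1e-8):
    """Standardize one frame channel by channel (eps is an assumed safeguard)."""
    return [(x - m) / (s + eps) for x, m, s in zip(frame, mean, std)]

def denormalize(frame, mean, std, eps=1e-8):
    """Invert normalize() to recover blendshape weights."""
    return [x * (s + eps) + m for x, m, s in zip(frame, mean, std)]

# Toy data with 2 channels instead of the 37 listed above.
frames = [[0.0, 0.2], [0.1, 0.4], [0.2, 0.6]]
mean, std = channel_stats(frames)
restored = denormalize(normalize(frames[0], mean, std), mean, std)
```

The round trip through normalize and denormalize should return the original frame up to floating-point error.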

If you encounter the following error:

(AIWIN) [yangsc21@mjrc-server11 AIWIN]$ python My/scripts/aiwin_dataset_to_lmdb.py ./data
Traceback (most recent call last):
  File "My/scripts/aiwin_dataset_to_lmdb.py", line 7, in <module>
    import pyarrow
  File "/ceph/home/yangsc21/anaconda3/envs/AIWIN/lib/python3.7/site-packages/pyarrow/__init__.py", line 49, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: libcrypt.so.1: cannot open shared object file: No such file or directory

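Locate the library with whereis libcrypt.so.1. If only libcrypt.so is present, link it under the expected name with ln /usr/lib/libcrypt.so libcrypt.so.1, then add it to the library search path: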

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"/usr/lib/libcrypt.so.1.0"
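To confirm whether the fix worked before rerunning the lmdb script, a quick check with Python's standard ctypes module may help. This is a minimal sketch; the helper name can_load is hypothetical.

```python
import ctypes

def can_load(libname):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        # Raised when the library is missing or cannot be opened.
        return False

# True means pyarrow should no longer fail on this library.
print(can_load("libcrypt.so.1"))
```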

2.3 Pretrained Model

mkdir <your_home_dir>/chinese-hubert-large

Download TencentGameMate/chinese-hubert-large (hosted on the Hugging Face Hub) from here and place it in <your_home_dir>/chinese-hubert-large.

2.4 Train

cd Tri/scripts
python train.py --config=<..your path/Tri/config/multimodal_context.yml>

2.5 Inference

python synthesize.py --ckpt_path "... your path/result/output_myfastdtw_batchfist_interpolate_normalize_dropout_data_decoder_val3_5_4_onehot/train_multimodal_context/multimodal_context_checkpoint_326.bin" --transcript_path "... your path/data/val/tsv/A10.tsv" --wav_path "... your path/data/val/wav/A10.wav"

3. PostProcess

Modify paths in data_process.

Run postprocess.py to smooth the output.
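The exact filter used by postprocess.py is not described here. As an illustration of what smoothing a blendshape sequence can look like, the sketch below applies a centered moving average to every channel. The function name moving_average and the window size are assumptions, not the repository's settings.

```python
def moving_average(frames, window=5):
    """Centered moving average over time, applied to every channel.

    The window shrinks near the start and end of the sequence, so the
    output has the same number of frames as the input.
    """
    half = window // 2
    n = len(frames)
    dims = len(frames[0])
    out = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        out.append(
            [sum(frames[k][d] for k in range(lo, hi)) / (hi - lo) for d in range(dims)]
        )
    return out

# A jittery single-channel sequence (hypothetical values).
noisy = [[0.0], [1.0], [0.0], [1.0], [0.0]]
smooth = moving_average(noisy, window=3)
```

A larger window removes more jitter but also flattens fast mouth movements, so the value is worth tuning against the validation set.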

Run postprocess_3.py to deflate the output.

Run postprocess_2.py to perform a weighted average of the results from multiple models.
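A weighted average over several models' outputs can be sketched as follows. The function name weighted_average, the example weights, and the choice to truncate to the shortest prediction are assumptions for illustration; postprocess_2.py may differ.

```python
def weighted_average(predictions, weights):
    """Blend per-model predictions (each a list of frames) using one weight per model."""
    if len(predictions) != len(weights):
        raise ValueError("one weight per model is required")
    total = float(sum(weights))
    # Assumption: if outputs differ in length, keep only the shared frames.
    n_frames = min(len(p) for p in predictions)
    dims = len(predictions[0][0])
    return [
        [
            sum(w * p[t][d] for p, w in zip(predictions, weights)) / total
            for d in range(dims)
        ]
        for t in range(n_frames)
    ]

# Two hypothetical model outputs with 2 channels and 2 frames each.
model_a = [[0.0, 1.0], [0.2, 0.8]]
model_b = [[1.0, 0.0], [0.6, 0.4]]
blended = weighted_average([model_a, model_b], weights=[3, 1])
```

Dividing by the sum of the weights keeps the blended values in the same range as the inputs, whatever scale the weights use.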

Run add_eye to add suitable eye expressions taken from the training and validation sets. Method: among the csv files in the training and validation sets whose frame count is greater than or equal to that of the generated blendshape sequence, find the one with the closest frame count, then truncate its eye motion to the generated frame count and use it as the added eye motion. Frame count in the training and validation sets:
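The selection rule for add_eye described above can be sketched in a few lines. The names pick_eye_source and add_eye_channels, the file names, and the channel indices are hypothetical. Taking the first frames of the source sequence is also an assumption about how the truncation is done.

```python
def pick_eye_source(generated_len, frame_counts):
    """Return the file with the smallest frame count that is >= generated_len."""
    candidates = {name: n for name, n in frame_counts.items() if n >= generated_len}
    if not candidates:
        raise ValueError("no source sequence is long enough")
    return min(candidates, key=lambda name: candidates[name])

def add_eye_channels(generated, source, eye_channels):
    """Copy the eye channels of the first len(generated) source frames."""
    out = [list(frame) for frame in generated]
    for t, frame in enumerate(out):
        for c in eye_channels:
            frame[c] = source[t][c]
    return out

# Hypothetical frame counts for three training/validation files.
frame_counts = {"A1.csv": 120, "A2.csv": 90, "A3.csv": 100}
chosen = pick_eye_source(95, frame_counts)

# Toy sequences: channel 0 is a mouth channel, channel 1 an eye channel.
generated = [[0.5, 0.0], [0.6, 0.0]]
source = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
with_eyes = add_eye_channels(generated, source, eye_channels=[1])
```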

4. Model performance

Average time to process 1s audio: 0.025s
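A figure like this can be measured by timing inference and dividing by the duration of the audio processed. The sketch below shows the calculation with a stand-in workload. The helper name seconds_per_audio_second is hypothetical, and the lambda stands in for a real call to the model.

```python
import time

def seconds_per_audio_second(run, audio_seconds, repeats=3):
    """Average wall-clock time of run() divided by the audio duration it covers."""
    start = time.perf_counter()
    for _ in range(repeats):
        run()
    elapsed = (time.perf_counter() - start) / repeats
    return elapsed / audio_seconds

# Stand-in workload; replace with inference on a clip of known length.
cost = seconds_per_audio_second(lambda: sum(range(10000)), audio_seconds=4.0)
```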

5. Visualization

Download an .fbx face model (e.g. one made with an iPhone), then use blender.py, which runs in Blender, to visualize the blendshape .csv file like this:

Visualization.mp4

The final video rendering (like 申䒕雅) used for the subjective evaluation is generated by the organizer from the .csv blendshape file. It is not shown here due to copyright issues.

6. Conclusion

As you can see, our model is fairly simple; careful analysis and processing of the data alone can improve the results significantly. Please feel free to contact me (yangsc21@mails.tsinghua.edu.cn) with any questions or concerns.

7. [new!] Pretrained model

Please see ./result/.
