Deepdubpy

A complete end-to-end Deep Learning system to generate high quality human like speech in English for Korean Drama. (WIP)

Status

Check the Projects tab.

What am I doing here?

There are various steps I came up with, shown in the Deepdubpy overview diagram and described below.

Step 0: Preprocessing subtitles to get sentences

The project relies heavily on subtitles for the dubbing procedure to work, i.e., the subs should match the intended audio in the video file. If they don't, use the shift parameter of the DeepdubSentence constructor to align them. These sentences (stored in sentence_df) are used to create audio segments in step 1.
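A minimal sketch of what this preprocessing might look like, assuming an .srt subtitle file and using pysrt; the repo's own DeepdubSentence class presumably wraps similar logic, and the function and column names below are illustrative:

```python
# Sketch only: parse subtitles into a sentence_df with start/end timestamps.
# The DataFrame columns and the helper name are assumptions.
import pysrt
import pandas as pd

def build_sentence_df(srt_path: str, shift_seconds: float = 0.0) -> pd.DataFrame:
    subs = pysrt.open(srt_path)
    if shift_seconds:
        # Align the subs with the audio track when they are offset.
        subs.shift(seconds=shift_seconds)
    rows = [
        {
            "start": sub.start.ordinal / 1000.0,  # seconds
            "end": sub.end.ordinal / 1000.0,
            "sentence": sub.text.replace("\n", " ").strip(),
        }
        for sub in subs
    ]
    return pd.DataFrame(rows)

sentence_df = build_sentence_df("episode01.en.srt", shift_seconds=1.5)
```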

Step 1: Generating audio segments

The sentence_df can then be used to create audio segments, giving a fairly accurate mapping of sentences to spoken audio. We also create segments that do not contain any spoken sentence/dialog according to the preprocessed subtitles. Each segment is written to <hash>.wav, where the hash is computed from the start and end timestamps of the sentence in sentence_df. All of these file names are written to audio_segments_list.txt so that the generated audio and the non-dialog segments can be concatenated back together later. The audio_df dataframe stores information about every audio segment, including its exact start and end timestamps.
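A rough sketch of how segments could be cut and tracked with pydub; the exact hashing scheme (MD5 over the start/end timestamps) and column names are assumptions, the text above only states that segment names are hashes of the timestamps:

```python
# Sketch only: cut speech segments out of the episode audio with pydub.
# Non-dialog gaps between sentences would be exported the same way
# (has_speech=False) and appended to the same list so the full timeline
# can be rebuilt later.
import hashlib
import pandas as pd
from pydub import AudioSegment

audio = AudioSegment.from_file("episode01.wav")

def segment_name(start: float, end: float) -> str:
    return hashlib.md5(f"{start}-{end}".encode()).hexdigest()

rows, file_names = [], []
for row in sentence_df.itertuples():
    name = segment_name(row.start, row.end)
    # pydub slices in milliseconds.
    audio[int(row.start * 1000):int(row.end * 1000)].export(f"{name}.wav", format="wav")
    rows.append({"hash": name, "start": row.start, "end": row.end, "has_speech": True})
    file_names.append(f"{name}.wav")

audio_df = pd.DataFrame(rows)
with open("audio_segments_list.txt", "w") as f:
    f.write("\n".join(file_names))
```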

Step 2: Source separation/separating accompaniments and vocals.

The background sound effects/accompaniments would act as noise for step 3 (and possibly step 4). This problem is solved with a source separation technique (using Spleeter), splitting each original speech-containing audio segment into <hash>_vocals.wav and <hash>_accompaniments.wav. This step is performed only for the audio segments containing known speech (i.e., based on sentence_df); audio segments that don't contain any speech are left untouched, completely retaining their background sound effects.
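A sketch of the separation step using Spleeter's documented Python API; copying the stems out to <hash>_vocals.wav / <hash>_accompaniments.wav is an assumption about this repo's naming (Spleeter itself writes vocals.wav and accompaniment.wav into a per-file sub-directory):

```python
# Sketch using Spleeter's Python API (2 stems = vocals + accompaniment).
import shutil
from pathlib import Path
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")

for row in audio_df[audio_df.has_speech].itertuples():
    separator.separate_to_file(f"{row.hash}.wav", "separated/")
    # Spleeter writes separated/<hash>/vocals.wav and accompaniment.wav;
    # copy them out under the per-segment naming assumed above.
    stem_dir = Path("separated") / row.hash
    shutil.copy(stem_dir / "vocals.wav", f"{row.hash}_vocals.wav")
    shutil.copy(stem_dir / "accompaniment.wav", f"{row.hash}_accompaniments.wav")
```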

Step 3: Clustering audio segments for speaker diarization

We don't know who spoke a particular audio segment just from the subtitles. We need to label the audio segments so that each one can be dubbed in that particular speaker's voice. For this I apply clustering to speaker embeddings of the audio segments, producing speaker labels.
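A sketch of embedding and clustering; the Deep Speaker imports and checkpoint name follow the upstream Deep Speaker README and may need adjusting for this repo's vendored deep_speaker/ package, and the number of speakers is supplied by hand here:

```python
# Sketch: embed each vocals segment with Deep Speaker, then cluster the
# embeddings into speaker labels. The checkpoint path and the number of
# speakers (4) are hand-picked assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

from deep_speaker.audio import read_mfcc
from deep_speaker.batcher import sample_from_mfcc
from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
from deep_speaker.conv_models import DeepSpeakerModel

model = DeepSpeakerModel()
model.m.load_weights("pretrained_models/ResCNN_triplet_training_checkpoint_265.h5",
                     by_name=True)

def embed(wav_path: str) -> np.ndarray:
    """512-d speaker embedding of one audio segment."""
    mfcc = sample_from_mfcc(read_mfcc(wav_path, SAMPLE_RATE), NUM_FRAMES)
    return model.m.predict(np.expand_dims(mfcc, axis=0))[0]

speech_df = audio_df[audio_df.has_speech]
embeddings = np.stack([embed(f"{h}_vocals.wav") for h in speech_df["hash"]])
labels = AgglomerativeClustering(n_clusters=4).fit_predict(embeddings)
audio_df.loc[speech_df.index, "speaker"] = labels
```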

Step 4: Voice Reproduction

From the previous step we know which audio segment is spoken by which speaker. We can use those speech segments for voice adaptation to that particular speaker, generating speech (<hash>_gen.wav) with a TTS (Text-To-Speech) model and the preprocessed subs (sentences).
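The repo does not name the TTS model, so as an illustration only, here is how a voice-cloning TTS such as Coqui TTS's XTTS could generate <hash>_gen.wav conditioned on a speaker's own segments:

```python
# Illustration only: the repo does not specify the TTS model. Coqui TTS's
# XTTS voice cloning is used as a stand-in; one of the speaker's own vocal
# segments serves as the cloning reference.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

speech_df = audio_df[audio_df.has_speech].reset_index(drop=True)
for i, row in speech_df.iterrows():
    # Any vocals segment already assigned to this speaker can be the
    # voice-adaptation reference.
    reference = speech_df[speech_df.speaker == row.speaker]["hash"].iloc[0] + "_vocals.wav"
    tts.tts_to_file(
        text=sentence_df["sentence"].iloc[i],  # preprocessed subtitle sentence
        speaker_wav=reference,
        language="en",
        file_path=f"{row['hash']}_gen.wav",
    )
```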

Step 5: Accompaniments Overlay and Concatenation of audio segments.

The generated speech (<hash>_gen.wav) is overlaid with the accompaniments (<hash>_accompaniments.wav). This ensures that we have speech in the intended language while the original sound effects are preserved. Finally, we use audio_segments_list.txt to concatenate the audio segments back together and produce the final output audio.
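A sketch of the overlay and concatenation with pydub; writing the mixed result back to <hash>.wav so that audio_segments_list.txt still resolves every segment is an assumption about the naming scheme:

```python
# Sketch of the final mix and stitch with pydub.
from pydub import AudioSegment

# Mix the generated speech with the preserved accompaniment track.
for row in audio_df[audio_df.has_speech].itertuples():
    speech = AudioSegment.from_wav(f"{row.hash}_gen.wav")
    accompaniment = AudioSegment.from_wav(f"{row.hash}_accompaniments.wav")
    speech.overlay(accompaniment).export(f"{row.hash}.wav", format="wav")

# Concatenate every segment (dubbed and untouched) back in timeline order.
with open("audio_segments_list.txt") as f:
    segment_files = [line.strip() for line in f if line.strip()]

final = sum((AudioSegment.from_wav(name) for name in segment_files),
            AudioSegment.empty())
final.export("final_output.wav", format="wav")
```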

Want to Contribute?

Look into the issues. You can begin with issues tagged good first issue, or if you want to suggest something else, open a new issue.


  1. This project uses Spleeter for source separation:
@article{spleeter2020,
  doi = {10.21105/joss.02154},
  url = {https://doi.org/10.21105/joss.02154},
  year = {2020},
  publisher = {The Open Journal},
  volume = {5},
  number = {50},
  pages = {2154},
  author = {Romain Hennequin and Anis Khlif and Felix Voituret and Manuel Moussallam},
  title = {Spleeter: a fast and efficient music source separation tool with pre-trained models},
  journal = {Journal of Open Source Software},
  note = {Deezer Research}
}

Install dependencies for Spleeter:

conda install -c conda-forge ffmpeg libsndfile
pip install spleeter
  2. This project also uses Deep Speaker for speaker identification. Install its requirements with:
pip install -r deep_speaker/requirements.txt

Download the pretrained model weights from here or from here and put them in the ./pretrained_models folder of the current directory.


License

GNU General Public License v2.0

