🦜 Mockingjay
Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders
PyTorch Official Implementation
- This is an open source project for Mockingjay, an unsupervised algorithm for learning speech representations introduced in the paper "Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders", which was accepted as a lecture at ICASSP 2020.
- We compare our speech representations with the APC and CPC approaches, evaluating on 3 downstream tasks: phone classification, speaker recognition, and sentiment classification on spoken content.
- Feel free to use or modify the code; any bug reports or improvement suggestions are appreciated. If you have any questions, please contact r07942089@ntu.edu.tw. If you find this project helpful for your research, please consider citing this paper. Thanks!
- Below we illustrate the proposed Masked Acoustic Model pre-training task, where 15% of the input frames are masked to zero at random during training. This is reminiscent of the Masked Language Model task of BERT-style pre-training from the NLP community.
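The masking step described above can be sketched in a few lines of PyTorch. This is only a minimal illustration of zeroing roughly 15% of the frames at random; the function name mask_frames and its arguments are assumptions made for this sketch, and the actual pre-training code in this repo may select and mask frames differently.

```python
import torch


def mask_frames(spec, mask_prob=0.15):
    """Zero out randomly chosen frames of a batch of spectrograms.

    spec: tensor of shape (batch_size, seq_len, mel_dim).
    Returns the masked copy and the boolean mask of shape (batch_size, seq_len).
    NOTE: hypothetical helper for illustration, not the repo's implementation.
    """
    # Draw one uniform number per frame; frames below mask_prob get masked.
    mask = torch.rand(spec.shape[:2], device=spec.device) < mask_prob
    masked = spec.clone()   # leave the original input untouched
    masked[mask] = 0.0      # set every selected frame to all zeros
    return masked, mask


spec = torch.ones(3, 800, 160)
masked, mask = mask_frames(spec)
print(masked.shape, mask.float().mean().item())  # fraction of masked frames, close to 0.15
```

The returned mask marks which frames were hidden, so a reconstruction loss can be restricted to those positions if desired.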
Results
- We provide further frame-wise phone classification results, which were not included in our paper, comparing with the "Contrastive Predictive Coding, CPC" method, using the same phone labels and train/test split as provided in the CPC paper.
- We pre-train Mockingjay on the 100hr subset of LibriSpeech, same as CPC.
- There are 41 possible phone classes. Phone classification results on LibriSpeech:
Features | Pre-train | Linear Classifier | 1 Hidden Classifier |
---|---|---|---|
MFCC | None | 39.7 | |
CPC | 100 hr | 64.6 | 72.5 |
BASE (Ours) | 100 hr | 64.3 | 76.8 |
BASE (Ours) | 360 hr | 66.4 | 77.0 |
BASE (Ours) | 960 hr | 67.0 | 79.1 |
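For readers unfamiliar with this kind of evaluation, the two classifier columns can be sketched as frame-wise probes on top of fixed representations. This is a rough sketch, not the evaluation code behind the table: the 768-dimensional input follows the fine-tuning example in this README, while HIDDEN_DIM, the ReLU activation, and the random stand-in data are assumptions.

```python
import torch
import torch.nn as nn

NUM_PHONES = 41   # number of phone classes stated above
REP_DIM = 768     # representation size used in the fine-tuning example
HIDDEN_DIM = 512  # hypothetical hidden size, not taken from the paper

# "Linear Classifier": a single linear layer applied to every frame.
linear_probe = nn.Linear(REP_DIM, NUM_PHONES)

# "1 Hidden Classifier": one hidden layer before the output layer.
one_hidden_probe = nn.Sequential(
    nn.Linear(REP_DIM, HIDDEN_DIM),
    nn.ReLU(),
    nn.Linear(HIDDEN_DIM, NUM_PHONES),
)

# Random stand-ins for extracted representations and frame-level phone labels.
reps = torch.randn(3, 200, REP_DIM)               # (batch_size, seq_len, hidden_size)
labels = torch.randint(0, NUM_PHONES, (3, 200))   # (batch_size, seq_len)

logits = one_hidden_probe(reps)                   # (batch_size, seq_len, NUM_PHONES)
loss = nn.functional.cross_entropy(logits.reshape(-1, NUM_PHONES), labels.reshape(-1))
accuracy = (logits.argmax(dim=-1) == labels).float().mean().item()
```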
Highlight
Pre-trained Models
You can find pre-trained models here:
http://bit.ly/result_mockingjay
Their usage is explained below and further in Step 3 of the Instructions section.
Extracting features or fine-tuning with your own downstream models (RECOMMENDED)
With this repo and the trained models, you can fine-tune the pre-trained Mockingjay model on your own dataset and tasks (important: the input acoustic features must use the same preprocessing settings!!!). To do so, use the wrapper class in nn_mockingjay.py, and take a look at the following example Python code (example_extract_finetune.py):
import torch
from mockingjay.nn_mockingjay import MOCKINGJAY
from downstream.model import example_classifier
from downstream.solver import get_mockingjay_optimizer
# setup the mockingjay model
options = {
'ckpt_file' : 'result/result_mockingjay/mockingjay_libri_sd1337_MelBase/mockingjay-500000.ckpt',
'load_pretrain' : 'True',
'no_grad' : 'False',
'dropout' : 'default'
}
model = MOCKINGJAY(options=options, inp_dim=160)
# setup your downstream class model
classifier = example_classifier(input_dim=768, hidden_dim=128, class_num=2).cuda()
# construct the Mockingjay optimizer
params = list(model.named_parameters()) + list(classifier.named_parameters())
optimizer = get_mockingjay_optimizer(params=params, lr=4e-3, warmup_proportion=0.7, training_steps=50000)
# forward
example_inputs = torch.zeros(1200, 3, 160) # A batch of spectrograms: (time_step, batch_size, dimension)
reps = model(example_inputs) # returns: (time_step, batch_size, hidden_size)
reps = reps.permute(1, 0, 2) # change to: (batch_size, time_step, feature_size)
labels = torch.LongTensor([0, 1, 0]).cuda()
loss = classifier(reps, labels)
# update
loss.backward()
optimizer.step()
# save
PATH_TO_SAVE_YOUR_MODEL = 'example.ckpt'
states = {'Classifier': classifier.state_dict(), 'Mockingjay': model.state_dict()}
torch.save(states, PATH_TO_SAVE_YOUR_MODEL)
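To reuse a checkpoint saved this way, the two state dicts can be loaded back into freshly constructed modules. The sketch below shows the round trip with small nn.Linear stand-ins so that it runs on its own; in practice the stand-ins would be replaced by the MOCKINGJAY wrapper and your classifier, built with the same arguments as at save time.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in modules (hypothetical): replace with MOCKINGJAY(...) and your classifier.
upstream = nn.Linear(160, 768)
classifier = nn.Linear(768, 2)

# Save with the same dictionary keys as in the example above.
path = os.path.join(tempfile.mkdtemp(), 'example.ckpt')
states = {'Classifier': classifier.state_dict(), 'Mockingjay': upstream.state_dict()}
torch.save(states, path)

# Later: rebuild the modules, then restore their weights from the checkpoint.
restored_upstream = nn.Linear(160, 768)
restored_classifier = nn.Linear(768, 2)
loaded = torch.load(path, map_location='cpu')  # load onto CPU regardless of the saving device
restored_upstream.load_state_dict(loaded['Mockingjay'])
restored_classifier.load_state_dict(loaded['Classifier'])
```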
Extracting Speech Representations with Solver
With this repo and the trained models, you can extract speech representations from your target dataset (important: the input acoustic features must use the same preprocessing settings!!!). To do so, feed-forward the trained model on the target dataset and retrieve the extracted features by running the following example Python code (example_solver.py):
import torch
from runner_mockingjay import get_mockingjay_model
example_path = 'result/result_mockingjay/mockingjay_libri_sd1337_LinearLarge/mockingjay-500000.ckpt'
mockingjay = get_mockingjay_model(from_path=example_path)
# A batch of spectrograms: (batch_size, seq_len, mel_dim)
spec = torch.zeros(3, 800, 160)
# reps.shape: (batch_size, num_hidden_layers, seq_len, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=True, tile=True)
# reps.shape: (batch_size, num_hidden_layers, seq_len // downsample_rate, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=True, tile=False)
# reps.shape: (batch_size, seq_len, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=False, tile=True)
# reps.shape: (batch_size, seq_len // downsample_rate, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=False, tile=False)
spec is the input spectrogram of the mockingjay model, where:
- spec needs to be a PyTorch tensor with a shape of (seq_len, mel_dim) or (batch_size, seq_len, mel_dim).
- mel_dim is the spectrogram feature dimension, which by default is mel_dim == 160; see utility/audio.py for more preprocessing details.
reps is a PyTorch tensor of various possible shapes, where:
- batch_size is the inference batch size.
- num_hidden_layers is the transformer encoder depth of the mockingjay model.
- seq_len is the maximum sequence length in the batch.
- downsample_rate is the number of stacked consecutive feature vectors used to reduce the length of input sequences.
- hidden_size is the dimensionality of the transformer encoder layers.
The output shape of reps is determined by two arguments:
- all_layers is a boolean which controls whether to output all the Encoder layers; if False, only the hidden representations of the last Encoder layer are returned.
- tile is a boolean which controls whether to tile representations to match the input seq_len of spec.
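The shape arithmetic behind tile can be sketched as follows, reading downsample_rate as the number of consecutive frames stacked into one model input, which is what the seq_len // downsample_rate shapes above imply. Both helper names are assumptions made for this sketch, and so is the way leftover frames are trimmed; the repo may pad or handle remainders differently.

```python
import torch


def stack_frames(spec, downsample_rate):
    """Stack consecutive frames: (batch, seq_len, dim) -> (batch, seq_len // r, dim * r).

    Hypothetical helper; trailing frames that do not fill a full group are dropped.
    """
    batch, seq_len, dim = spec.shape
    trimmed = seq_len - seq_len % downsample_rate
    return spec[:, :trimmed].reshape(batch, trimmed // downsample_rate, dim * downsample_rate)


def tile_reps(reps, downsample_rate):
    """Repeat each representation r times along time: (batch, short_len, hidden) -> (batch, short_len * r, hidden)."""
    return reps.repeat_interleave(downsample_rate, dim=1)


downsample_rate = 3                                   # assumed value for illustration
spec = torch.zeros(3, 900, 160)
stacked = stack_frames(spec, downsample_rate)         # shorter sequence, wider frames
reps = torch.zeros(3, stacked.shape[1], 768)          # stand-in for encoder outputs
tiled = tile_reps(reps, downsample_rate)              # back to one vector per input frame
print(stacked.shape, tiled.shape)
```

With tile=True each input frame gets its own representation vector, which is convenient for frame-level labels; with tile=False the shorter sequence avoids the redundant copies.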
As you can see, reps is essentially the Transformer Encoder hidden representations in the mockingjay model. You can think of Mockingjay as a speech version of BERT if you are familiar with it.
There are many ways to incorporate reps into your downstream task. One of the easiest ways is to take only the outputs of the last Encoder layer (i.e., all_layers=False) as the input features to your downstream model. Feel free to explore other mechanisms.
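As one example of such an alternative, the outputs of all layers (all_layers=True) can be combined with a learnable weighted sum before being passed to the downstream model. This is a sketch of one possible mechanism rather than something provided by this repo; the class name WeightedLayerSum and the layer count of 12 in the usage lines are assumptions.

```python
import torch
import torch.nn as nn


class WeightedLayerSum(nn.Module):
    """Learn one scalar weight per encoder layer and sum the layers accordingly."""

    def __init__(self, num_layers):
        super().__init__()
        # Zero initialization gives uniform weights after the softmax.
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, reps):
        # reps: (batch_size, num_hidden_layers, seq_len, hidden_size)
        w = torch.softmax(self.weights, dim=0)
        # Broadcast the weights over batch, time, and feature dimensions.
        return (reps * w.view(1, -1, 1, 1)).sum(dim=1)  # (batch_size, seq_len, hidden_size)


reps = torch.randn(3, 12, 50, 768)   # stand-in for an all_layers=True output
pooled = WeightedLayerSum(num_layers=12)(reps)
print(pooled.shape)
```

The result has the same shape as the all_layers=False output, so it can be swapped in without changing the downstream model.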
Requirements
- Python 3
- PyTorch 1.3.0 or above
- Computing power (high-end GPU) and memory space (both RAM and GPU memory) are extremely important if you'd like to train your own model.
- Required packages and their use are listed below, and also in requirements.txt:
editdistance # error rate calculation
joblib # parallel feature extraction & decoding
librosa # feature extraction (for feature extraction only)
pydub # audio segmentation (for MOSEI dataset preprocessing only)
pandas # data management
tensorboardX # logger & monitor
torch # model & learning
tqdm # verbosity
yaml # config parser
matplotlib # visualization
ipdb # optional debugger
numpy # array computation
scipy # for feature extraction
The above packages can be installed by the command:
pip3 install -r requirements.txt
Below we list packages that need special attention; we recommend installing them manually:
apex # non-essential, faster optimization (only needed if enabled in config)
sentencepiece # sub-word unit encoding (for feature extraction only, see https://github.com/google/sentencepiece#build-and-install-sentencepiece for install instruction)
Instructions
Before you start, make sure all the required packages listed above are installed correctly.
Step 0. Preprocessing - Acoustic Feature Extraction & Text Encoding
See the Preprocess wiki page for preprocessing instructions.
Step 1. Configuring - Model Design & Hyperparameter Setup
All the parameters related to training/decoding are stored in a yaml file. Hyperparameter tuning and large numbers of experiments can be managed easily this way. See the config files for the exact format and examples.
Step 2. Training the Mockingjay Model for Speech Representation Learning
Once the config file is ready, run the following command to train unsupervised end-to-end Mockingjay:
python3 runner_mockingjay.py --train
All settings are parsed from the config file automatically to start training. The training log can be viewed through TensorBoard.
Step 3. Using Pre-trained Models on Downstream Tasks
Once a Mockingjay model has been trained, we can use the generated representations on downstream tasks. See the Experiments section for reproducing the downstream task results mentioned in our paper, and see the Highlight section for incorporating the extracted representations into your own downstream task.
Pre-trained models and their configs can be downloaded from the link given in the Pre-trained Models section.
To load with the default path, place the models under the directory given by --ckpdir=./result_mockingjay/ and specify the model file name manually with --ckpt=.
Step 4. Loading Pre-trained Models and Visualizing
Run the following commands to visualize the model-generated samples:
# visualize hidden representations
python3 runner_mockingjay.py --plot
# visualize spectrogram
python3 runner_mockingjay.py --plot --with_head
Note that the arguments --ckpdir=XXX --ckpt=XXX need to be set correctly for the above commands to run properly.
Step 5. Monitor Training Log
# open TensorBoard to see log
tensorboard --logdir=log/log_mockingjay/mockingjay_libri_sd1337/
# or
python3 -m tensorboard.main --logdir=log/log_mockingjay/mockingjay_libri_sd1337/
Experiments
Application on downstream tasks
See the instructions on the Downstream wiki page to reproduce our experiments.
Comparing with APC
See the instructions on the APC wiki page to reproduce our experiments. Comparison results are in our paper.
Comparing with CPC
See the instructions on the Downstream wiki page to reproduce our experiments. Comparison results are in the Results section above.
Reference
- Montreal Forced Aligner, McAuliffe et al.
- CMU MultimodalSDK, Amir Zadeh.
- PyTorch Transformers, Hugging Face.
- Autoregressive Predictive Coding, Yu-An Chung.
- Contrastive Predictive Coding, Aaron van den Oord.
- End-to-end ASR Pytorch, Alexander-H-Liu.
- Tacotron Preprocessing, Ryuichi Yamamoto (r9y9)
Citation
@misc{liu2019mockingjay,
title={Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders},
author={Andy T. Liu and Shu-wen Yang and Po-Han Chi and Po-chun Hsu and Hung-yi Lee},
year={2019},
eprint={1910.12638},
archivePrefix={arXiv},
primaryClass={eess.AS}
}