
Semi-FedSER: Semi-supervised Learning for Speech Emotion Recognition On Federated Learning using Multiview Pseudo-Labeling

This repository contains the official implementation (in PyTorch) of Semi-FedSER. If you have any questions, please email me at tiantiaf@gmail.com or tiantiaf@usc.edu.

Speech Features

We extract a variety of speech representations using the OpenSMILE toolkit and pretrained models. You can refer to the OpenSMILE and SUPERB papers for more information; a short extraction example follows the table below.

Below is a list of the features we include in the current experiment:

| Publication Date | Model | Name | Paper | Input | Stride | Pre-train Data | Official Repo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| --- | EmoBase | --- | MM'10 | Speech | --- | --- | EmoBase |
| 5 Apr 2019 | APC | apc | arxiv | Mel | 10ms | LibriSpeech-360 | APC |
| 12 Jul 2020 | TERA | tera | arxiv | Mel | 10ms | LibriSpeech-960 | S3PRL |
| 11 Dec 2020 | DeCoAR 2.0 | decoar2 | arxiv | Mel | 10ms | LibriSpeech-960 | speech-representations |
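As a concrete example, the EmoBase functionals in the first row can be extracted with the opensmile Python package. This is a minimal sketch, not the repository's extraction script; the audio path is illustrative:

```python
# Minimal sketch: extracting EmoBase functionals with the opensmile
# Python package (pip install opensmile). The audio path is a placeholder.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("speech_sample.wav")  # returns a pandas DataFrame
print(features.shape)
```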

Federated Learning Overview

Let's recap the basics of FL; a minimal code sketch of one training round follows the figure below.

  1. In a typical FL training round, shown in the figure below, a subset of selected clients receives the global model, which each client then trains locally on its own private data.

  2. Afterward, the clients share only their model updates (model parameters/gradients) with the central server.

  3. Finally, the server aggregates the model updates to obtain the global model for the next training round.

*Figure: overview of a typical FL training round.*
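For orientation, here is a minimal FedAvg-style sketch of the three steps above in PyTorch. The model, data loaders, and hyperparameters are placeholders; this is a simplified illustration, not the Semi-FedSER training code:

```python
# Illustrative FedAvg round: local training + parameter averaging.
import copy
import torch

def local_update(global_model, loader, epochs=1, lr=0.01):
    """Step 1: a selected client trains the received global model locally."""
    model = copy.deepcopy(global_model)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    # Step 2: only model parameters leave the client, never the raw data.
    return model.state_dict()

def aggregate(client_states):
    """Step 3: the server averages client updates into the next global model."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in client_states]).mean(dim=0)
    return avg
```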

Challenge in FL

One major challenge in FL is that high-quality labeled data samples are often scarce, and most data samples at the clients are unlabeled. To address this, Semi-FedSER trains the model using both labeled and unlabeled data samples at each local client.

Semi-FedSER also incorporates pseudo-labeling based on the idea of multiview pseudo-labeling, and we adopt an efficient yet effective data augmentation technique called Stochastic Feature Augmentation (SFA). The pseudo-labeling algorithm is shown below.
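To make the idea concrete, below is a hedged sketch of multiview pseudo-labeling with a hypothetical SFA implementation (random per-channel scaling and shifting of features); the exact augmentation form, confidence threshold, and loss weighting used in Semi-FedSER may differ:

```python
import torch
import torch.nn.functional as F

def sfa(x, sigma=0.1):
    """Hypothetical SFA: random per-channel scaling and shifting of features."""
    a = 1.0 + sigma * torch.randn(x.size(-1))
    b = sigma * torch.randn(x.size(-1))
    return a * x + b

def multiview_pseudo_labels(model, x_unlab, views=2, threshold=0.9):
    """Average predictions over several augmented views; keep confident labels."""
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(sfa(x_unlab)), dim=-1) for _ in range(views)]
        ).mean(dim=0)
    conf, labels = probs.max(dim=-1)
    return labels, conf >= threshold

def semi_supervised_loss(model, x_lab, y_lab, x_unlab, lam=1.0):
    """Supervised loss on labeled data plus pseudo-label loss on unlabeled data."""
    sup = F.cross_entropy(model(x_lab), y_lab)
    y_hat, mask = multiview_pseudo_labels(model, x_unlab)
    if mask.any():
        return sup + lam * F.cross_entropy(model(x_unlab)[mask], y_hat[mask])
    return sup
```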

To further address the gradient-drift issue in the non-IID setting of FL, we add an implementation of SCAFFOLD. The final training algorithm is shown below.
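Below is a hedged sketch of the SCAFFOLD-corrected local step (Karimireddy et al., 2020), assuming `c_server` and `c_client` are lists of tensors holding the server and client control variates; it illustrates the correction idea rather than the repository's exact implementation:

```python
import torch

def scaffold_local_step(model, loss, lr, c_server, c_client):
    """One local SGD step with SCAFFOLD's control-variate correction."""
    loss.backward()
    with torch.no_grad():
        for p, c, c_i in zip(model.parameters(), c_server, c_client):
            # Correct the local gradient with (c - c_i) to counter client drift.
            p -= lr * (p.grad + c - c_i)
            p.grad = None
```

After local training, each client also refreshes its control variate and sends both the model delta and the control-variate delta back to the server for aggregation.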

Final SER Performance

*Figure: final SER performance of Semi-FedSER.*

Other References

OpenSMILE

@inproceedings{eyben2010opensmile,
  title={Opensmile: the {M}unich versatile and fast open-source audio feature extractor},
  author={Eyben, Florian and W{\"o}llmer, Martin and Schuller, Bj{\"o}rn},
  booktitle={Proceedings of the 18th ACM international conference on Multimedia},
  pages={1459--1462},
  year={2010}
}

SUPERB

@inproceedings{yang21c_interspeech,
  author={Shu-wen Yang and Po-Han Chi and Yung-Sung Chuang and Cheng-I Jeff Lai and Kushal Lakhotia and Yist Y. Lin and Andy T. Liu and Jiatong Shi and Xuankai Chang and Guan-Ting Lin and Tzu-Hsien Huang and Wei-Cheng Tseng and Ko-tik Lee and Da-Rong Liu and Zili Huang and Shuyan Dong and Shang-Wen Li and Shinji Watanabe and Abdelrahman Mohamed and Hung-yi Lee},
  title={{SUPERB: Speech Processing Universal PERformance Benchmark}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={1194--1198},
  doi={10.21437/Interspeech.2021-1775}
}

Image credit: OpenMoji


License

MIT License

