# Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features
This fork of the original MOSA-Net Cross-Domain repo implements a number of improvements to evaluate the two pretrained models, which predict VoiceMOS and {PESQ, SDI, STOI}.
The changes and light refactoring include:

- Move Python `2.7` & `3.6` (fairseq) code to `python3.7`, modified to do inference on `cpu`
- Update strictly-required dependencies for inference, captured in a `requirements.txt`
- Move essential code to a new `src` directory
- Move all pretrained models to `pretrained_models`
- Streamline evaluation of long files with a simple CLI tool that segments audio into utterances, then computes features (spectrogram, waveform, HuBERT) and runs inference on each utterance with both models
Tested under `python3.7`; install dependencies with `pip install -r requirements.txt`.
`io_mosanet.py` and `io_mosanet_crossdomain.py` evaluate VoiceMOS and {PESQ, SDI, STOI}, respectively, using HuBERT features extracted with `io_extract_hubert.py`.
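Alongside the HuBERT embeddings, the models consume the other two cross-domain inputs: the raw waveform and a magnitude spectrogram. The sketch below illustrates the spectrogram computation in plain NumPy; the function name and parameter values (`n_fft=512`, `hop=256`, Hann window) are illustrative assumptions, not the repo's exact settings.

```python
import numpy as np

def magnitude_spectrogram(wav, n_fft=512, hop=256):
    """STFT magnitude spectrogram (Hann window) of a 1-D waveform.

    Illustrative sketch of the spectrogram feature; the repo's actual
    STFT parameters may differ.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    # Slice the waveform into overlapping windowed frames
    frames = np.stack([wav[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Real FFT per frame -> (n_frames, n_fft // 2 + 1) magnitude bins
    return np.abs(np.fft.rfft(frames, axis=1))
```

One second of 16 kHz audio would yield a `(61, 257)` feature matrix with these settings.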
To evaluate longer audio files (e.g. 30 min), the CLI tool first segments the audio into utterances with VAD and then computes all metrics on each utterance.
Usage: `iorife_dialog_intel_cli.py [path-to-audio-file]`
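The VAD-then-segment step can be sketched with a toy energy-based detector. This is only an illustration of the idea; the function name, thresholds, and frame sizes here are hypothetical, and the actual CLI tool may use a different VAD entirely.

```python
import numpy as np

def segment_utterances(wav, sr, frame_ms=30, energy_thresh=1e-4, min_gap_frames=10):
    """Toy energy-based VAD: split a waveform into (start, end) sample spans.

    Hypothetical sketch of the CLI's segmentation step: frames whose mean
    energy exceeds `energy_thresh` are voiced; a run of `min_gap_frames`
    silent frames closes the current utterance.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(wav) // frame_len
    frames = wav[: n_frames * frame_len].reshape(n_frames, frame_len)
    voiced = (frames ** 2).mean(axis=1) > energy_thresh

    segments, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i          # first voiced frame opens an utterance
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:  # long enough silence ends it
                segments.append((start * frame_len, (i - gap + 1) * frame_len))
                start, gap = None, 0
    if start is not None:              # flush a trailing open utterance
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```

Each returned span could then be fed through the feature extraction and both models, mirroring the per-utterance loop described above.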
Please cite the original authors' paper:
R. E. Zezario, S. -W. Fu, F. Chen, C. -S. Fuh, H. -M. Wang and Y. Tsao, "Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 54-70, 2023, doi: 10.1109/TASLP.2022.3205757.
The Self-Attention, SincNet, and self-supervised learning model implementations were created by others.