Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Paper Poster | PDF Paper | HTML Paper | Citation

This repository contains the core implementation of the Cross-Speaker Encoding (CSE) and CSE-SOT networks.

Requirements

ESPnet and its required dependencies
Additional packages used for scoring can be found in ./scoring/requirements.txt

Usage

To use this code, please:

Replace the original ESPnet code with the code under the ./espnet2-patch directory
Run an ESPnet ASR recipe (refer to the LibriSpeech recipe) using the configurations under the ./config directory
After ESPnet scoring, additionally perform permutation-invariant scoring with ./scoring/run_pi_scoring.sh
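
For readers unfamiliar with permutation-invariant scoring, the idea can be sketched as follows. This is a minimal illustration and not the code in ./scoring/run_pi_scoring.sh: the function names `edit_distance` and `pi_wer` are hypothetical, and the sketch assumes whitespace-tokenized transcripts with the same number of reference and hypothesis streams. Because the order of the output streams in multi-talker ASR is arbitrary, every assignment of hypotheses to references is tried and the one with the fewest word errors is kept.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pi_wer(refs, hyps):
    """Permutation-invariant WER over one mixture (hypothetical helper).

    refs and hyps are lists of transcript strings, one per speaker stream.
    """
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of streams")
    ref_tokens = [r.split() for r in refs]
    hyp_tokens = [h.split() for h in hyps]
    total_ref_words = sum(len(r) for r in ref_tokens)
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    # Try every assignment of hypothesis streams to references, keep the best.
    best_errors = min(
        sum(edit_distance(r, hyp_tokens[k]) for r, k in zip(ref_tokens, perm))
        for perm in permutations(range(len(hyp_tokens)))
    )
    return best_errors / total_ref_words


if __name__ == "__main__":
    refs = ["hello world", "good morning everyone"]
    hyps = ["good morning everyone", "hello word"]  # streams in swapped order
    print(pi_wer(refs, hyps))
```

The exhaustive search grows factorially with the number of speakers, which is acceptable for two or three talkers; for larger counts an assignment solver would be a more suitable choice.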

./run.sh provides a runnable demo with further details. Please note that this code was developed against ESPnet version 202209 and may be incompatible with later versions.
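
Since the patch targets a specific ESPnet release, it can help to confirm the installed version before replacing any files. The following is a small sketch, not part of this repository: the helper names are hypothetical, and it assumes ESPnet was installed as the `espnet` Python distribution so that its version is visible through package metadata.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_espnet_version():
    """Return the installed ESPnet version string, or None if it is not installed."""
    try:
        return version("espnet")
    except PackageNotFoundError:
        return None


def matches_expected(installed, expected="202209"):
    """True when the installed version string starts with the expected release tag."""
    return installed is not None and installed.startswith(expected)


if __name__ == "__main__":
    found = installed_espnet_version()
    if matches_expected(found):
        print(f"ESPnet {found} matches the version this code was developed against.")
    else:
        print(f"Found ESPnet version {found!r}; the patch may not apply cleanly.")
```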

Citation

If you find our work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation.

@article{kang2024cross,
  title={Cross-Speaker Encoding Network for Multi-Talker Speech Recognition},
  author={Kang, Jiawen and Meng, Lingwei and Cui, Mingyu and Guo, Haohan and Wu, Xixin and Liu, Xunying and Meng, Helen},
  journal={arXiv preprint arXiv:2401.04152},
  year={2024}
}

Contact

Feel free to contact me if you have any questions.

Acknowledgements

This repository is based on the ESPnet speech processing toolkit, version 202209.
