moncefbenaicha / SpokenNER

Spoken NER implementation based on Wav2Vec2-XLS-R with experiments on transfer learning

Exploring Spoken Named Entity Recognition: A Cross-Lingual Perspective

Git Repo for the paper source code and data
Paper · Data · Contact

Table of Contents
  1. Paper Abstract
  2. About This Repo
  3. Data
  4. Usage
  5. Contact
  6. Acknowledgments
  7. Paper Citation

Paper Abstract

Recent advancements in Named Entity Recognition (NER) have significantly improved the identification of entities in textual data. However, spoken NER, a specialized field of spoken document retrieval, lags behind due to its limited research and scarce datasets. Moreover, cross-lingual transfer learning in spoken NER has remained unexplored. This paper utilizes transfer learning across Dutch, English, and German using pipeline and End-to-End (E2E) schemes. We employ Wav2Vec2-XLS-R models on custom pseudo-annotated datasets and investigate several architectures for the adaptability of cross-lingual systems. Our results demonstrate that End-to-End spoken NER outperforms pipeline-based alternatives over our limited annotations. Notably, transfer learning from German to Dutch surpasses the Dutch E2E system by 7% and the Dutch pipeline system by 4%. This study not only underscores the feasibility of transfer learning in spoken NER but also sets promising outcomes for future evaluations, hinting at the need for comprehensive data collection to augment the results.

About This Repo

This repository includes the following:

  • The source code we used to run different experiments in this work
  • Links to the datasets and a description of the data format
  • A usage guide for re-running the experiments


Data

  • For all experiments in this work, we relied on data from Common Voice Corpus v12.0 (07/12/2022).
  • For each language (English, German, Dutch), we apply the following pre-processing:
    • Remove all duplicated utterances.
    • Remove all utterances whose audio standard deviation is at or below the threshold of 0.001 (empty or nearly inaudible audio).
    • Remove all punctuation from the transcription; for English only, we keep the apostrophe (').
    • Some transcriptions mix languages, for example an English sentence containing Russian or Chinese names. Such names are converted to the Latin alphabet. This does not apply to the special German letters (ä, ö, ü, ß).
  • After pre-processing, we use our best NER model to generate pseudo-annotations from the transcriptions, and then convert all transcriptions to lowercase.
  • We save the output of the previous steps for each language in a JSON file with the following format:
              "meta": {
                  "Contains meta information about the dataset"
              "data": [
                      "Original Transcription",
  • Our source code depends heavily on this format.
  • The JSON files can be found at:
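
The filtering rules in the pre-processing list can be sketched in a few lines of Python. This is only an illustration and not the repository's code: the helper names (`is_audible`, `strip_punctuation`, `drop_duplicates`), the language code `"en"`, the `(clip, transcription)` record shape, and the choice to treat two utterances as duplicates when their transcriptions match are all assumptions. Transliteration to the Latin alphabet is left out because it needs an external library.

```python
import statistics
import unicodedata

STD_THRESHOLD = 0.001  # threshold quoted in the pre-processing list


def is_audible(samples, threshold=STD_THRESHOLD):
    """Return True when the clip's standard deviation is strictly above the threshold."""
    samples = list(samples)
    if not samples:
        return False  # an empty clip carries no audio at all
    return statistics.pstdev(samples) > threshold


def strip_punctuation(text, language):
    """Drop every Unicode punctuation character; English ("en") keeps the apostrophe."""
    keep = {"'"} if language == "en" else set()
    return "".join(
        ch for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("P")
    )


def drop_duplicates(records):
    """Keep the first record per transcription (assumed duplicate criterion)."""
    seen = set()
    unique = []
    for clip, transcription in records:
        if transcription not in seen:
            seen.add(transcription)
            unique.append((clip, transcription))
    return unique
```

For example, `strip_punctuation("Don't stop, now!", "en")` keeps the apostrophe and drops the comma and the exclamation mark, while the same call with `"de"` would drop the apostrophe as well.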





Usage

python \
  --task "End2end Training ${LANGUAGE} XLS-R 300" \
  --data_path $DATA_BASE_PATH"/${LANGUAGE}/cv_${LANGUAGE}_dataset.json" \
  --vocab_path $DATA_BASE_PATH"/${LANGUAGE}/${LANGUAGE}_vocab_with_tags.json" \
  --clips_path $DATA_BASE_PATH"/${LANGUAGE}/clips" \
  --output_path $OUTPUT_BASE_PATH"/${LANGUAGE}/" \
  train \
    --pretrained_model $MODEL

Transfer Learning


PORTION=40        # Use 40% of the target-language training data


python \
  --task "Transfer From German to NL XLS-R 300" \
  --data_path $DATA_BASE_PATH"/${LANGUAGE}/cv_${LANGUAGE}_dataset.json" \
  --vocab_path $DATA_BASE_PATH"/de/de_vocab_with_tags.json" \
  --clips_path $DATA_BASE_PATH"/${LANGUAGE}/clips" \
  --output_path $OUTPUT_BASE_PATH"/transfer_to_${LANGUAGE}_p${PORTION}/" \
  transfer \
    --source_model $MODEL \
    --data_proportion ${PORTION} \
    --epochs 30
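
To make the meaning of `--data_proportion` concrete, here is a minimal sketch of selecting a percentage of the target-language training set. How the repository actually picks the subset (random sampling, first N items, fixed seed) is not stated here, so `sample_portion` and its `seed` parameter are hypothetical and seeded random sampling is an assumption.

```python
import random


def sample_portion(train_items, portion, seed=0):
    """Return a reproducible random subset holding `portion` percent of the items.

    Hypothetical helper: seeded random sampling is an assumption,
    not necessarily how the repository selects its subset.
    """
    if not 0 < portion <= 100:
        raise ValueError("portion must be in the range (0, 100]")
    items = list(train_items)
    k = round(len(items) * portion / 100)
    return random.Random(seed).sample(items, k)
```

With `PORTION=40` and ten training utterances, this returns four of them, and the same seed always yields the same four.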

Evaluation with and without a Language Model



python \
 --task "End2end Evaluation ${LANGUAGE} XLS-R 300" \
 --data_path $DATA_BASE_PATH"/${LANGUAGE}/cv_${LANGUAGE}_dataset.json" \
 --vocab_path $DATA_BASE_PATH"/${LANGUAGE}/${LANGUAGE}_vocab_with_tags.json" \
 --clips_path $DATA_BASE_PATH"/${LANGUAGE}/clips" \
 --output_path $OUTPUT_BASE_PATH"/evaluation/${LANGUAGE}/no_lm" \
 --batch_size 8 \
 evaluate \
   --asr_model $MODEL

python \
 --task "End2end Evaluation ${LANGUAGE} XLS-R 300 with LM" \
 --data_path $DATA_BASE_PATH"/${LANGUAGE}/cv_${LANGUAGE}_dataset.json" \
 --vocab_path $DATA_BASE_PATH"/${LANGUAGE}/${LANGUAGE}_vocab_with_tags.json" \
 --clips_path $DATA_BASE_PATH"/${LANGUAGE}/clips" \
 --output_path $OUTPUT_BASE_PATH"/evaluation/${LANGUAGE}/lm" \
 --batch_size 8 \
 evaluate \
   --asr_model $MODEL \
   --language_model $DATA_BASE_PATH"/${LANGUAGE}/${LANGUAGE}" \
   --lm_alpha 1.0 \
   --lm_beta 3.3 \
   --lm_beam_size 2000
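
The flags `--lm_alpha` and `--lm_beta` most likely follow the usual convention for CTC beam search with a language model, where alpha weights the language model score and beta adds a bonus per decoded word. The sketch below shows that generic scoring rule under this assumption; `fused_score` and `best_hypothesis` are illustrative names, the log-probabilities are made-up numbers, and the repository's decoder may differ in its details. `--lm_beam_size` would then bound how many such hypotheses are kept during the search.

```python
def fused_score(acoustic_logp, lm_logp, num_words, alpha=1.0, beta=3.3):
    """Combine acoustic and LM log-probabilities with a word-count bonus."""
    return acoustic_logp + alpha * lm_logp + beta * num_words


def best_hypothesis(hypotheses, alpha=1.0, beta=3.3):
    """Pick the text with the highest fused score.

    `hypotheses` is an iterable of (text, acoustic_logp, lm_logp) tuples.
    """
    return max(
        hypotheses,
        key=lambda h: fused_score(h[1], h[2], len(h[0].split()), alpha, beta),
    )[0]


# Illustrative candidates with invented scores.
candidates = [
    ("i live in bonn", -12.0, -9.0),
    ("i leave in bon", -11.5, -14.0),
]
with_lm = best_hypothesis(candidates)                # LM weight 1.0
without_lm = best_hypothesis(candidates, alpha=0.0)  # LM ignored
```

In this toy case the acoustic model alone slightly prefers the second candidate, while the language model term tips the decision toward the first.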


Contact

If you have any questions or inquiries, feel free to send an email to Moncef or Tuğtekin.


Acknowledgments

This work was done while I was working at Fraunhofer IAIS and is supported by the European Union’s Horizon 2020 Research and Innovation Program under Grant Agreement No. 957017 (SELMA).

Paper Citation

      title={Exploring Spoken Named Entity Recognition: A Cross-Lingual Perspective}, 
      author={Moncef Benaicha and David Thulke and M. A. Tuğtekin Turan},

