ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

Code for the paper ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications.

Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications.

ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very-high frequency radio channels. In order to incorporate these novel technologies into ATC (a low-resource domain), large-scale annotated datasets are required to develop the data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, signal-to-noise ratio estimate and English language detection score per sample. Both are available for purchase through ELDA at this http URL. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus, which we are offering for free at this https URL. We expect the ATCO2 corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.

ATCO2 corpus ecosystem. Blue circles denote annotations only available for ATCO2 test set corpus. Green circles denote annotations and metadata available for both ATCO2 test set and ATCO2 pseudo-labeled corpus sets.

Repository written by: Juan Pablo Zuluaga.

Preparing Environment

The first step is to create an environment with the packages required for data preparation, formatting, and running the experiments. You can run the following commands to create the conda environment (assuming CUDA 11.7):

Step 1: Using Python 3.10, install Python and the requirements

git clone https://github.com/idiap/w2v2-air-traffic
conda create -n atco2_corpus python==3.10
conda activate atco2_corpus
python -m pip install -r requirements.txt

Before running any script, make sure the en_US locale is set and that PYTHONPATH includes the repository root folder.

export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
export PYTHONPATH=$PYTHONPATH:$(pwd) #assuming you are in root repository folder
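
As a quick sanity check of the two exports above, the following minimal Python sketch inspects an environment mapping and reports what is missing. The function name `check_environment` is hypothetical and not part of this repository; it only assumes the variable values shown in the commands above.

```python
import os
from pathlib import Path


def check_environment(env, repo_root):
    """Return a list of human-readable problems; an empty list means all good.

    Hypothetical helper, not part of the repository.
    """
    problems = []

    # The setup above exports an en_US UTF-8 locale in both LANG and LC_ALL.
    for var in ("LANG", "LC_ALL"):
        if env.get(var) != "en_US.UTF-8":
            problems.append(f"{var} should be en_US.UTF-8, got {env.get(var)!r}")

    # PYTHONPATH is a list of directories separated by os.pathsep (":" on Linux).
    root = Path(repo_root).resolve()
    entries = [
        Path(p).resolve()
        for p in env.get("PYTHONPATH", "").split(os.pathsep)
        if p
    ]
    if root not in entries:
        problems.append(f"PYTHONPATH does not contain {root}")

    return problems


if __name__ == "__main__":
    # Run from the repository root folder, as the export command assumes.
    for problem in check_environment(os.environ, os.getcwd()):
        print("WARNING:", problem)
```

Running the script from the repository root prints one warning per missing setting and nothing when the environment matches the exports above.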

Usage

There are several steps to replicate/use our proposed models:

Out-of-the-box model on HuggingFace

What can you do with ATCO2 corpus?

Automatic Speech Recognition

This system lets you obtain the text-level information of what was said in an ATC communication. Its output is normally fed to the systems described below.

Speaker Role Identification

With this module, you can detect who is speaking (air traffic controller or pilot) in a given communication.

Named-Entity Recognition

Here, the aim is to understand what was said in the communication. With the ATCO2 corpus, you can train a system that detects callsigns, commands and values in the communication.
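
To make the three entity types concrete, here is a minimal Python sketch that merges word-level tags into entity spans. It assumes a BIO tagging scheme with the labels `callsign`, `command` and `value`; the exact tag names, the `group_entities` helper and the example utterance are illustrative assumptions, not the corpus's actual annotation format or this repository's API.

```python
def group_entities(words, tags):
    """Merge word-level BIO tags into (entity_type, text) spans.

    Hypothetical helper; the tag names are assumed for illustration.
    """
    entities = []
    current_type, current_words = None, []

    def flush():
        # Store the open span (if any) and reset the state.
        nonlocal current_type, current_words
        if current_type is not None:
            entities.append((current_type, " ".join(current_words)))
        current_type, current_words = None, []

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open span.
            current_words.append(word)
        else:
            # An "O" tag, or an I- tag that does not continue the open span,
            # closes the span; the word itself is skipped.
            flush()
    flush()
    return entities


# Invented utterance and tags, for illustration only.
words = "lufthansa three two four descend flight level eight zero".split()
tags = [
    "B-callsign", "I-callsign", "I-callsign", "I-callsign",
    "B-command",
    "B-value", "I-value", "I-value", "I-value",
]
print(group_entities(words, tags))
```

In practice the tags would come from a trained token-classification model rather than being written by hand; the grouping step stays the same.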

Related work

Here is a list of papers related to AI/ML for air traffic control communications:

Fine-tuning a pretrained BERT model on the named entity recognition task to perform text-based diarization for ATC communications:
- paper: BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications
- code: https://github.com/idiap/bert-text-diarization-atc
Fine-tuning a pretrained Wav2vec 2.0 model for automatic speech recognition:
- paper: How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications
- code: https://github.com/idiap/w2v2-air-traffic
How to use contextual data (biasing) in ATC automatic speech recognition:
- paper: A two-step approach to leverage contextual data: speech recognition in air-traffic communications
Ethics in collection of ATC audio data: Legal and Ethical Challenges in Recording Air Traffic Control Speech

How to cite us

If you use this code for your research, please cite our papers with the following BibTeX entries:

# article 1 - MAIN
@article{zuluaga2022atco2,
  title={ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications},
  author={Zuluaga-Gomez, Juan and Vesel{\'y}, Karel and Sz{\"o}ke, Igor and Motlicek, Petr and others},
  journal={arXiv preprint arXiv:2211.04054},
  year={2022}
}

# article 2 - Mainly on ASR
@inproceedings{zuluaga2023does,
  title={How does pre-trained Wav2Vec 2.0 perform on domain-shifted ASR? An extensive benchmark on air traffic control communications},
  author={Zuluaga-Gomez, Juan and Prasad, Amrutha and Nigmatulina, Iuliia and Sarfjoo, Seyyed Saeed and Motlicek, Petr and Kleinert, Matthias and Helmke, Hartmut and Ohneiser, Oliver and Zhan, Qingran},
  booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
  pages={205--212},
  year={2023},
  organization={IEEE}
}

# article 3 - Mainly on sequence classification and BERT  
@inproceedings{zuluaga2023bertraffic,
  title={Bertraffic: Bert-based joint speaker role and speaker change detection for air traffic control communications},
  author={Zuluaga-Gomez, Juan and Sarfjoo, Seyyed Saeed and Prasad, Amrutha and Nigmatulina, Iuliia and Motlicek, Petr and Ondrej, Karel and Ohneiser, Oliver and Helmke, Hartmut},
  booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
  pages={633--640},
  year={2023},
  organization={IEEE}
}

MIT License
