Imtiazkarimik23 / SPEC5G

This repository contains the code and data of the paper titled "SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis" published at AACL 2023.

Home Page: https://arxiv.org/abs/2301.09201

Updates

  • (August 8, 2023) Added annotator reasoning for a subset of examples of security classification.
  • (Initial Release) The pretrained models are now available for download.

SPEC5G

This repository contains the code and data for the paper "SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis", which was accepted at AACL 2023.

SPEC5G is a dataset for analyzing the natural language specifications of 5G cellular network protocols. SPEC5G contains 3,547,587 sentences with 134M words, drawn from 13,094 cellular network specifications and 13 online websites. Building on large-scale pre-trained language models that have achieved state-of-the-art results on natural language processing (NLP) tasks, we have used this dataset for two tasks: security-related text classification and summarization. Security-related text classification can be used to extract security-relevant properties for protocol testing. Summarization can help developers and practitioners gain a high-level understanding of the protocol, which is itself a daunting task.

SPEC5G is the first-ever public 5G dataset for NLP research on network security.

Datasets

Download the dataset from here. This includes:

  • Our original 134M-word training corpus (Gold_5G_v4.0.zip)
  • 5GSum - Summarization Dataset (simplification_dataset.csv)
  • 5GSC - Classification Dataset (5GSC.csv)
  • 5GSC Annotator Reasoning - Annotator Explanation for Subset of 5GSC
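
Once the files are downloaded, a quick inspection helps confirm what each CSV contains before any training code is written. The following is a minimal sketch, not part of the repository's own code: the helper name `summarize_csv` is hypothetical, the files are assumed to sit in the current directory, and UTF-8 encoding is assumed. No column names are assumed; they are read from each file's header row.

```python
import csv
from pathlib import Path


def summarize_csv(path, preview_rows=3):
    """Return the column names, row count, and first few rows of a CSV file.

    Hypothetical helper for illustration; it is not part of the SPEC5G code.
    """
    # UTF-8 is an assumption; undecodable bytes are replaced rather than raising.
    with Path(path).open(newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        preview = []
        row_count = 0
        for row in reader:
            if row_count < preview_rows:
                preview.append(row)
            row_count += 1
    return {"columns": columns, "rows": row_count, "preview": preview}


if __name__ == "__main__":
    # File names come from the list above; their location is an assumption.
    for name in ("5GSC.csv", "simplification_dataset.csv"):
        if Path(name).exists():
            info = summarize_csv(name)
            print(name, info["columns"], info["rows"])
        else:
            print(name, "not found; download it first")
```

Reading the header instead of hard-coding field names keeps the sketch valid even if the released files change their column layout.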

Models

The pretrained model checkpoints can be found below:

Dependencies

Training & Evaluation

Citation

If you use this dataset, models, or code modules, please cite the following paper:

@InProceedings{karim-EtAl:2023:findings,
  author    = {Karim, Imtiaz  and  Mubasshir, Kazi Samin  and  Rahman, Mirza Masfiqur  and  Bertino, Elisa},
  title     = {SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis},
  booktitle      = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},
  month          = {November},
  year           = {2023},
  address        = {Nusa Dua, Bali},
  publisher      = {Association for Computational Linguistics},
  pages     = {20--38},
  url       = {https://aclanthology.org/2023.findings-ijcnlp.3}
}

About

License: Apache License 2.0


Languages

  • Jupyter Notebook 99.7%
  • Python 0.3%