This repository showcases a graph-based approach and the Clonal Selection Algorithm (CLONALG) for text augmentation in Natural Language Processing (NLP) tasks.
Annotated data plays a crucial role in training machine learning models. However, manually labeling large amounts of data with high-quality annotations is time-consuming and labor-intensive. In NLP, human annotators vary in competency, training, and experience, which leads to arbitrary and ambiguous labeling standards. To address the shortage of high-quality labels, researchers have been exploring automated methods for enhancing training and testing datasets.
In this paper, we present a novel method that leverages the Clonal Selection Algorithm (CLONALG) and abstract meaning representation (AMR) graphs to improve the quality and quantity of data in two cybersecurity problems: fake news identification and sensitive data leak detection. Our proposed approach demonstrates significant enhancements in dataset performance and classification accuracy, surpassing baseline results by at least 5%.
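For readers unfamiliar with clonal selection, the loop below is a minimal, generic sketch of a CLONALG-style search on a toy bit-string problem. It is not the implementation in this repository: the function names (`clonalg`, `random_bits`, `flip_bits`), the parameter defaults, and the toy affinity (the number of ones) are all assumptions made for illustration. In the method described above, candidates would be texts and affinity would presumably come from an AMR-based measure.

```python
import random


def clonalg(affinity, random_candidate, mutate, pop_size=20, n_select=5,
            clone_factor=2.0, n_replace=2, generations=30, seed=0):
    """Generic clonal selection loop (illustrative, not the repository's code)."""
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Rank candidates by affinity, best first, and keep the top few.
        ranked = sorted(population, key=affinity, reverse=True)
        selected = ranked[:n_select]
        clones = []
        for rank, antibody in enumerate(selected, start=1):
            # Better-ranked candidates receive more clones.
            n_clones = max(1, round(clone_factor * pop_size / rank))
            # Worse-ranked candidates are mutated more strongly.
            rate = rank / n_select
            clones.extend(mutate(antibody, rate, rng) for _ in range(n_clones))
        # Keep the best of parents and clones, then inject fresh random
        # candidates to preserve diversity.
        pool = sorted(population + clones, key=affinity, reverse=True)
        population = pool[:pop_size - n_replace]
        population += [random_candidate(rng) for _ in range(n_replace)]
    return max(population, key=affinity)


def random_bits(rng, length=16):
    """Toy candidate: a random bit string (hypothetical stand-in for a text)."""
    return [rng.randint(0, 1) for _ in range(length)]


def flip_bits(bits, rate, rng):
    """Toy hypermutation: flip each bit with a probability scaled by `rate`."""
    return [1 - b if rng.random() < rate * 0.3 else b for b in bits]


if __name__ == "__main__":
    best = clonalg(affinity=sum, random_candidate=random_bits, mutate=flip_bits)
    print(best, sum(best))
```

The sketch keeps the best candidates across generations, so the top affinity never decreases; the random replacements are what keep the search from collapsing onto a single candidate.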
This repository contains the following files and directories:
- data/: This directory contains the dataset files used in the experiments.
- src/: This directory contains the AMR distance metrics code.
- code/: This directory contains the implementation of the TextAugmentation-CLONALG-AMR method.
- results/: This directory stores the results obtained from applying the method on the datasets.
- README.md: This file provides an overview of the repository and the research paper.
- LICENSE: This file contains the licensing information for the repository.
To utilize the TextAugmentation-CLONALG-AMR method for text augmentation, follow these steps:
- Clone this repository to your local machine.
- Install the necessary dependencies mentioned in the requirements file.
- Create a `data` folder in the root directory and place your dataset CSV file inside it. Make sure the column containing the text data is named `text` in the CSV file.
- Open the `main.py` file and modify the `input_file` variable to specify the path to your CSV file.
- Run the `main.py` script.
- The augmented data will be saved as a new CSV file in the `results` directory with the same name as the input file but with "_augmented" appended to it.
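The output naming rule above can be sketched as a small helper. This is an illustration only: `augmented_output_path` is a hypothetical name, and placing "_augmented" before the file extension is an assumption, since the steps do not say exactly where the suffix goes.

```python
from pathlib import Path


def augmented_output_path(input_file, results_dir="results"):
    """Return the assumed output path for a given input CSV.

    Hypothetical helper: the suffix is inserted before the extension,
    e.g. data/your_file.csv -> results/your_file_augmented.csv.
    """
    src = Path(input_file)
    return Path(results_dir) / f"{src.stem}_augmented{src.suffix}"


if __name__ == "__main__":
    print(augmented_output_path("data/your_file.csv"))
```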
To run the text augmentation on your own CSV file:
- Create `data` and `results` folders in the root directory of the repository.
- Place your dataset CSV file inside the `data` folder.
- Open the `main.py` file and modify the `input_file` variable: `input_file = "data/your_file.csv"  # Specify the path to your CSV file`
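Before running `main.py`, it may help to check the setup described in these steps. The following is a minimal sketch using only the standard library; `prepare_workspace` and `has_text_column` are hypothetical helpers and are not part of this repository.

```python
import csv
from pathlib import Path


def prepare_workspace(root="."):
    """Create the data/ and results/ folders under `root` if they are missing."""
    root = Path(root)
    for name in ("data", "results"):
        (root / name).mkdir(parents=True, exist_ok=True)
    return root / "data", root / "results"


def has_text_column(csv_path, column="text"):
    """Return True if the CSV header row contains the expected text column."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return column in header


if __name__ == "__main__":
    data_dir, results_dir = prepare_workspace()
    input_file = data_dir / "your_file.csv"  # same placeholder as in the steps
    if input_file.exists():
        print("text column present:", has_text_column(input_file))
```

Only the header row is read, so the check stays cheap even for large datasets.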
This repository is being actively developed and updated. Please check back for additional features and improvements.