🌟replication package for 📜Self-Admitted Technical Debts Identification: How Far Are We?, SANER 2024.


# 2023-MT-BERT-SATD

This repository is the replication package for the paper "Self-Admitted Technical Debts Identification: How Far Are We?". It includes the implementation of MT-BERT-SATD, the preprocessed dataset used for training, and a tutorial on how to use our trained model for SATD identification across various sources.

To avoid potential conflicts of interest, the original datasets collected in the article can be obtained from the links below.

---

## Dataset

| Dataset | Sample Source | Link |
| --- | --- | --- |
| Dataset-01-Comments-Dockerfile | Code comments / Dockerfile | data |
| Dataset-02-Comments-Python | Code comments / Python | data |
| Dataset-03-Comments-XML | Code comments / XML | data |
| Dataset-04-Comments-Java | Code comments / Java | data |
| Dataset-05-Comments-Java | Code comments / Java | data |
| Dataset-06-Comments-Java | Code comments / Java | data |
| Dataset-07-Issue | Issue Trackers | data |
| Dataset-08-Issue | Issue Trackers | data |
| Dataset-09-PR | Pull Requests | data |
| Dataset-10-PR | Pull Requests | data |
| Dataset-11-Commits | Commit Messages | data |

## Code

The implementation of the MT-BERT-SATD model can be found in the code folder. The training script consists of four files: "modeling_multitask.py", "optimization.py", "tokenization.py", and "run_mt_bert_satd.py"; the training entry point is run_mt_bert_satd.py. Start training with the following command:

```shell
python run_mt_bert_satd.py \
  --data_dir {data_path} \
  --output_dir {model_checkpoints_save_path} \
  --do_train {train} \
  --do_eval {eval} \
  --train_batch_size {train_batch_size} \
  --eval_batch_size {eval_batch_size} \
  --learning_rate {learning_rate} \
  --num_train_epochs {epoch} \
  --seed {train_seed} \
  --patience {early_stopping_number}
```
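
The `--patience` flag controls early stopping. As a rough sketch of the idea (not the repository's actual training loop), patience-based early stopping halts training once the validation score has not improved for a given number of consecutive epochs:

```python
def train_with_early_stopping(epoch_eval_scores, patience):
    """Return the epoch index at which training would stop.

    epoch_eval_scores: validation scores (higher is better), one per epoch.
    patience: number of consecutive non-improving epochs tolerated.
    """
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(epoch_eval_scores):
        if score > best_score:
            best_score = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop: no improvement for `patience` epochs
    return len(epoch_eval_scores) - 1  # trained through the last epoch
```

With `patience` set to 2, a score sequence of `[0.5, 0.6, 0.6, 0.6]` stops at epoch 3, keeping the checkpoint from the best epoch.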

Tip: before training, download the BERT-base pre-trained files and place them in the bert_base_uncased folder.

Download link for the pre-trained models: links

## Predict

1. Download the well-trained model: link
2. Put the three downloaded files ("pytorch_model.bin", "vocab.txt", and "config.json") in the well_trained_model folder.
3. Place an unclassified CSV file, such as "4_unclassified.csv", in the unclassified_files directory, then run the following command to identify Self-Admitted Technical Debt (SATD):

```shell
python predict.py --task {id 1-5} --data_dir {file_name} --output_dir {out_path}
```

Explanation:

The valid range of {task} is 1-5, indicating the source of the unclassified file: 1 = "Issue Trackers", 2 = "Pull Requests", 3 = "Commit Messages", 4 = "Code Comments", and 5 = "Others".

For example, "4_unclassified" above contains code-comment data, so the full command is:

```shell
python predict.py --task 4 --data_dir 4_unclassified --output_dir predict_files
```

The prediction results will be written to the {output_dir} folder.
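
For batch runs over several unclassified files, the task-to-source mapping above can be captured in a small helper. The helper below is a hypothetical convenience sketch, not part of the repository; it only assembles the predict.py command line described above:

```python
# Hypothetical helper: assemble the predict.py command line for a given source type.
# The task-id mapping mirrors the one described above; predict.py is the script
# shipped in this repository.
SOURCE_TASKS = {
    1: "Issue Trackers",
    2: "Pull Requests",
    3: "Commit Messages",
    4: "Code Comments",
    5: "Others",
}

def build_predict_command(task_id, data_file, output_dir="predict_files"):
    """Return the argv list for one predict.py invocation."""
    if task_id not in SOURCE_TASKS:
        raise ValueError(f"task_id must be one of {sorted(SOURCE_TASKS)}, got {task_id}")
    return [
        "python", "predict.py",
        "--task", str(task_id),
        "--data_dir", data_file,
        "--output_dir", output_dir,
    ]
```

The resulting list can be passed to `subprocess.run`; for instance, `build_predict_command(4, "4_unclassified")` reproduces the example invocation above.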
