Within-Project Defect Prediction of Infrastructure-as-Code Using Product and Process Metrics

Replication package for the paper Within-Project Defect Prediction of Infrastructure-as-Code Using Product and Process Metrics, accepted for publication in IEEE Transactions on Software Engineering (TSE 2020).

How to cite

@ARTICLE{9321740,  
  author={{Dalla Palma}, Stefano and {Di Nucci}, Dario and Palomba, Fabio and Tamburri, Damian A.},  
  journal={IEEE Transactions on Software Engineering},   
  title={Within-Project Defect Prediction of Infrastructure-as-Code Using Product and Process Metrics},   
  year={2022},  
  volume={48},  
  number={6},  
  pages={2086-2104},  
  doi={10.1109/TSE.2021.3051492}
}

Info

HYPER-PARAMETERS.md - Hyper-parameters used to train the models (introduced in Section 3.4).
LABELS.md - Labels used to identify bug-related issues (introduced in Section 3.3).
METRICS.md - Table of metrics used to train the models (introduced in Section 3.3).
RQ1.md - RQ1 results.
RQ2.md - RQ2 results.
RQ3.md - RQ3 results.
RQ3-additional.md - Results of the additional analysis for RQ3.

Data

Because of size limits, the raw data has been uploaded to Zenodo.

In this repo you can find the following data:

collected-repositories.csv - the 1050 collected repositories.
selected-repositories.csv - the 200 repositories that satisfied the inclusion criteria in the paper's Table 1.
analyzed-repositories.csv - the 104 repositories used to answer the RQs.
fixing-commits-validation.csv - sample of manually assessed defect-fixing commits. The complete list of defect-fixing commits is available on Zenodo.
szz-validation.csv - sample of manually assessed defect-inducing commits. The complete list of defect-inducing commits is available on Zenodo.
rq1.csv - data collected to answer RQ1 (Techniques performance).
rq2.csv - data collected to answer RQ2 (Metrics performance).
rq3.json - data collected to answer RQ3 (Recursive Feature Elimination).
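
As a quick sanity check after cloning, the repository counts stated above can be compared against the CSV files. The following is a minimal sketch, assuming each file has a single header row followed by one repository per row; that layout is an assumption, and the helper names `count_rows` and `check_counts` are hypothetical and not part of the replication package.

```python
import csv
from pathlib import Path

# Row counts stated in this README. The file layout (one header row,
# one repository per row) is an assumption, not documented here.
EXPECTED = {
    "collected-repositories.csv": 1050,
    "selected-repositories.csv": 200,
    "analyzed-repositories.csv": 104,
}


def count_rows(csv_path):
    """Return the number of non-empty data rows, excluding the first (header) row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return sum(1 for row in reader if row)


def check_counts(data_dir="."):
    """Map each file found in data_dir to (actual_rows, expected_rows)."""
    report = {}
    for name, expected in EXPECTED.items():
        path = Path(data_dir) / name
        if path.exists():
            report[name] = (count_rows(path), expected)
    return report


if __name__ == "__main__":
    for name, (actual, expected) in check_counts().items():
        status = "ok" if actual == expected else "differs"
        print(f"{name}: {actual} rows (README says {expected}) -> {status}")
```

If a file has no header row, the count will be off by one; in that case drop the `next(reader, None)` line.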

Kaggle

The dataset containing the data to build the models is hosted on Kaggle. Go to Kaggle or download the dataset.

The kernels are divided into four groups:

rq1/<owner>/<repository> - used for RQ1.
tse2020/rq2/<owner>/<repository> - used for RQ2.
tse2020/rq3/<owner>/<repository> - used for RQ3.
tse2020/add/<owner>/<repository> - used for an additional analysis of RQ3.
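
When scripting over many kernels, it can help to split a kernel path into its group, owner, and repository. The sketch below follows the naming scheme listed above and accepts paths with or without the `tse2020/` prefix. `KernelRef` and `parse_kernel_path` are hypothetical names introduced for illustration, and the owner and repository in the usage example are placeholders.

```python
from typing import NamedTuple

# Group identifiers taken from the list above, with the purpose of each group.
GROUPS = {
    "rq1": "RQ1",
    "rq2": "RQ2",
    "rq3": "RQ3",
    "add": "additional analysis of RQ3",
}


class KernelRef(NamedTuple):
    group: str
    owner: str
    repository: str


def parse_kernel_path(path):
    """Split '[tse2020/]<group>/<owner>/<repository>' into its three parts."""
    parts = [p for p in path.strip("/").split("/") if p]
    if parts and parts[0] == "tse2020":
        parts = parts[1:]  # the prefix is optional in this sketch
    if len(parts) != 3 or parts[0] not in GROUPS:
        raise ValueError(f"unrecognized kernel path: {path!r}")
    return KernelRef(*parts)


if __name__ == "__main__":
    # Placeholder owner and repository names, for illustration only.
    ref = parse_kernel_path("tse2020/rq2/some-owner/some-repo")
    print(ref.group, "->", GROUPS[ref.group])
```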

Tool Suite

The RADON Framework for IaC Defect Prediction is available on Github. Below are the tools we used to build it.

1. IaC Github Repositories Collector - To collect active repositories. See on Github.
2. Repository Scorer - To collect repository metrics based on best engineering practices. See on Github.
3. IaC Repository Miner - To mine repositories and collect product, delta, and process metrics. See on Github.

3.1. AnsibleMetrics - To extract product metrics for Ansible. See on Github.

3.2. PyDriller - To analyze commit history and extract process metrics. See on Github.
4. IaC Defect Predictor - To build and evaluate models. See on Github.
