yu-wang-coder / Naive_Bayesian_Classifier_MapReduce

A naive Bayes classifier for bug-label classification of GitHub issues, written in Python 3 using the MapReduce model.



Dependencies

1. VMware and Ubuntu (optional):
    https://pan.baidu.com/s/1X29KTBNUx71GcqLc9aGB1Q    extraction code: a9d5
2. JDK 1.8:
    https://pan.baidu.com/s/1X29KTBNUx71GcqLc9aGB1Q    extraction code: a9d5
3. Hadoop 2.6.0:
    https://pan.baidu.com/s/1ug00xUXIIvN_zyrXsRVnSw    extraction code: 64kb

As an example, our code runs on three virtual machines. If you don't have enough physical machines, or you don't know how to install Hadoop, you can follow our environment settings.

Dataset

The data were crawled from the issues of the Microsoft vscode project. The original data can be found in ori_data.

We aim to predict the label (bug tag) of each issue from its title.
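To make the prediction task concrete, here is a minimal single-machine sketch of naive Bayes label ranking over titles. It is not the repository's MapReduce code: the function names `train` and `rank_labels`, the whitespace tokenization, and the add-one (Laplace) smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count documents per label and words per label from (label, title) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, title in examples:
        doc_counts[label] += 1
        # Assumed tokenization: lowercase and split on whitespace.
        for word in title.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def rank_labels(title, doc_counts, word_counts, vocabulary):
    """Return all labels sorted from most to least probable for the title."""
    total_docs = sum(doc_counts.values())
    vocab_size = len(vocabulary)
    scores = {}
    for label, n_docs in doc_counts.items():
        total_words = sum(word_counts[label].values())
        # Log prior of the label.
        score = math.log(n_docs / total_docs)
        for word in title.lower().split():
            # Log likelihood with add-one smoothing (an assumption, so that
            # unseen words do not produce a zero probability).
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + vocab_size))
        scores[label] = score
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `rank_labels("editor crashes", *train(examples))` returns every known label, best guess first, which is the form a rank-k evaluation needs.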

We then converted the data into the MapReduce input format, one issue per line, as follows:

label \t\t title
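The sketch below shows one way lines in this format could be parsed and counted in a map and reduce step. It assumes the separator is exactly two tab characters, and the key layout (`"doc"` and `"word"` tuples) and the function names are hypothetical; the repository's actual mapper and reducer may differ.

```python
from collections import Counter

SEPARATOR = "\t\t"  # assumed: two tab characters between label and title


def parse_line(line):
    """Split one input line into (label, title); return None if malformed."""
    label, sep, title = line.rstrip("\n").partition(SEPARATOR)
    if not sep or not label.strip() or not title.strip():
        return None
    return label.strip(), title.strip()


def map_line(line):
    """Emit (key, 1) pairs: one per labeled title and one per (label, word)."""
    parsed = parse_line(line)
    if parsed is None:
        return []
    label, title = parsed
    pairs = [(("doc", label), 1)]
    for word in title.lower().split():
        pairs.append((("word", label, word), 1))
    return pairs


def reduce_pairs(pairs):
    """Sum the values for each key, as a reducer would after the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

The resulting per-label document counts and per-label word counts are the statistics a naive Bayes model needs for its priors and likelihoods.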

We removed the examples whose labels appeared fewer than 10 times. This left 1935 titles for training and 850 titles for validation.
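The filtering step can be sketched as follows. The function name `filter_rare_labels` and the in-memory list of (label, title) pairs are assumptions for illustration; only the threshold of 10 comes from the text above.

```python
from collections import Counter

MIN_LABEL_COUNT = 10  # threshold stated in the text


def filter_rare_labels(examples, min_count=MIN_LABEL_COUNT):
    """Keep only (label, title) pairs whose label occurs at least min_count times."""
    label_counts = Counter(label for label, _ in examples)
    return [
        (label, title)
        for label, title in examples
        if label_counts[label] >= min_count
    ]
```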

Run

  • Clone the repository into your ${Hadoop home}
  • Add ${Hadoop home}/hadoop/bin to your PATH environment variable
  • Modify the paths in MapReduce_code/train.sh and predict.sh
  • cd MapReduce_code
  • sh run.sh
  • The result can be seen in final_result.txt
  • Attention: you may need to create the .sh files yourself because of permission problems

Result

Evaluation method    Result
rank1-Accuracy       32.35%
rank5-Accuracy       73.65%
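The README does not define these metrics. The sketch below assumes the common reading: rank-k accuracy is the fraction of validation titles whose true label appears among the classifier's k highest-ranked labels. The function name and input layout are hypothetical.

```python
def rank_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of examples whose true label is within the top-k predictions.

    ranked_predictions: one list of labels per example, best guess first.
    true_labels: the correct label for each example, in the same order.
    """
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        return 0.0
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)
```

Under this reading, rank-1 accuracy is ordinary accuracy, and rank-5 accuracy can never be lower than it.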



Languages

Python 68.9%, Shell 31.1%