Haoyu-Gao / READMESimplification


Evaluating Transfer Learning for Simplifying GitHub READMEs

This is the code for the paper "Evaluating Transfer Learning for Simplifying GitHub READMEs", published in the Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023).

Please find our preprint on arXiv.

Introduction

In this paper, we harvested a new software-related text simplification dataset. We trained a Transformer model on both the Wikipedia-to-Simple-Wikipedia dataset and our newly proposed dataset, and experimented with transfer learning, which yields better results.

Package requirements

To run the code, please install the following packages:

  • PyGithub
  • nltk
  • pytorch
  • numpy
  • scipy
  • BeautifulSoup
  • pytorch-transformers
  • pytorch-beam-search

Folders Walkthrough

Github_API

This folder contains code for harvesting data from GitHub.
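The harvesting code builds on PyGithub (listed in the requirements above). Purely as an illustrative sketch — the function names and token placeholder below are hypothetical, not the repository's actual scripts — fetching READMEs might look like:

```python
"""Hypothetical sketch of README harvesting with PyGithub.
Function names and the token placeholder are illustrative only."""

def decode_readme(raw: bytes) -> str:
    """PyGithub returns file contents as bytes; decode and normalise newlines."""
    return raw.decode("utf-8", errors="replace").replace("\r\n", "\n")

def harvest_readmes(token: str, repo_names):
    """Yield (repo_name, readme_text) for each repository in repo_names."""
    from github import Github  # deferred import: decode_readme works without PyGithub
    client = Github(token)  # personal access token raises the API rate limit
    for name in repo_names:
        repo = client.get_repo(name)
        yield name, decode_readme(repo.get_readme().decoded_content)
```

A caller would pass a personal access token and a list of `owner/name` strings; each README arrives already decoded to UTF-8 text.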

Aligner

aligner/ contains all steps for preprocessing the collected data and performing the alignment task.
The BERT checkpoint for the alignment task comes from Jiang et al. (Neural CRF Model for Sentence Alignment in Text Simplification); we made only minor modifications to fit our case. The BERT checkpoint can be accessed through this link.
For example, use the following command to align sentences:

python main.py --ipath=../data/db_eliminated_duplicate.txt --bert=../BERT_wiki --opath="../data/output.txt"
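The actual aligner uses the neural CRF BERT model referenced above. Purely as a stdlib toy illustrating what the sentence-alignment task computes — not the method used in this repository — a greedy lexical-overlap aligner could look like:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def align(complex_sents, simple_sents, threshold=0.3):
    """Pair each simple sentence with its most similar complex sentence,
    keeping only pairs whose similarity clears the threshold."""
    pairs = []
    for s in simple_sents:
        best = max(complex_sents, key=lambda c: jaccard(c, s))
        if jaccard(best, s) >= threshold:
            pairs.append((best, s))
    return pairs
```

The neural CRF aligner replaces this lexical similarity with BERT-based semantic similarity and models alignments jointly over the whole document rather than greedily.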

Simplification

simplification/ contains training, evaluation and generation steps.
To train the model, use the command

python3 train.py --config=training_config.json --model=model_config.json --save_path=to_path --data_source=wiki

You can specify the model configuration in model_config.json. Hyperparameters are adjustable in training_config.json. Two data sources are available for training: wiki and the software simplification corpus.
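The exact schema of model_config.json is defined by the repository; a plausible Transformer configuration might look like the following (all field names and values here are illustrative, not the repository's actual keys):

```json
{
  "num_layers": 6,
  "d_model": 512,
  "num_heads": 8,
  "d_ff": 2048,
  "dropout": 0.1,
  "vocab_size": 32000
}
```

Check the JSON files shipped with the repository for the authoritative key names.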


To generate simplified sentences from a model checkpoint, use the command

python3 generate.py --model=model_checkpoint --path=src_sentence_file --beam=5 --to_path=write_path
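The --beam flag sets the beam width used during decoding (the repository lists pytorch-beam-search among its requirements). As a self-contained sketch of the underlying technique — not that library's API — beam search over a toy next-token distribution looks like:

```python
import math

def beam_search(start, next_tokens, beam_width=5, max_len=5):
    """Toy beam search. next_tokens(seq) returns (token, prob) pairs;
    the search keeps the beam_width highest log-probability sequences."""
    beams = [([start], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, prob in next_tokens(seq):
                candidates.append((seq + [tok], score + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the best beam_width
    return beams[0][0]
```

In a real decoder, `next_tokens` would be the model's softmax over the vocabulary, and sequences ending in an end-of-sentence token would be retired from the beam.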


After generating simplified sentences, you can evaluate model performance with the BLEU score using the command

python3 evaluate.py --candidate=generated_sentences_file --reference=reference_sentences_file
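As a stdlib-only illustration of the metric itself (evaluate.py presumably relies on a standard toolkit such as nltk; this is not its implementation), a simplified single-reference, sentence-level BLEU can be written as:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: uniform n-gram weights, one reference,
    no smoothing (any zero precision yields a score of 0)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # brevity penalty punishes candidates shorter than the reference
    bp = math.exp(min(0.0, 1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Production BLEU implementations add smoothing and support multiple references, which is why scores from this sketch will not match evaluate.py exactly.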

Model Checkpoints

To access the model checkpoints used in our survey, follow this link.

Survey

The BLEU score is not ideal for evaluating simplification systems, so in this research we also conducted a survey. Please find the survey annotations and scores in the corresponding folder.

Citation

If you use our findings, please cite our paper.

@inproceedings{gao2023evaluating,
  title={Evaluating Transfer Learning for Simplifying GitHub READMEs},
  author={Gao, Haoyu and Treude, Christoph and Zahedi, Mansooreh},
  booktitle={Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
  pages={1548--1560},
  year={2023}
}
