GumTreeDiff / datasets

A collection of diff datasets

Diff datasets

A collection of diff datasets. It contains:

  • GitHub Java is a Java dataset containing 1000 commits from 10 popular projects.
  • GitHub Python is a Python dataset containing 1000 commits from 10 popular projects.
  • Defects4J is a Java dataset of bug fixes used in the program repair community.
  • BugsInPy is a Python dataset of bug fixes used in the program repair community.

All datasets share the same layout: the before folders contain the files as they were before the modification, and the after folders contain the files after it. Inside the before and after folders, there is one folder per project, which in turn contains one folder per commit. The commit folder names are identical in before and after, so the two versions of a commit can be matched by name. The unparsable folder contains the commits from the datasets listed above for which at least one of the files could not be parsed.
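As an illustration of the layout described above, here is a minimal Python sketch that walks one dataset and pairs each commit's before folder with its after folder. It is not one of the scripts shipped with the repository: the function name `iter_commit_pairs` and the `dataset_root` parameter are hypothetical, and it assumes the folders are literally named `before` and `after` directly under the dataset root.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_commit_pairs(dataset_root: Path) -> Iterator[Tuple[str, str, Path, Path]]:
    """Yield (project, commit, before_dir, after_dir) for every commit
    that has a folder on both the before and the after side.

    Assumed layout (hypothetical root, folder names taken from the README):
        <dataset_root>/before/<project>/<commit>/...
        <dataset_root>/after/<project>/<commit>/...
    """
    before_root = dataset_root / "before"
    after_root = dataset_root / "after"
    if not before_root.is_dir():
        return  # nothing to iterate over

    for project_dir in sorted(p for p in before_root.iterdir() if p.is_dir()):
        for commit_dir in sorted(c for c in project_dir.iterdir() if c.is_dir()):
            # Commit names are the same on both sides, so the matching
            # after folder is found by mirroring the relative path.
            after_dir = after_root / project_dir.name / commit_dir.name
            if after_dir.is_dir():
                yield project_dir.name, commit_dir.name, commit_dir, after_dir
```

A caller could then loop over `iter_commit_pairs(Path("path/to/dataset"))` and hand each pair of folders to a diff tool. Commits that exist on only one side are skipped rather than reported, which is a design choice of this sketch and not something the datasets require.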

The Python scripts used to produce the datasets are also provided.
