ketrint / ml-snippets-classification

Source code for the paper "Machine learning code snippets semantic classification" (Valeriy Berezovskiy, Anastasia Gorodilova, Ekaterina Trofimova, Andrey Ustyuzhanin).

Machine learning code snippets semantic classification

This repository contains the source code of experiments from the paper "Machine learning code snippets semantic classification" (Valeriy Berezovskiy, Anastasia Gorodilova, Ekaterina Trofimova, Andrey Ustyuzhanin).

Preparation

Start by cloning the repository:

git clone https://github.com/vorobeevich/ml-snippets-classification

We highly recommend using conda for experiments: Anaconda.

After installation, make a new environment:

conda create --name cssc

conda activate cssc

Install the libraries listed in requirements.txt. The appropriate torch version may depend on your GPU; see Start Locally | PyTorch.
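After installation, it can help to confirm that torch is importable and which CUDA devices it sees, since the training script takes a CUDA device ID. The following is a minimal sketch and is not part of this repository; the helper name `describe_torch_environment` is hypothetical, and it simply reports that torch is missing if the installation step has not been run.

```python
import importlib.util


def describe_torch_environment():
    """Report whether torch is installed and which CUDA devices it can see.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    info = {"torch_installed": False, "torch_version": None, "cuda_devices": []}
    # Look torch up without importing it, so the function also works in an
    # environment where the requirements have not been installed yet.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    if torch.cuda.is_available():
        # Each index i is the numeric ID of a visible CUDA device.
        info["cuda_devices"] = [
            (i, torch.cuda.get_device_name(i))
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    print(describe_torch_environment())
```

An empty `cuda_devices` list with `torch_installed` set to `True` usually means a CPU-only torch build or a driver mismatch, in which case the PyTorch installation page is the place to pick a matching build.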

Data

Download the marked-up data (7,947 snippets) and the output of the partition algorithm from our Google Drive by running:

chmod +x ./src/scripts/load_data.sh

./src/scripts/load_data.sh

The full Code4ML dataset (marked-up data, the complete set of 2.5 million snippets, and our model predictions on all of the data) is available on Zenodo: DOI

For more details, see the paper about the dataset: Code4ML: a Large-scale Dataset of annotated Machine Learning Code.

Usage

To reproduce any experiment from our article, run the training script with the corresponding config. Note that results may differ across platforms even with a fixed random seed, because libraries such as torch are not fully deterministic.

python src/scripts/train.py --device [ID OF CUDA DEVICE] --config src/configs/[CHOOSE CONFIG TO RUN]
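To illustrate what fixing a random seed typically involves, and why it still does not guarantee identical results, here is a minimal sketch. It is an assumption about common practice, not a description of how `src/scripts/train.py` handles seeds; the helper name `set_seed` is hypothetical.

```python
import importlib.util
import os
import random


def set_seed(seed):
    """Seed the common random number generators (hypothetical helper).

    The seed should be a non-negative integer below 2**32, which is the
    range accepted by numpy.random.seed.
    """
    random.seed(seed)
    # Only affects hash randomization in Python processes started afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)

    if importlib.util.find_spec("numpy") is not None:
        import numpy as np

        np.random.seed(seed)

    if importlib.util.find_spec("torch") is not None:
        import torch

        torch.manual_seed(seed)
        # Silently ignored when CUDA is not available.
        torch.cuda.manual_seed_all(seed)
        # Prefer deterministic cuDNN kernels; this may slow training down.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False


if __name__ == "__main__":
    set_seed(42)
    first = random.random()
    set_seed(42)
    # The same seed reproduces the same draw within one environment.
    print(first == random.random())
```

Even with all of these generators seeded, some GPU operations have no deterministic implementation, and numerical results can vary with hardware, driver, and library versions, so small differences from the reported metrics are to be expected.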

About

License: MIT License

