xennygrimmato/meta_learn_source_code

On-the-Fly Adaptation of Source Code Models using Meta-Learning

Disha Shrivastava, Hugo Larochelle, Danny Tarlow

This repository contains the implementation and data for our paper, On-the-Fly Adaptation of Source Code Models using Meta-Learning. A block diagram of our approach can be found below. For more details, refer to the paper.

Dependencies

python 3.7
tensorflow-gpu 2.0.0 or tf-nightly-gpu-2.0-preview
tensor2tensor
tqdm
javac-parser

Data

Download the full Java Github Corpus from here (java_projects.tar.gz). Extract the data and place it in the Raw_Data folder.
Obtain the lists of projects in the train, test and validation splits of the 1% corpus from here (*-projects.txt files). Create folders named 'train', 'val' and 'test' containing the listed projects, copied from the java_projects folder obtained in the step above. Place these folders in the Raw_Data directory.
Download the jar from here (SLP-Core_v0.2.jar) and place it in the Raw_Data directory. To lex the corpus, run java -jar SLP-Core_v0.2.jar lex x x-lexed -l java once for each of x = train, test, val, where x points to the corresponding folder created in the previous step (requires a Java installation). After lexing, each x-lexed folder contains .java files with comments removed and Java tokens separated by tabs.
Run extract_data.py (Steps 2, 3, 5, 6). This produces the files data_x.txt and data_x.json, with x = train, test, val, in the Preprocessed_Data directory.
Run preprocess_data.py. This generates the files basic_dict.x and episodes_x.csv, where x = train, test, val.
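
The folder-creation and lexing steps above can be scripted. The following is a minimal sketch, not part of this repository: it assumes each *-projects.txt file lists one project name per line, and the list-file names used in prepare_splits (for example train-projects.txt) are hypothetical placeholders for the files you actually downloaded. Only the lexing command itself is taken from the steps above.

```python
import shutil
from pathlib import Path


def read_project_list(list_file):
    """Return project names from a list file (assumed: one name per line)."""
    lines = Path(list_file).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def copy_split(projects, corpus_dir, split_dir):
    """Copy each listed project from corpus_dir into split_dir.

    Returns the names that were not found in corpus_dir.
    """
    corpus_dir, split_dir = Path(corpus_dir), Path(split_dir)
    split_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in projects:
        src, dst = corpus_dir / name, split_dir / name
        if not src.is_dir():
            missing.append(name)
        elif not dst.exists():
            # Skip projects that were already copied on an earlier run.
            shutil.copytree(src, dst)
    return missing


def lex_command(split, jar="SLP-Core_v0.2.jar"):
    """Build the lexing command for one split as an argument list."""
    return ["java", "-jar", jar, "lex", split, f"{split}-lexed", "-l", "java"]


def prepare_splits(raw_dir="Raw_Data"):
    """Create train/val/test folders and print the lexing commands."""
    raw_dir = Path(raw_dir)
    for split in ("train", "val", "test"):
        # Hypothetical list-file name; substitute the real *-projects.txt.
        projects = read_project_list(raw_dir / f"{split}-projects.txt")
        missing = copy_split(projects, raw_dir / "java_projects", raw_dir / split)
        print(split, "projects not found:", missing)
        print(" ".join(lex_command(split)))
```

Calling prepare_splits() from the repository root would build the three folders and print the commands to run from inside Raw_Data. Each argument list returned by lex_command can also be passed to subprocess.run.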

Repository Structure

Models : Directory for storing the models
Outputs : Directory for storing the outputs (output runs as well as hole features)
Trained_Models (can be downloaded from here)
- base_model : Trained base model
- tssa_fomaml : TSSA-FOMAML best model
- tssa_reptile : TSSA-Reptile best model
Preprocessed_Data
- subword_vocab.txt : Subword vocab
- subword_vocab_counts.dict : Subword vocab with counts
- token_vocab.dict : token vocab with counts
Raw_Data
- not_vocab_1_percent.txt : list of projects in the train, test and val splits of the 1% corpus, which are therefore excluded when forming the vocab split
data.py : Creates data iterators
model.py : Model definition and call functions
losses.py : Loss functions
generate_episodes.py : Creates episodes consisting of a hole target and the corresponding support tokens
test.py : Evaluation script
train_base_model.py : Training of base model
meta_train.py : Training with TSSA-FOMAML or TSSA-Reptile
extract_data.py : To extract 1% corpus from raw data and generate json and text files
preprocess_data.py : To preprocess data
runs.txt : Stores meta-info corresponding to each run

Replicating results

The trained models can be downloaded from here (place them in the root folder). To replicate the results in Table 2 of the paper, run the commands below:

Base Model: python test.py --method base_model --comment test_base_model
Dynamic Evaluation: python test.py --method dyn_eval --inner_learning_rate 1e-3 --comment test_dyn_eval
TSSA-1: python test.py --method tssa --inner_learning_rate 5e-3 --num_of_updates 1 --sup_def proj --num_sup_tokens x --sup_batch_size x --comment test_tssa_1_x (where x = 256, 512, 1024)
TSSA-k: python test.py --method tssa --inner_learning_rate 5e-4 --num_of_updates 16 --num_sup_tokens x --comment test_tssa_k_x (where x = 256, 512, 1024)
TSSA-Reptile: python test.py --method tssa --inner_learning_rate 5e-4 --num_of_updates 16 --num_sup_tokens x --model_load_dir 'Trained_Models/tssa_reptile/' --comment test_tssa_reptile_x (where x = 256, 512, 1024)
TSSA-FOMAML: python test.py --method tssa --inner_learning_rate 5e-4 --num_of_updates 16 --num_sup_tokens x --model_load_dir 'Trained_Models/tssa_fomaml/' --comment test_tssa_fomaml_x (where x = 256, 512, 1024)
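
Since the TSSA-k, TSSA-Reptile and TSSA-FOMAML commands above differ only in x, the model directory and the comment, the sweep over x = 256, 512, 1024 can be generated rather than typed by hand. The sketch below is a convenience wrapper and not part of the repository; the helper names tssa_commands and format_command are hypothetical, and every flag and value is copied from the commands listed above.

```python
import shlex

SUPPORT_SIZES = (256, 512, 1024)


def tssa_commands(model_load_dir=None, tag="tssa_k"):
    """Build one test.py argument list per support-token budget.

    With model_load_dir=None this mirrors the TSSA-k line; passing a
    Trained_Models directory mirrors the TSSA-Reptile/TSSA-FOMAML lines.
    """
    commands = []
    for x in SUPPORT_SIZES:
        args = [
            "python", "test.py",
            "--method", "tssa",
            "--inner_learning_rate", "5e-4",
            "--num_of_updates", "16",
            "--num_sup_tokens", str(x),
        ]
        if model_load_dir is not None:
            args += ["--model_load_dir", model_load_dir]
        args += ["--comment", f"test_{tag}_{x}"]
        commands.append(args)
    return commands


def format_command(args):
    """Render an argument list as a shell-safe command string."""
    return " ".join(shlex.quote(a) for a in args)
```

For example, tssa_commands("Trained_Models/tssa_reptile/", "tssa_reptile") yields the three TSSA-Reptile runs. Each list can be passed to subprocess.run(args, check=True) from the repository root.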

To train the base model, run python train_base_model.py with the default parameters.

To meta-train with Reptile : python meta_train.py --train_method reptile --num_sup_tokens 512 --num_of_updates 32 --sup_def proj --inner_learning_rate 5e-5 --checkpoint_dir Models/tssa_reptile --comment train_val_tssa_reptile

To meta-train with FOMAML : python meta_train.py --train_method fomaml --num_sup_tokens 1024 --num_of_updates 14 --checkpoint_dir Models/tssa_fomaml --comment train_val_tssa_fomaml
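
For readers unfamiliar with the two meta-training variants named above, the toy sketch below shows how Reptile and first-order MAML (FOMAML) differ in their outer update, in their generic textbook form. It is not the code in meta_train.py: the scalar parameter, the quadratic loss and all function names are illustrative assumptions, and the actual model, losses and support-token construction are described in the paper.

```python
def inner_adapt(theta, support_target, inner_lr, num_updates):
    """Run num_updates SGD steps on the toy loss 0.5 * (phi - target)^2."""
    phi = theta
    for _ in range(num_updates):
        grad = phi - support_target  # gradient of the toy support loss
        phi -= inner_lr * grad
    return phi


def reptile_step(theta, support_target, inner_lr, num_updates, outer_lr):
    """Reptile: move theta toward the adapted parameters phi."""
    phi = inner_adapt(theta, support_target, inner_lr, num_updates)
    return theta + outer_lr * (phi - theta)


def fomaml_step(theta, support_target, query_target,
                inner_lr, num_updates, outer_lr):
    """FOMAML: apply the query-loss gradient, evaluated at phi, to theta."""
    phi = inner_adapt(theta, support_target, inner_lr, num_updates)
    query_grad = phi - query_target  # first-order: no backprop through adaptation
    return theta - outer_lr * query_grad
```

In this toy setting, Reptile needs only the support data, while FOMAML additionally evaluates a held-out target (here query_target) at the adapted parameters.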

Disclaimer: In some versions of tf-nightly-gpu, you might get an error regarding the use of experimental_ref() for the tqdm progress bar. In that case, remove experimental_ref() and the script should run fine.
