On-the-Fly Adaptation of Source Code Models using Meta-Learning

Disha Shrivastava, Hugo Larochelle, Danny Tarlow

This repository contains the implementation and data for our work On-the-Fly Adaptation of Source Code Models using Meta-Learning. A block diagram of our approach can be found below; for more details, refer to the paper.

Dependencies

  • python 3.7
  • tensorflow-gpu 2.0.0 or tf-nightly-gpu-2.0-preview
  • tensor2tensor
  • tqdm
  • javac-parser
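
A minimal environment sketch, assuming pip on Python 3.7 (the package names mirror the list above; the exact pin, and whether the nightly preview build still resolves in your package index, are assumptions):

    # Install the dependencies listed above; swap in tf-nightly-gpu-2.0-preview
    # for tensorflow-gpu==2.0.0 to use the nightly build instead.
    pip install tensorflow-gpu==2.0.0 tensor2tensor tqdm javac-parser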

Data

  • Download the full Java Github Corpus from here (java_projects.tar.gz). Extract the data and place it in the Raw_Data folder.
  • Obtain the lists of projects in the 1% corpus for the train, test, and validation splits of the data from here (*-projects.txt files). Create folders named 'train', 'val' and 'test' containing these projects, copied from the java_projects folder obtained in the step above. Place these folders in the Raw_Data directory.
  • Download the jar from here (SLP-Core_v0.2.jar) and place it in the Raw_Data directory. To lex the corpus, run java -jar SLP-Core_v0.2.jar lex x x-lexed -l java, where x = train, test, val (requires a Java installation); x should point to the 'train', 'val' and 'test' folders created in the previous step (see the sketch after this list). After lexing, you will see .java files inside the x-lexed folders, with comments removed and the Java-tokenized text separated by tabs.
  • Run extract_data.py (Steps 2, 3, 5, 6). This creates data_x.txt and data_x.json files, with x = train, test, val, in the Preprocessed_Data directory.
  • Run preprocess_data.py. This generates the files basic_dict.x and episodes_x.csv, where x = train, test, val.
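
Concretely, the lexing and preprocessing steps might look like the following; the working directories are assumptions, so adjust the paths to your layout:

    # Lex each split with SLP-Core (run from inside Raw_Data, where the jar
    # and the train/val/test folders live):
    for x in train val test; do
        java -jar SLP-Core_v0.2.jar lex $x $x-lexed -l java
    done

    # Back in the repository root, extract the 1% corpus and preprocess it:
    cd ..
    python extract_data.py     # writes data_x.txt / data_x.json to Preprocessed_Data
    python preprocess_data.py  # writes basic_dict.x and episodes_x.csv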

Repository Structure

  • Models : Directory for storing the models
  • Outputs : Directory for storing the outputs (output runs as well as hole features)
  • Trained_Models (can be downloaded from here)
    • base_model : Trained base model
    • tssa_fomaml : TSSA-FOMAML best model
    • tssa_reptile : TSSA-Reptile best model
  • Preprocessed_Data
    • subword_vocab.txt : Subword vocab
    • subword_vocab_counts.dict : Subword vocab with counts
    • token_vocab.dict : Token vocab with counts
  • Raw_Data
    • not_vocab_1_percent.txt : List of projects in the train, test, and val splits of the 1% corpus, which are therefore excluded when forming the vocabulary
  • data.py : Creates data iterators
  • model.py : Model definition and call functions
  • losses.py : Loss functions
  • generate_episodes.py : Creates episodes consisting of hole targets and corresponding support tokens
  • test.py : Evaluation script
  • train_base_model.py : Training of base model
  • meta_train.py : Training with TSSA-FOMAML or TSSA-Reptile
  • extract_data.py : To extract 1% corpus from raw data and generate json and text files
  • preprocess_data.py : To preprocess data
  • runs.txt : Stores meta-info corresponding to each run

Replicating results

The trained models can be downloaded from here (place them in the root folder). To replicate the results in Table 2 of the paper, run the commands below; a loop sketch for sweeping the support-set sizes follows the list:

  • Base Model: python test.py --method base_model --comment test_base_model
  • Dynamic Evaluation: python test.py --method dyn_eval --inner_learning_rate 1e-3 --comment test_dyn_eval
  • TSSA-1: python test.py --method tssa --inner_learning_rate 5e-3 --num_of_updates 1 --sup_def proj --num_sup_tokens x --sup_batch_size x --comment test_tssa_1_x (where x = 256, 512, 1024)
  • TSSA-k: python test.py --method tssa --inner_learning_rate 5e-4 --num_of_updates 16 --num_sup_tokens x --comment test_tssa_k_x (where x = 256, 512, 1024)
  • TSSA-Reptile: python test.py --method tssa --inner_learning_rate 5e-4 --num_of_updates 16 --num_sup_tokens x --model_load_dir 'Trained_Models/tssa_reptile/' --comment test_tssa_reptile_x (where x = 256, 512, 1024)
  • TSSA-FOMAML: python test.py --method tssa --inner_learning_rate 5e-4 --num_of_updates 16 --num_sup_tokens x --model_load_dir 'Trained_Models/tssa_fomaml/' --comment test_tssa_fomaml_x (where x = 256, 512, 1024)
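
For example, the TSSA-k runs over all three support-set sizes can be scripted as below; the other TSSA variants follow the same pattern with their respective flags:

    # Sweep the three support-set sizes from Table 2 for TSSA-k:
    for x in 256 512 1024; do
        python test.py --method tssa --inner_learning_rate 5e-4 \
            --num_of_updates 16 --num_sup_tokens $x --comment test_tssa_k_$x
    done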

To train the base model, run python train_base_model.py with the default parameters.

To meta-train with Reptile: python meta_train.py --train_method reptile --num_sup_tokens 512 --num_of_updates 32 --sup_def proj --inner_learning_rate 5e-5 --checkpoint_dir Models/tssa_reptile --comment train_val_tssa_reptile

To meta-train with FOMAML: python meta_train.py --train_method fomaml --num_sup_tokens 1024 --num_of_updates 14 --checkpoint_dir Models/tssa_fomaml --comment train_val_tssa_fomaml

Disclaimer: With some versions of tf-nightly-gpu, you might get an error about the use of experimental_ref() in the tqdm progress-bar code. In that case, just remove the experimental_ref() call and the script should run fine.
