keiouok / naturalcc

NaturalCC: A Toolkit to Naturalize the Source Code Corpora

Home Page:http://xcodemind.github.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

NaturalCC v0.5.0

NaturalCC is a sequence modeling toolkit that allows researchers and developers to train custom models for many software engineering tasks, e.g., code summarization, code retrieval, code completion, code clone detection and type inference. Our vision is to bridge the gap between programming language and natural language through machine learning techniques.

Version


Features

  • A collection of code corpus with data preprocessing
  • Performance benchmark
  • Mixed precision training
  • Multi-gpu training

Dataset

Currently, we have processed the following datasets:

Requirements

  • PyTorch version >= 1.4.0
  • Python version >= 3.6
  • GCC/G++ > 5.0
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • For faster training install NVIDIA's apex library with the --cuda_ext and --deprecated_fused_adam options

Installation

1. Half precision computation

NaturalCC supports half precision training.

  • If your Pytorch.__version__ < 1.6.0 and nvcc -V is runnable, install apex.
  • Otherwise, use Pytorch to build half precision module torch.cuda.amp. This will be supported in the future.

2. Install other prerequisite libraries

git clone https://github.com/xcodemind/naturalcc
cd naturalcc
pip install -r requirements.txt

3. Build or install NaturalCC

Export your NaturalCC cache directory (data and models will be saved in this directory) to user variables(~/.bashrc or ~/.zshrc).

echo "export NCC=[directory to save NaturalCC data/models ]" >> ~/.bashrc

Note: PyCharm may not get environment variables and thus we recommend you to register your NCC variable at ncc/__init__.py*

Compile Cython files to accelerate programs and register NaturalCC into your pip list

# compile for debug
# python setup.py build_ext --inplace
# install 
pip install --editable ./

Since NCC is build via Cython, your GCC/G++ version should be greater than 4.9. If you have the root permission, update GCC/G++; otherwise, install GCC/G++ with conda.

# install GCC/G++ with conda
conda install -c anaconda gxx_linux-64
conda install -c conda-forge gcc_linux-64
cd ~/anaconda/envs/XXX/bin
ln -s x86_64-conda_cos6-linux-gnu-gcc gcc
ln -s x86_64-conda_cos6-linux-gnu-g++ g++
# check
conda deactivate
conda activate XXX
>> type "gcc/g++ -v" in terminals

Examples

All the running commands here should be executed in the root of project folder (the path of your naturalcc). For example, in my environment I will stay at /data/wanyao/Dropbox/ghproj-v100/naturalcc.

Pre-trained models and logs

BLEU-4 METEOR ROUGE-L Cost Logs
Seq2Seq+Attn 25.57 14.40 39.41 0.09s/b click here
Tree2Seq+Attn 23.35 12.59 36.49 0.48s/b click here
Transformer 30.64 17.65 44.59 0.26s/b click here
Transformer+RPE 31.57 17.74 45.18 0.27s/b click here
PLBART 32.71 18.13 46.05 0.80s/b TBC
Go Java JS PHP Python Ruby Cost Logs
NBOW 66.59 59.92 47.15 54.75 63.33 42.86 0.16s/b click here
ConV1d 70.87 60.49 38.81 61.92 67.29 36.53 0.30s/b click here
BiRNN 65.80 48.60 23.23 51.36 48.28 19.35 0.74s/b click here
SelfAttn 78.45 66.55 50.38 65.78 79.09 47.96 0.25s/b click here
Attr Num Name Param Tokens Cost Logs
LSTM 51.67 47.45 46.52 66.06 73.73 0.31s/b click here
GTP-2 70.37 62.20 63.84 73.54 82.17 0.43s/b click here
TravTrans 72.08 68.55 76.33 71.08 83.17 0.43s/b click here

Type Inference

Acc@1 (All types) Acc@5 (All types) Acc@1 (Any types) Acc@5 (Any types) Cost Logs
DeepTyper 0.52 0.67 0.43 0.67 0.42s/b TBC
Transformer 0.32 0.64 0.37 0.75 0.85s/b TBC

License and Acknowledgement

NaturalCC is MIT-licensed. The license applies to the pre-trained models as well. This project is also highly inspired by Fairseq and AllenNLP.

Citation

Please cite as:

under reviewing

About

NaturalCC: A Toolkit to Naturalize the Source Code Corpora

http://xcodemind.github.io

License:MIT License


Languages

Language:Python 98.8%Language:C++ 0.5%Language:Cuda 0.5%Language:Shell 0.2%