aashishyadavally / NeuralPDA

Link to paper: https://aashishyadavally.github.io/assets/pdf/pub-icse2023-(1).pdf


(Partial) Program Dependence Learning

Code fragments from developer forums often migrate into applications due to code-reuse practices. Owing to the incomplete nature of such programs, analyzing them to determine the presence of potential vulnerabilities early is challenging. In this work, we introduce NeuralPDA, a neural network-based program dependence analysis tool for both complete and partial programs. Our tool efficiently incorporates intra-statement and inter-statement contextual features into statement representations, thereby modeling program dependence analysis as a statement-pair dependence decoding task. In our empirical evaluation, NeuralPDA predicts the CFG and PDG edges in complete Java and C/C++ code with combined F-scores of 94.29% and 92.46%, respectively. The F-scores for partial Java and C/C++ code range from 94.29%–97.17% and 92.46%–96.01%, respectively. We also test the usefulness of the PDGs predicted by NeuralPDA (i.e., PDG*) on the downstream task of method-level vulnerability detection, and find that a vulnerability detection tool utilizing PDG* performs only 1.1% worse than one utilizing the PDGs generated by a traditional program analysis tool. Finally, a machine learning-based vulnerability detection tool that employs the PDGs predicted by NeuralPDA detects 14 real-world vulnerable code snippets from StackOverflow.

Model Architecture for NeuralPDA
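
As a rough illustration of the statement-pair decoding formulation described in the abstract: statements are encoded with intra-statement (token-level) attention, contextualized with inter-statement attention plus statement-level position encodings, and every ordered statement pair is decoded into dependence labels by an MLP. The sketch below is NOT the released NeuralPDA implementation; all module names, dimensions, and the two-label (CFG/PDG) head are illustrative assumptions.

# Minimal sketch (assumed, not the official model) of statement-pair
# dependence decoding with PyTorch.
import torch
import torch.nn as nn

class StatementPairDecoder(nn.Module):
    def __init__(self, vocab_size=50265, embedding_size=128, num_heads=8,
                 num_layers=2, hidden_size=256, max_stmts=8):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, embedding_size)
        # Intra-statement context: self-attention over the tokens of one statement.
        stmt_layer = nn.TransformerEncoderLayer(
            d_model=embedding_size, nhead=num_heads, batch_first=True)
        self.stmt_encoder = nn.TransformerEncoder(stmt_layer, num_layers=1)
        # Statement-level position encoding (learned, one slot per statement).
        self.pos_embed = nn.Embedding(max_stmts, embedding_size)
        # Inter-statement context: self-attention over statement vectors.
        prog_layer = nn.TransformerEncoderLayer(
            d_model=embedding_size, nhead=num_heads, batch_first=True)
        self.prog_encoder = nn.TransformerEncoder(prog_layer, num_layers=num_layers)
        # Pair decoder: MLP over concatenated (s_i, s_j) statement vectors,
        # emitting two independent edge logits (assumed here: CFG and PDG).
        self.mlp = nn.Sequential(
            nn.Linear(2 * embedding_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 2))

    def forward(self, token_ids):
        # token_ids: (batch, max_stmts, max_tokens) of token ids.
        b, s, t = token_ids.shape
        x = self.tok_embed(token_ids).view(b * s, t, -1)
        x = self.stmt_encoder(x).mean(dim=1).view(b, s, -1)  # pool tokens -> statement vectors
        x = x + self.pos_embed(torch.arange(s, device=token_ids.device))
        x = self.prog_encoder(x)                             # contextualized statement vectors
        # Build every ordered statement pair (i, j) and decode edge logits.
        si = x.unsqueeze(2).expand(b, s, s, -1)
        sj = x.unsqueeze(1).expand(b, s, s, -1)
        return self.mlp(torch.cat([si, sj], dim=-1))         # (batch, s, s, 2)

model = StatementPairDecoder()
logits = model(torch.randint(0, 50265, (2, 8, 32)))  # -> shape (2, 8, 8, 2)

Each (i, j) cell of the output holds the logits for a directed dependence from statement i to statement j, which is what makes the formulation work on partial programs: no parsing or symbol resolution is required.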

Dataset Links

Here are the links to the datasets used in this paper:

  1. Java dataset for intrinsic evaluation: link
  2. C/C++ dataset for intrinsic evaluation: link
  3. Java dataset for method-level vulnerability detection: link

Pre-Trained Model/Tokenizer Asset Links

Here are the links for pre-trained RobertaTokenizer objects for Java and C/C++: link

Pre-trained NeuralPDA model weights (w/o statement types): link
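
Once downloaded, the tokenizer assets should load with the standard Hugging Face API. A minimal sketch, assuming you unpack the Java tokenizer to a local directory (the path below is a placeholder, not the repository's actual layout):

# Hypothetical loading sketch; "./tokenizers/java" is a placeholder path.
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("./tokenizers/java")
print(tokenizer.tokenize("int x = arr[i] + 1;"))  # subword tokens for one statement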

Getting Started with NeuralPDA

Run Instructions

$ python run.py [options]

Options:
  -h, --help            show this help message and exit
  --data_dir DATA_DIR   Path to datasets directory.
  --output_dir OUTPUT_DIR
                        The output directory where model checkpoints are saved.
  --lang {c,java}       Programming language.
  --expt_name EXPT_NAME
                        Name of experiment to log in Weights and Biases.
  --max_tokens MAX_TOKENS
                        Maximum number of tokens in a statement
  --max_stmts MAX_STMTS
                        Maximum number of statements
  --num_layers NUM_LAYERS
                        Number of layers for Transformer encoder
  --num_layers_stmt NUM_LAYERS_STMT
                        Number of layers for Self-Attention Network
  --forward_activation FORWARD_ACTIVATION
                        Non-linear activation function in encoder
  --hidden_size HIDDEN_SIZE
                        Hidden size of decoding MLP
  --intermediate_size INTERMEDIATE_SIZE
                        Dimensionality of feed-forward layer in Transformer
  --embedding_size EMBEDDING_SIZE
                        Dimensionality of encoder layers
  --num_heads NUM_HEADS
                        Number of attention heads
  --vocab_size VOCAB_SIZE
                        Vocabulary size
  --use_stmt_types      Use statement type information.
  --no_ssan             Do not use self-attention network for statement
                        encoding.
  --no_pe               Do not use statement-level position encoding.
  --no_tr               Do not use transformer encoder.
  --load_model_path LOAD_MODEL_PATH
                        Path to trained model: Should contain the .bin files
  --do_train            Whether to run training.
  --do_eval             Whether to run eval on the dev set.
  --do_eval_top_k       Whether to run eval on the partitioned dev set.
  --do_predict          Whether to predict on given dataset.
  --log_interval LOG_INTERVAL
  --train_batch_size TRAIN_BATCH_SIZE
                        Batch size per GPU/CPU for training.
  --eval_batch_size EVAL_BATCH_SIZE
                        Batch size per GPU/CPU for evaluation.
  --learning_rate LEARNING_RATE
                        The initial learning rate for Adam.
  --weight_decay WEIGHT_DECAY
                        Weight decay, if applied.
  --dropout_rate DROPOUT_RATE
                        Dropout rate.
  --adam_epsilon ADAM_EPSILON
                        Epsilon for Adam optimizer.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Total number of training epochs to perform.
  --seed SEED           Random seed for initialization.

Sample Commands for Replicating Experiments:

  1. Training
$ python run.py --data_dir ./datasets/ --output_dir ./outputs/intrinsic/java_8 --lang java --do_train --use_stmt_types --max_stmts 8 --expt_name intrinsic-java-8
  2. Inference
$ python run.py --data_dir ./datasets/ --output_dir ./no_output --lang java --do_eval --use_stmt_types --max_stmts 8 --load_model_path ./outputs/intrinsic/java_8/Epoch_4/model.ckpt
  3. Make Predictions
$ python run.py --lang java --do_predict --use_stmt_types --max_stmts 8 --load_model_path ./outputs/intrinsic/java_8/Epoch_4/model.ckpt

Working Demo:

$ python infer.py --lang java -i <path-to-input(s)> -o {json|html}
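
The exact JSON schema emitted by infer.py is not documented here. Purely as an illustration, a consumer might look like the following, assuming the output is an edge list over statement indices (the predictions.json filename and the edges/src/dst/type keys are assumptions, not the actual format):

# Hypothetical consumer of infer.py's JSON output. The schema below
# (an "edges" list of {src, dst, type} records over statement indices)
# is an assumption for illustration only.
import json

with open("predictions.json") as f:  # placeholder filename
    result = json.load(f)

for edge in result.get("edges", []):
    print(f'{edge["src"]} -> {edge["dst"]} ({edge["type"]})')  # e.g., 0 -> 1 (CFG)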


