Java code embeddings from compiled class files for code similarity tasks

Summary

A novel and simple approach for generating source code embeddings for code similarity tasks.

This compiler-in-the-loop approach compiles the high-level source code to a typed intermediate language. Here we demonstrate it for Java using the JVM instruction set. For other languages such as C/C++, the LLVM intermediate representation could be used.

We take the instruction sequence in each method and generate k-subsequences of instructions.

Extra type information is attached to 'invoke' instructions: each function call is abstracted by its parameter and return types.
The class name is attached to the 'new' instruction.
Parameter and return types from the function definition are currently not used, since they are not part of the instruction stream.
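
As a rough illustration of this type abstraction, the sketch below turns one disassembled instruction into a token. It assumes the text looks like a line of `javap -c` output with the leading offset removed; the function name `abstract_instruction`, the token format, and the choice to drop operands of other instructions are all hypothetical and not taken from this repository.

```python
import re

# Assumption: each instruction is one line of `javap -c` output with the
# leading "<offset>:" already removed, e.g.
#   invokevirtual #4   // Method java/io/PrintStream.println:(Ljava/lang/String;)V
_DESCRIPTOR_RE = re.compile(r":(\([^)]*\)\S+)\s*$")
_CLASS_RE = re.compile(r"//\s*class\s+(\S+)\s*$")


def abstract_instruction(text):
    """Return a token for one instruction (hypothetical token format)."""
    parts = text.split()
    if not parts:
        return ""
    opcode = parts[0]
    if opcode.startswith("invoke"):
        match = _DESCRIPTOR_RE.search(text)
        if match:
            # Keep only the parameter and return types, dropping owner and name.
            return opcode + match.group(1)
    elif opcode == "new":
        match = _CLASS_RE.search(text)
        if match:
            return "new " + match.group(1)
    # Other instructions: keep the opcode and drop operands (an assumption).
    return opcode
```

With this sketch, two calls to different methods that share a signature map to the same token, which is the point of abstracting calls by type.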

k-subsequences of instructions:

For k = 1 .. N (currently N = 2):
- We take every k-subsequence in the instruction sequence, generating a k-gram.
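
The k-gram generation step can be sketched as follows. This is a minimal illustration that assumes a k-subsequence means k consecutive instructions; the function name `k_grams` and the tuple representation are hypothetical.

```python
def k_grams(instructions, max_k=2):
    """Return every window of k consecutive instructions, for k = 1 .. max_k."""
    grams = []
    for k in range(1, max_k + 1):
        for start in range(len(instructions) - k + 1):
            grams.append(tuple(instructions[start:start + k]))
    return grams


# Example: a method body that adds its two int arguments.
grams = k_grams(["iload_1", "iload_2", "iadd"])
```

For the three-instruction example, this yields three 1-grams followed by two 2-grams.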

We experiment with 4 approaches:

- Subsequence k-gram embeddings generated by 3 methods:
  - Random walks on the control flow graph (CFG) of a method, similar to graph vertex embeddings.
  - The entire instruction sequence of a method, without any path sensitivity.
  - Multi-task learning jointly on the above 2 tasks.
- TF-IDF-style method embeddings.

Path sensitive k-gram embeddings

Here, we generate path sequences via random walks on the control flow graph (CFG). If the number of paths is small, complete walks are performed. We then learn k-gram embeddings from these path sequences using a Skip-Gram model, similar to graph vertex embeddings.
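
One possible reading of the path generation step is sketched below. It assumes the CFG is a dict mapping each basic block id to its successor ids, and that `blocks` maps each id to its instruction list. All names (`all_paths`, `random_walks`, `path_sequences`) and the default limits are hypothetical; the repository's own `cfg_embedding.py` may work differently.

```python
import random


def all_paths(cfg, entry, limit):
    """List paths from `entry` that never revisit a block.

    A path ends at a block with no unvisited successor. Returns None as soon
    as more than `limit` paths are found.
    """
    paths, stack = [], [[entry]]
    while stack:
        path = stack.pop()
        successors = [s for s in cfg.get(path[-1], []) if s not in path]
        if not successors:
            paths.append(path)
            if len(paths) > limit:
                return None
        for succ in successors:
            stack.append(path + [succ])
    return paths


def random_walks(cfg, entry, num_walks, max_len, rng):
    """Sample walks that follow a uniformly random successor at each block."""
    walks = []
    for _ in range(num_walks):
        walk = [entry]
        while cfg.get(walk[-1]) and len(walk) < max_len:
            walk.append(rng.choice(cfg[walk[-1]]))
        walks.append(walk)
    return walks


def path_sequences(cfg, blocks, entry=0, limit=100, num_walks=50,
                   max_len=200, seed=0):
    """Turn block-level paths into instruction sequences."""
    paths = all_paths(cfg, entry, limit)
    if paths is None:  # too many paths: fall back to sampling
        paths = random_walks(cfg, entry, num_walks, max_len, random.Random(seed))
    return [[ins for block in path for ins in blocks[block]] for path in paths]
```

Capping walk length keeps sampling bounded on methods with loops, where a random walk could otherwise cycle for a long time.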

A method embedding is generated by summing all of its path embeddings, and each path embedding is generated by summing all the k-gram embeddings in the path.

Method similarity checking is done by computing vector similarity on method embeddings.
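
The aggregation and comparison described above can be sketched in plain Python. The choice of cosine similarity as the vector similarity is an assumption here, as is skipping k-grams that have no learnt embedding; the names `sum_vectors`, `method_embedding` and `cosine` are hypothetical.

```python
import math


def sum_vectors(vectors, dim):
    """Element-wise sum of equal-length vectors (zero vector if none)."""
    total = [0.0] * dim
    for vec in vectors:
        for i, x in enumerate(vec):
            total[i] += x
    return total


def method_embedding(paths, kgram_vecs, dim):
    """Sum k-gram vectors into path vectors, then path vectors into one vector."""
    path_vecs = [
        # Assumption: k-grams without a learnt embedding are skipped.
        sum_vectors((kgram_vecs[g] for g in path if g in kgram_vecs), dim)
        for path in paths
    ]
    return sum_vectors(path_vecs, dim)


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Because summation is not normalized, methods with more paths produce longer vectors; cosine similarity ignores that length and compares direction only.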

Path insensitive k-gram embeddings

In this approach, embeddings for the k-grams of a method's instruction sequence are learnt using a Word2Vec-style skip-gram model (currently k <= 2). A method embedding is generated by summing the embeddings of the k-grams it contains.
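
A skip-gram model is trained on (center, context) pairs drawn from a sliding window. The sketch below shows how such pairs could be produced from a token sequence; the window size and the name `skipgram_pairs` are assumptions, and how the repository's trainer actually arranges its tokens is not specified here.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each token with its neighbours up to `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["iload_1", "iload_2", "iadd"], window=1)
```

Tokens that tend to share neighbours end up with similar vectors, which is what lets summed k-gram vectors reflect similar instruction patterns.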

Method similarity checking is done by computing vector similarity on method embeddings.

Multi-Task Learning (MTL) of k-gram embeddings

Here, k-gram embeddings are learnt by a Multi-Task Skip-Gram Model that jointly optimizes on the above two tasks: the path sensitive learning and path insensitive learning.
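
To make the idea of joint optimization concrete, here is a small sketch of a combined objective in plain Python. It assumes one shared input embedding table, a separate output table per task, and a weighted sum of the two losses; none of this is confirmed by the source. Negative sampling and the gradient step are omitted, and all names are hypothetical.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pair_loss(shared, out, pairs):
    """Mean of -log(sigmoid(score)) over observed (center, context) pairs."""
    total = 0.0
    for center, context in pairs:
        score = _dot(shared[center], out[context])
        total += math.log1p(math.exp(-score))  # equals -log(sigmoid(score))
    return total / max(len(pairs), 1)


def joint_loss(shared, out_insensitive, out_sensitive,
               pairs_insensitive, pairs_sensitive, weight=0.5):
    """Weighted sum of the two task losses over the shared embeddings."""
    return (weight * pair_loss(shared, out_insensitive, pairs_insensitive)
            + (1.0 - weight) * pair_loss(shared, out_sensitive, pairs_sensitive))
```

Since both terms depend on the same `shared` table, minimizing the sum pushes the k-gram embeddings to fit both the path sensitive and the path insensitive contexts.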

TF-IDF embeddings

In this approach, during the learning phase, the IDF values for the features are learnt and stored in a JSON file.

During similarity checking, the TF vectors are generated and scaled using the previously learnt IDF values. Cosine similarity is used as the similarity measure.
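
The two phases can be sketched end to end as below. The features are assumed to be strings (for example opcode tokens), and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 is chosen for illustration only; the repository's `compute_idf.py` may use a different formula and JSON layout.

```python
import json
import math
from collections import Counter


def compute_idf(methods):
    """methods: one list of feature strings per method."""
    n = len(methods)
    doc_freq = Counter()
    for features in methods:
        doc_freq.update(set(features))
    # Smoothed IDF; the exact formula used by the repository is not stated.
    return {f: math.log((1 + n) / (1 + df)) + 1.0 for f, df in doc_freq.items()}


def tfidf_vector(features, idf):
    """Scale raw term counts by IDF; features missing from the table are ignored."""
    counts = Counter(features)
    return {f: c * idf[f] for f, c in counts.items() if f in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Learning phase: compute IDF values and serialize them as JSON.
corpus = [["iload_1", "iload_2", "iadd", "ireturn"], ["iload_1", "ireturn"]]
idf_json = json.dumps(compute_idf(corpus))

# Similarity phase: reload the IDF values and compare two methods.
idf = json.loads(idf_json)
similarity = cosine(tfidf_vector(corpus[0], idf), tfidf_vector(corpus[1], idf))
```

Storing the IDF table separately means similarity checks on new class files reuse statistics from a larger corpus instead of recomputing them.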

Pre-requisites

A recent version of Python 3.
A recent version of the JDK, with javap on the PATH (javap is used to generate the JVM disassembly).
scikit-learn: pip install scikit-learn
pytorch: pip install torch (not required for TF-IDF embeddings)
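
To check the javap prerequisite from Python, something like the following could be used. The helper names are hypothetical, and the flags shown (`-c` to print bytecode, `-p` to include private members) are one reasonable choice rather than the flags this repository necessarily passes.

```python
import shutil
import subprocess


def javap_command(class_file):
    # -c prints the bytecode of each method, -p includes private members.
    return ["javap", "-c", "-p", str(class_file)]


def disassemble(class_file):
    """Return the javap disassembly of one class file as text."""
    if shutil.which("javap") is None:
        raise RuntimeError("javap not found on PATH; install a JDK")
    result = subprocess.run(javap_command(class_file),
                            capture_output=True, text=True, check=True)
    return result.stdout
```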

Running (n-subsequence embeddings)

Embedding generation (path insensitive)

python compute_nsubseqs.py <folder containing class files> <subseq output path>
cd word2vec
python trainer.py <subseq output path> <vec output file path>

Embedding generation (path sensitive)

python cfg_embedding.py <folder containing class files> <subseq output path>
cd word2vec
python trainer.py <subseq output path> <vec output file path>

MTL Embedding generation

cd word2vec-mtl
python mtl-trainer.py <path insensitive subseq output path> <path sensitive subseq output path> <vec output file path>

Similarity checking

python compute_nsubseq_emb_similarity.py <folder containing class files> <vec file path>

To run against the test files using the pretrained vectors from commons-lang library:

cd test
javac *.java
cd ..
python compute_nsubseq_emb_similarity.py test test/commons_lang_ngrams.vec

Running (TF-IDF embeddings)

IDF generation:

python compute_idf.py <folder containing class files> <IDF output path>

The folder containing class files is recursively searched for class files and the IDF is computed by aggregating data from all methods in all the class files.
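
The recursive search can be sketched with the standard library as below; the helper name `find_class_files` is hypothetical.

```python
from pathlib import Path


def find_class_files(root):
    """Return every .class file under `root`, at any depth, in sorted order."""
    return sorted(Path(root).rglob("*.class"))
```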

Similarity checking

python compute_tf_idf_similarity.py <folder containing class files> <IDF path>

The IDF path must point to a previously computed IDF file. All the class files are read and the pair-wise similarity of all methods is printed.

To run against the test files:

cd test
javac *.java
cd ..
python compute_tf_idf_similarity.py test test/idf_commons_lang.json

Pre-computed

The file test/idf_commons_lang.json contains IDF computed from all the class files in the Apache Commons Lang library.
The file test/commons_lang_ngrams.vec contains unary and binary subsequence embeddings trained from all the class files in the Apache Commons Lang library.

Citing

If you are using or extending this work as part of your research, please cite as:

Poroor, Jayaraj, "Java code embeddings from compiled class files for code similarity tasks", (2021), GitHub repository, https://github.com/jayarajporoor/code_embedding

BibTeX:

@misc{Poroor2021,
   author = {Poroor, Jayaraj},
   title = {Java code embeddings from compiled class files for code similarity tasks},
   year = {2021},
   publisher = {GitHub},
   journal = {GitHub repository},
   howpublished = {\url{https://github.com/jayarajporoor/code_embedding}}
}

Related work

A few deep learning models have been proposed in recent years to generate source code embeddings:

Code2Vec - https://github.com/tech-srl/code2vec
CodeBERT - https://github.com/microsoft/CodeBERT
