Tree Sitter Multi Codeview Generator

Tree Sitter Multi Codeview Generator aims to generate combined multi-code view graphs that can be used with various types of machine learning models (sequence models, graph neural networks, etc.). It is also designed to be easily extended to various source code languages. Parsing is done with tree-sitter, which is highly efficient and supports over 40 languages. Currently, this repository supports codeviews for Java in over 40 possible combinations of codeviews. It has been structured so that support for other languages can be easily added. If you wish to add support for more languages, please refer to the contributing guide.

Comex

comex is a rebuild of Tree Sitter Multi Codeview Generator for easier invocation as a Python package. This rebuild also includes a command-line interface (CLI) for easier usage. It isolates the logic pertaining to the generation and combination of codeviews to better differentiate the tasks involved in the IBM OSCP Project.

Installation

comex is published on the Python Package Index (PyPI) and can be installed via pip:

pip install comex

Note: GraphViz (dot) must be installed for the graph visualizations to be generated.
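A quick way to confirm the GraphViz requirement is met is to check whether a `dot` executable is on the PATH. This is a minimal sketch using only the Python standard library; the helper name `graphviz_dot_available` is hypothetical and not part of comex.

```python
import shutil


def graphviz_dot_available() -> bool:
    """Return True if a `dot` executable can be found on the PATH."""
    # shutil.which returns the executable's full path, or None if not found.
    return shutil.which("dot") is not None


if __name__ == "__main__":
    if graphviz_dot_available():
        print("GraphViz `dot` found; graph visualizations can be rendered.")
    else:
        print("GraphViz `dot` not found; install GraphViz first.")
```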

To set up comex for development from the source code in your Python environment:

pip install -r requirements-dev.txt

Note: Please clone recursively so that the submodules are set up correctly:

git clone --recursive {...}

This performs an editable install, meaning that comex is available throughout your environment (particularly relevant if you use conda or something similar). You can then import from comex just like any other package, while changes to the source code are picked up automatically without any further manual steps.

Usage as a CLI

This is the recommended way to get started with comex, as it is the most user-friendly.

The attributes and options supported by the CLI are well documented and can be viewed by running:

comex --help

For example, to generate a combined CFG and DFG graph for a Java file, you can run:

comex --lang "java" --code-file ./test.java --graphs "cfg,dfg"
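When the CLI needs to be called from a script (for instance, over many files), the same command can be assembled programmatically. The sketch below uses only the three flags shown above (`--lang`, `--code-file`, `--graphs`); the helper names `build_comex_command` and `run_comex` are hypothetical and not part of comex.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_comex_command(lang: str, code_file: str, graphs: Sequence[str]) -> List[str]:
    """Build the argument list for the comex CLI from the flags shown above."""
    # The CLI expects the graph names as a single comma-separated value.
    return ["comex", "--lang", lang, "--code-file", code_file, "--graphs", ",".join(graphs)]


def run_comex(lang: str, code_file: str, graphs: Sequence[str]) -> subprocess.CompletedProcess:
    """Run comex as a subprocess, failing early if it is not installed."""
    if shutil.which("comex") is None:
        raise RuntimeError("comex is not installed or not on the PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(build_comex_command(lang, code_file, graphs), check=True)


if __name__ == "__main__":
    print(build_comex_command("java", "./test.java", ["cfg", "dfg"]))
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems with file paths that contain spaces.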

Usage as a Python Package

The comex package can be used by importing required drivers as follows:

from comex.codeviews.combined_graph.combined_driver import CombinedDriver

CombinedDriver(
    src_language=lang,
    src_code=code,
    output_file="output.json",
    graph_format=output,
    codeviews=codeviews
)

In most cases the required combination can be obtained via the combined_driver module as shown above.

src_language: denotes one of the supported languages, currently "java" or "cs"

src_code: denotes the source code to be parsed

output_file: denotes the output file to which the generated graph is written

graph_format: denotes the format of the output graph. Currently supported formats are "dot" and "json". To generate both, pass "all"

codeviews: refers to the configuration passed for each codeview
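The parameters described above can be validated and assembled before the driver is called. The following is a minimal sketch, not the package's own code: the helper `build_driver_kwargs` is hypothetical, the accepted values are taken from the descriptions above, and the contents of `codeviews` are passed through untouched because their structure is not specified here.

```python
from pathlib import Path
from typing import Any, Dict

# Values taken from the parameter descriptions above.
SUPPORTED_LANGUAGES = {"java", "cs"}
SUPPORTED_FORMATS = {"dot", "json", "all"}


def build_driver_kwargs(
    src_language: str,
    code_path: str,
    output_file: str,
    graph_format: str,
    codeviews: Dict[str, Any],
) -> Dict[str, Any]:
    """Validate the documented options and read the source file (hypothetical helper)."""
    if src_language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported src_language: {src_language!r}")
    if graph_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported graph_format: {graph_format!r}")
    return {
        "src_language": src_language,
        # src_code is the source text itself, not a path.
        "src_code": Path(code_path).read_text(encoding="utf-8"),
        "output_file": output_file,
        "graph_format": graph_format,
        # ASSUMPTION: the codeviews configuration is supplied by the caller as-is.
        "codeviews": codeviews,
    }
```

With comex installed, the returned dictionary could then be unpacked into the driver, for example `CombinedDriver(**build_driver_kwargs("java", "./test.java", "output.json", "json", codeviews))`.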

Output Example:

Combined simple AST+CFG+DFG for a simple Java program that finds the maximum among 2 numbers:

Code Organization

The code is structured in the following way:

For each code-view, the source code is first parsed using the tree-sitter parser, and then the various code-views are generated. In the tree_parser directory, the Parser and ParserDriver are implemented with various functionalities commonly required by all code-views. Language-specific features are further developed in the language-specific parsers, which are also placed in this directory.
The codeviews directory contains the core logic for the various codeviews. Each codeview has a driver class and a codeview class, which is further inherited and extended per language in the case of code-views that require language-specific implementation.
The cli.py file is the CLI implementation; it is responsible for parsing the source code and generating the codeviews. The drivers can also be imported directly and used like a Python package.
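The driver/codeview split described above can be illustrated with a toy example. This is only a sketch of the general pattern (a shared base class, per-language subclasses, and a driver that selects between them); every class name here is hypothetical and none of this is comex's actual code.

```python
from typing import Dict, List, Type


class ToyCodeview:
    """Hypothetical language-agnostic base: shared logic lives here."""

    def __init__(self, src_code: str) -> None:
        self.src_code = src_code

    def node_labels(self) -> List[str]:
        # Shared logic: one node per non-empty line, labelled by a language hook.
        lines = (line.strip() for line in self.src_code.splitlines())
        return [self.label(line) for line in lines if line]

    def label(self, line: str) -> str:
        raise NotImplementedError("language-specific subclasses provide this")


class ToyCodeviewJava(ToyCodeview):
    """Hypothetical Java-specific extension of the shared codeview."""

    def label(self, line: str) -> str:
        return f"java:{line}"


class ToyCodeviewCSharp(ToyCodeview):
    """Hypothetical C#-specific extension of the shared codeview."""

    def label(self, line: str) -> str:
        return f"cs:{line}"


class ToyDriver:
    """Hypothetical driver: picks the codeview class for the requested language."""

    _CODEVIEWS: Dict[str, Type[ToyCodeview]] = {
        "java": ToyCodeviewJava,
        "cs": ToyCodeviewCSharp,
    }

    def __init__(self, src_language: str, src_code: str) -> None:
        if src_language not in self._CODEVIEWS:
            raise ValueError(f"unsupported language: {src_language!r}")
        self.codeview = self._CODEVIEWS[src_language](src_code)
        self.nodes = self.codeview.node_labels()
```

Under this pattern, adding a language means adding one subclass and one entry in the driver's lookup table, while the shared logic in the base class stays untouched.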

Testing

The repo is set up to automatically run CI tests on pull requests to the main and development branches. To test locally:

Run specific test

Say you wish to run the test_cfg function on a single case.
The '[...]' part is formatted as [extension-filename]; drop it to run all tests in a file.
The --no-cov flag prevents the coverage report from being printed.

pytest -k 'test_cfg[cs-test7]' --no-cov
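The `[extension-filename]` selector can also be derived from a sample file name. The sketch below assumes that the identifier corresponds to a file such as `test7.cs`, which is an inference from the format described above rather than something confirmed by the test suite; the helper names are hypothetical.

```python
from pathlib import Path
from typing import List


def selector_for(test_function: str, sample_file: str) -> str:
    """Build a pytest -k selector such as 'test_cfg[cs-test7]' (hypothetical helper)."""
    path = Path(sample_file)
    extension = path.suffix.lstrip(".")  # "test7.cs" -> "cs"
    return f"{test_function}[{extension}-{path.stem}]"


def pytest_command(test_function: str, sample_file: str, verbose: bool = False) -> List[str]:
    """Assemble the pytest invocation shown above as an argument list."""
    command = ["pytest", "-k", selector_for(test_function, sample_file), "--no-cov"]
    if verbose:
        command.append("-vv")
    return command


if __name__ == "__main__":
    print(" ".join(pytest_command("test_cfg", "test7.cs")))
```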

Run all tests and get coverage report

pytest

Analyze the deviation report given by deepdiff by using the verbose output. This helps to quickly identify differences from the gold file.

pytest -k 'test_cfg[cs-test7]' --no-cov -vv

Publishing

Make sure to bump the version in setup.cfg.
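Before building, it can be useful to confirm which version will be published. The sketch below reads it with the standard-library `configparser`, assuming the version is stored as a literal `version` key in the `[metadata]` section of setup.cfg (the usual setuptools layout, but an assumption about this repository); the helper name is hypothetical.

```python
import configparser


def read_package_version(setup_cfg_path: str = "setup.cfg") -> str:
    """Return the `version` value from the [metadata] section of setup.cfg."""
    parser = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it managed to parse.
    if not parser.read(setup_cfg_path, encoding="utf-8"):
        raise FileNotFoundError(f"could not read {setup_cfg_path}")
    # ASSUMPTION: the version is a literal under [metadata], not an `attr:` reference.
    return parser.get("metadata", "version")


if __name__ == "__main__":
    try:
        print(f"Version to be published: {read_package_version()}")
    except (FileNotFoundError, configparser.Error) as error:
        print(f"Could not determine version: {error}")
```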

Then run the following commands:

rm -rf build dist
python setup.py sdist bdist_wheel

Then upload it to PyPI using twine (pip install twine if not installed):

twine upload dist/*

About the IBM OSCP Project

This tool was developed for research purposes as a part of the OSCP Project. Efficient representation of source code is essential for various software engineering tasks using AI pipelines, such as code translation, code search and code clone detection. Code representation aims at extracting both the syntactic and semantic features of source code and representing them as a vector that can be readily used for downstream tasks. Multiple works attempt to encode code as sequential data in order to easily leverage state-of-the-art neural network models like transformers, but this leads to a loss of information. Graphs are a natural representation for code, but very few works (MVG-AAAI’22) have tried to represent the different code features obtained from different code views, like the Program Dependency Graph, Data Flow Graph, etc., as a multi-view graph. In this work, we want to explore more code views and their relevance to different code tasks, as well as leverage transformer models for the multi-code view graphs. We believe such work will help to:

Establish influence of specific code views for common tasks
Demonstrate how graphs can be combined with transformers
Create re-usable models

Team

This tool is based on the ongoing joint research effort between IBM and Risha Lab at IIT Tirupati to explore the effects of different code representations on code based tasks involving:

IBM / tree-sitter-codeviews