Linguado

Linguado is a tool which compares the abstract syntax trees (AST) of two or more scripts to measure the similarity. The main goal intended for this tool is to detect two variants of the same malware.

Background

This tool was developed by Guzmán Cernadas Pérez (@DonCaralludo) working for BE:SEC (@BESEC_byEmetel). It was shown by Marcos Carro Fernández and Guzmán Cernadas Pérez at the VICON in april 2023.

Installation

From pypi:

pip3 install linguado

From repo:

git clone https://github.com/caralludo/linguado.git
cd linguado
pip3 install .

Usage

In order to execute this tool, you have to have two or more source codes in different files, and you have to know in which language they are made.

The help is as follows:

usage: main.py [-h] [-p PAR] [-o OUTPUT] Files [Files ...] Language

Linguado is a tool which compares the AST of two or more files. Created by Guzmán Cernadas Pérez (@DonCaralludo)
working for BE:SEC (@BESEC_byEmetel)

positional arguments:
  Files                 Files to analyze
  Language              Language of the files. Options: javascript, php, python2, python3, vba

options:
  -h, --help            show this help message and exit
  -p PAR, --par PAR     Changes the number of iterations of the Weisfeiler-Lehman algorithm (default: 3)
  -o OUTPUT, --output OUTPUT
                        Changes the base name of the output files (default: result.csv)

Examples

Compare two source codes made in python3:

linguado source1.py source2.py python3

Compare two or more files made in python3:

linguado source* python3

Compare two or more files made in python3 and change the number of iterations of the Weisfeiler-Lehman algorithm:

linguado source* python3 -p 10

Compare two source codes made in python3 and changing the output name:

linguado source1.py source2.py python3 -o output.csv

Available languages

For the moment, the tool can compare the following programming languages:

JavaScript
PHP
Python2
Python3
VBA

Adding new languages

To add a new language you have to do the following steps:

Install ANTLR
Create or obtain a grammar in ANTLR4 format.
Generate the files with the following command:

antlr4 -Dlanguage=Python3 *.g4

Save the files in a new folder in the path ./linguado/[language name]
Import the Lexer and Parser in the file linguado/main.py

from mygrammar.MyGrammarLexer import MyGrammarLexer
from mygrammar.MyGrammarParser import MyGrammarParser
from javascript.JavaScriptLexer import JavaScriptLexer
from javascript.JavaScriptParser import JavaScriptParser
from php.PhpLexer import PhpLexer
from php.PhpParser import PhpParser
from python2.Python2Lexer import Python2Lexer
from python2.Python2Parser import Python2Parser
from python3.Python3Lexer import Python3Lexer
from python3.Python3Parser import Python3Parser
from vba.vbaLexer import vbaLexer
from vba.vbaParser import vbaParser

Modify the dictionary in the file linguado/main.py putting the lerxer, the parser and the first rule of the grammar

    language_functions = {
        "javascript": [JavaScriptLexer, JavaScriptParser, "program"],
        "mygrammar": [MyGrammarLexer, MyGrammarParser, "first_rule"],
        "php": [PhpLexer, PhpParser, "htmlDocument"],
        "python2": [Python2Lexer, Python2Parser, "file_input"],
        "python3": [Python3Lexer, Python3Parser, "file_input"],
        "vba": [vbaLexer, vbaParser, "startRule"]
    }

Output

A possible output example could be:

Generating AST's
100%|██████████| 4/4 [00:03<00:00,  1.16it/s]
Calculating Weisfeiler-Lehman matrix
100%|██████████| 3/3 [00:00<00:00, 28.09it/s]
Checking isomorphism (igraph)
100%|██████████| 4/4 [00:02<00:00,  1.98it/s]
Weisfeiler-Lehman:
[[58162880. 58162880. 58162880. 58162880.]
 [58162880. 58162880. 58162880. 58162880.]
 [58162880. 58162880. 58162880. 58162880.]
 [58162880. 58162880. 58162880. 58162880.]]
Weisfeiler-Lehman (%):
[[100. 100. 100. 100.]
 [100. 100. 100. 100.]
 [100. 100. 100. 100.]
 [100. 100. 100. 100.]]
Mean: 58162880.0 , Standard deviation: +- 0.0 ,  0.0
Isomorphism test (igraph):
[[ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]

In each matrix, the columns represents each source code file ordered by name, and each row represents the source code file ordered by name. So, in each intersection is represented the comparation between the two files.

             source1.py  source2.py  source3.py  source4.py
source1.py [[ 58162880.   58162880.   58162880.   58162880.]
source2.py  [ 58162880.   58162880.   58162880.   58162880.]
source3.py  [ 58162880.   58162880.   58162880.   58162880.]
source4.py  [ 58162880.   58162880.   58162880.   58162880.]]

           source1.py  source2.py  source3.py  source4.py
source1.py [[ True        True        True        True]
source2.py  [ True        True        True        True]
source3.py  [ True        True        True        True]
source4.py  [ True        True        True        True]]

Also, the tool creates two csv files with the same information in the terminal.

Measuring similarity

Two codes will have the same abstract syntax tree if:

The isomorphism test matrix has a True in the intersection of the two sources.

Two codes will not have the same abstract syntax tree if:

The Weisfeiler-Lehman matrix has different values.

If the sources do not have the same abstract syntax tree, we can use the standard deviation to know if they are similar:

If the standard deviation is close to zero (less than 5%), then the sources will be very similar.
If the standard deviation is around the 20%, then could be a chance that the sources are sharing some code.
If the standard deviation is more than 50%, then the sources will not be the same.

Behavior

Generates the abstract syntax tree with ANTLR4.
From the abstract syntax tree generates a graph which we can work with.
Calculates the Weisfeiler-Lehman matrix.
Performs the isomorphism test (igraph).
Prints on the screen and writes in a CSV the results of the Weisfeiler-Lehman algorithm and the isomorphism test.