SplitRegex: Regular Expression Synthesis via Divide-and-Conquer Approach

SplitRegex is a divide-and-conquer framework for learning target regexes. It splits (divides) the positive strings and infers partial regexes for the resulting parts, which is much more accurate than inferring a regex from the whole strings, and then concatenates (conquers) the inferred regexes while rejecting the negative strings.

This repo implements the SplitRegex framework and provides the datasets used in the experiments.

Setting

Python 3.9.7 is preferred.

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
python setup.py install
cd submodels  
git submodule update --init --recursive
cd ..

Data

sh shell_script/data_generate.sh

Downloads the raw data and transforms it into a usable form.
The random dataset contains sizes 2, 4, 6, 8, and 10.
The practical dataset contains 'Snort', 'Regexlib', and 'Polyglot'. We replace some quantifiers with the Kleene star and character sets with a custom alphabet.
Each data item is given as (20 positive strings, 20 negative strings, 20 labelled strings, regular expression).

Example

Regular expression : $a^* b^? a$

String	Labelled string
aaab	0001
aaba	0012
ba	12
aaa	002
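The labelling in the example can be reproduced with a short sketch. It is only an illustration and not the repo's data generation code: the function name `label_string` is hypothetical, Python's `re` syntax stands in for the regex notation above, and each subexpression is assumed to contain no capture groups of its own. Each character receives the index of the subexpression that matched it.

```python
import re

def label_string(subregexes, s):
    """Label each character of s with the index of the subexpression matching it.

    Hypothetical helper for illustration; assumes the subregexes contain
    no capture groups and that there are at most 10 of them.
    """
    # Wrap every subexpression in its own capture group, then match the whole string.
    pattern = "".join(f"({r})" for r in subregexes)
    match = re.fullmatch(pattern, s)
    if match is None:
        return None  # s is not accepted by the concatenated regex
    # Repeat the group index once per character consumed by that group.
    return "".join(str(i) * len(g) for i, g in enumerate(match.groups()))

# a* b? a split into its three subexpressions
parts = ["a*", "b?", "a"]
print(label_string(parts, "aaba"))  # 0012
print(label_string(parts, "ba"))    # 12
print(label_string(parts, "aaa"))   # 002
```

When a string can be split in more than one way, this sketch returns whichever split the backtracking matcher finds first.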

Split Model (train.py)

sh shell_script/practical_train.sh
sh shell_script/random_train.sh

Description

Generates a set of labelled strings from a set of strings by splitting each string to determine the boundaries of the subexpressions.
Each data item is given as (10 positive strings, 10 labelled strings, regular expression).
Saves the trained model as 'model.pt' in saved_models/.
Acc is the accuracy of the prediction against the data, while Acc (RE) is the accuracy of the prediction against the sub regular expressions.

Overall Synthesis Architecture (synthesis.py)

sh shell_script/synthesis.sh

Description

Infers the regex from a set of positive strings and a set of negative strings.
Each data item is given as (10 positive strings, 10 negative strings, regular expression).
Compares the divide-and-conquer approach with the naive synthesis approach in terms of time and success rate.
The synthesis output is stored in log_data/.

Synthesis process

Split each positive string and negative string using the trained split model.
Generate a subregex from each group of substrings using one of the submodels.
Build the regex by concatenating the subregexes.
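The three steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions and not the code in synthesis.py: the labels are supplied by hand instead of coming from the trained split model, `naive_subregex` is a hypothetical stand-in for the submodels that simply takes the union of the observed substrings, and all function names are invented for this example.

```python
import re

def split_by_labels(s, labels, n_parts):
    """Step 1 (stand-in): cut s into n_parts substrings using a label string."""
    parts = [""] * n_parts
    for ch, lab in zip(s, labels):
        parts[int(lab)] += ch
    return parts

def naive_subregex(substrings):
    """Step 2 (stand-in for a submodel): alternation of the observed substrings."""
    alternatives = sorted(set(substrings))
    return "(?:" + "|".join(re.escape(a) for a in alternatives) + ")"

def synthesize(positives, labels, negatives, n_parts):
    """Step 3: concatenate the subregexes and check them against the examples."""
    columns = [[] for _ in range(n_parts)]
    for s, lab in zip(positives, labels):
        for i, part in enumerate(split_by_labels(s, lab, n_parts)):
            columns[i].append(part)
    regex = "".join(naive_subregex(col) for col in columns)
    accepts_all_pos = all(re.fullmatch(regex, p) for p in positives)
    rejects_all_neg = not any(re.fullmatch(regex, n) for n in negatives)
    return regex, accepts_all_pos and rejects_all_neg

positives = ["aaba", "ba", "aaa"]
labels = ["0012", "12", "002"]   # hand-written here; predicted by the split model in practice
negatives = ["bb", "ab"]
regex, ok = synthesize(positives, labels, negatives, n_parts=3)
print(regex, ok)
```

The union-based stand-in only memorises the substrings it has seen, so it does not generalise the way a real subregex synthesizer would. It is used here only to make the split, infer, concatenate flow concrete.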

Show the result (debug.py)

sh shell_script/debug.sh

Acknowledgment

This product includes software (the seq2seq base model) developed at https://github.com/IBM/pytorch-seq2seq
This product uses the fado module from https://github.com/0xnurl/fado-python3
This product refers to the set2regex module from https://github.com/woaksths/set2regex
