mrseongminkim / SplitRegex

SplitRegex: Regular Expression Synthesis via Divide-and-Conquer Approach

SplitRegex is a divide-and-conquer framework for learning target regexes. It splits (divides) the positive strings into parts and infers a partial regex for each part, which is much more accurate than inferring a regex from the whole strings. It then concatenates (conquers) the inferred regexes while making sure the result rejects the negative strings.

This repo implements the SplitRegex framework and provides the datasets used in the experiments.


[Figure: model architecture]



Setting

  • Python 3.9.7 is preferred.
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
python setup.py install
cd submodels  
git submodule update --init --recursive
cd ..



Data

sh shell_script/data_generate.sh
  • Downloads the raw data and transforms it into a usable form.
  • The random dataset contains sizes 2, 4, 6, 8, and 10.
  • The practical dataset contains 'Snort', 'Regexlib', and 'Polyglot'. We replace some quantifiers with the Kleene star and character sets with a custom alphabet.
  • Each data item is given as (20 positive strings, 20 negative strings, 20 labelled strings, regular expression).

Example

Regular expression :

String    Labelled string
aaab      0001
aaba      0012
ba        12
aaa       002
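
In the example above, each digit of a labelled string appears to give the index of the subexpression that matches the character at the same position. A minimal sketch of turning such a pair into substrings, assuming single-digit labels (fewer than ten subexpressions); `split_by_labels` is a hypothetical helper name, not a function from this repo:

```python
from itertools import groupby


def split_by_labels(string, labels):
    """Group consecutive characters that share the same label digit.

    Assumes one label digit per character (hypothetical helper).
    """
    if len(string) != len(labels):
        raise ValueError("string and labels must have the same length")
    pairs = zip(labels, string)
    return [
        (int(label), "".join(ch for _, ch in group))
        for label, group in groupby(pairs, key=lambda pair: pair[0])
    ]


print(split_by_labels("aaba", "0012"))  # [(0, 'aa'), (1, 'b'), (2, 'a')]
print(split_by_labels("aaa", "002"))    # [(0, 'aa'), (2, 'a')]
```

A label that never occurs (such as 1 in the second call) simply means that subexpression matched the empty string.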



Split Model (train.py)

NeuralSplitter

sh shell_script/practical_train.sh
sh shell_script/random_train.sh

Description

  • Generates a set of labelled strings from a set of strings by splitting each string to determine the boundaries of the subexpressions.
  • Each data item is given as (10 positive strings, 10 labelled strings, regular expression).
  • Saves the trained model as 'model.pt' in saved_models/.
  • Acc is the accuracy of the predictions against the data, while Acc (RE) is the accuracy of the predictions against the sub regular expressions.
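
How train.py computes the two metrics is not spelled out here, so the following is only one plausible reading, sketched under assumptions: Acc as the fraction of strings whose predicted labels exactly equal the labels in the data, and Acc (RE) as the fraction of strings whose predicted split is still consistent with the subexpressions. The function names and the toy subexpressions `a*` and `a*b` are hypothetical.

```python
import re


def label_accuracy(gold_labels, pred_labels):
    """Fraction of strings whose predicted label string equals the gold one."""
    matches = sum(g == p for g, p in zip(gold_labels, pred_labels))
    return matches / len(gold_labels)


def regex_accuracy(strings, pred_labels, subregexes):
    """Fraction of strings whose predicted parts each match their subregex."""
    consistent = 0
    for string, labels in zip(strings, pred_labels):
        # A usable split has one in-range digit per character, in
        # non-decreasing order, so that each part is a contiguous substring.
        if len(labels) != len(string) or list(labels) != sorted(labels):
            continue
        if any(int(lab) >= len(subregexes) for lab in labels):
            continue
        parts = [""] * len(subregexes)
        for ch, lab in zip(string, labels):
            parts[int(lab)] += ch
        if all(re.fullmatch(rx, part) for rx, part in zip(subregexes, parts)):
            consistent += 1
    return consistent / len(strings)


# Hypothetical toy data: the target is the concatenation of a* and a*b.
subregexes = ["a*", "a*b"]
strings = ["aab", "b"]
gold = ["001", "1"]
pred = ["011", "1"]  # differs from gold on "aab" but is still a valid split

print(label_accuracy(gold, pred))                 # 0.5
print(regex_accuracy(strings, pred, subregexes))  # 1.0
```

Under this reading the two numbers can differ because a string may admit more than one split that is consistent with the subexpressions.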



Overall Synthesis Architecture (synthesis.py)

sh shell_script/synthesis.sh

Description

  • Infers a regex from a set of positive strings and a set of negative strings.
  • Each data item is given as (10 positive strings, 10 negative strings, regular expression).
  • Compares the divide-and-conquer approach with the naive synthesis approach in terms of time and success rate.
  • Synthesis output is stored in log_data/.

Synthesis process

  1. Split each positive string and negative string using the trained split model.
  2. Generate a subregex from the substrings of each part using one of the submodels.
  3. Build the regex by concatenating the subregexes.

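
The three steps can be illustrated with a toy sketch. It is not the code in synthesis.py: the labels are passed in instead of being predicted by the trained split model, the submodel is replaced by a trivial stand-in that just unions the observed substrings, and the negative strings are used only for a final consistency check. All function names are hypothetical.

```python
import re


def split_by_labels(string, labels, n_parts):
    """Step 1 (stand-in): cut a string into n_parts substrings by label digit."""
    parts = [""] * n_parts
    for ch, lab in zip(string, labels):
        parts[int(lab)] += ch
    return parts


def infer_subregex(substrings):
    """Step 2 (toy stand-in for a submodel): union of the observed substrings."""
    alternatives = sorted(set(substrings), key=lambda s: (len(s), s))
    return "(?:" + "|".join(re.escape(s) for s in alternatives) + ")"


def synthesize(positives, labels, negatives, n_parts):
    """Step 3: concatenate the subregexes, then check both example sets."""
    split_rows = (split_by_labels(s, l, n_parts) for s, l in zip(positives, labels))
    columns = zip(*split_rows)  # one tuple of substrings per part
    regex = "".join(infer_subregex(column) for column in columns)
    consistent = all(re.fullmatch(regex, s) for s in positives) and not any(
        re.fullmatch(regex, s) for s in negatives
    )
    return regex, consistent


positives = ["aaab", "aaba", "ba", "aaa"]
labels = ["0001", "0012", "12", "002"]
negatives = ["bb", "abab"]  # hypothetical negative strings

regex, consistent = synthesize(positives, labels, negatives, n_parts=3)
print(regex)       # (?:|aa|aaa)(?:|b)(?:|a)
print(consistent)  # True
```

A real submodel would generalize each part (for example to a starred expression) rather than enumerate the substrings it has seen; the point here is only that each part is solved on much shorter strings before the results are concatenated.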




Show the result (debug.py)

sh shell_script/debug.sh

Acknowledgment



License

Apache License 2.0