SplitRegex is a divide-and-conquer framework for learning target regexes. It splits (divides) the positive strings into parts and infers a partial regex for each part, which is much more accurate than inferring a regex from the whole strings, and then concatenates (conquers) the inferred regexes while remaining consistent with the negative strings.
This repository implements the SplitRegex framework and provides the datasets used in the experiments.
- Python 3.9.7 is preferred.
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
python setup.py install
cd submodels
git submodule update --init --recursive
cd ..
sh shell_script/data_generate.sh
- Downloads the raw data and transforms it into a usable form.
- The random dataset comes in sizes 2, 4, 6, 8, and 10.
- The practical dataset contains 'Snort', 'Regexlib', and 'Polyglot'. We replace some quantifiers with the Kleene star and character sets with a custom alphabet.
- Each data item is given as (20 positive strings, 20 negative strings, 20 labelled strings, regular expression).
String | Labelled string |
---|---|
aaab | 0001 |
aaba | 0012 |
ba | 12 |
aaa | 002 |
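
The relation between a string and its labelled string in the table above can be sketched in Python. This is a minimal illustration, not code from this repository: the helper name `split_by_labels` is hypothetical, and it assumes that each label character is the index of the subexpression that the character at the same position belongs to.

```python
from itertools import groupby


def split_by_labels(string, labels):
    """Group consecutive characters that share a label.

    Hypothetical helper: assumes one label character per input character.
    Returns a list of (label, substring) pairs in left-to-right order.
    """
    if len(string) != len(labels):
        raise ValueError("string and labels must have the same length")
    pairs = zip(labels, string)
    return [
        (label, "".join(ch for _, ch in group))
        for label, group in groupby(pairs, key=lambda pair: pair[0])
    ]


# The four rows of the table above.
examples = [("aaab", "0001"), ("aaba", "0012"), ("ba", "12"), ("aaa", "002")]
for string, labels in examples:
    print(string, labels, split_by_labels(string, labels))
```

A label that never occurs in a labelled string (for example `1` in `002`) simply means that the corresponding subexpression matched the empty string for that input.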
sh shell_script/practical_train.sh
sh shell_script/random_train.sh
- Generates a set of labelled strings from a set of strings: each string is split to determine the boundaries of its subexpressions.
- Each data item is given as (10 positive strings, 10 labelled strings, regular expression).
- Saves the trained model as 'model.pt' in saved_models/.
- Acc is the accuracy of the prediction against the data, while Acc (RE) is the accuracy of the prediction against the sub-regular expressions.
sh shell_script/synthesis.sh
- Infers a regex from a set of positive strings and a set of negative strings.
- Each data item is given as (10 positive strings, 10 negative strings, regular expression).
- Compares the divide-and-conquer approach with the naive synthesis approach in terms of time and success rate.
- Synthesis output is stored in log_data/.
- Split each positive string and negative string using the trained split model.
- Generate a subregex from the substrings with one of the submodels.
- Make the regex by concatenating the subregexes.
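
The three steps (split, infer per part, concatenate) can be sketched end to end in Python. This is a toy illustration under stated assumptions, not the repository's implementation: all function names are hypothetical, the labels are supplied by hand instead of being predicted by the trained split model, `infer_subregex` is a deliberately naive stand-in for a real submodel, and the negative strings are used only for a final consistency check rather than being split.

```python
import re
from itertools import groupby


def split_by_labels(string, labels):
    """Map each part index to its substring (assumes one contiguous run per label)."""
    return {
        int(label): "".join(ch for _, ch in group)
        for label, group in groupby(zip(labels, string), key=lambda pair: pair[0])
    }


def infer_subregex(substrings):
    """Toy stand-in for a submodel; NOT the synthesizer used in this repository."""
    nonempty = sorted({s for s in substrings if s})
    if not nonempty:
        return ""
    chars = {ch for s in nonempty for ch in s}
    if len(chars) == 1:
        # Every substring repeats a single character: generalize to a Kleene star.
        return re.escape(chars.pop()) + "*"
    body = "|".join(re.escape(s) for s in nonempty)
    optional = "?" if "" in substrings else ""
    return f"({body}){optional}"


def synthesize(positives, negatives, num_parts):
    """positives: list of (string, labels); negatives: list of strings."""
    parts = [[] for _ in range(num_parts)]
    for string, labels in positives:
        pieces = split_by_labels(string, labels)
        for i in range(num_parts):
            # A part that is absent from the labels matched the empty string.
            parts[i].append(pieces.get(i, ""))
    # Conquer step: concatenate the per-part regexes.
    regex = "".join(infer_subregex(part) for part in parts)
    pattern = re.compile(regex)
    consistent = all(pattern.fullmatch(s) for s, _ in positives) and not any(
        pattern.fullmatch(s) for s in negatives
    )
    return regex, consistent


positives = [("aaab", "0001"), ("aaba", "0012"), ("ba", "12"), ("aaa", "002")]
negatives = ["abab", "bab"]  # made-up negative strings for illustration
regex, consistent = synthesize(positives, negatives, num_parts=3)
print(regex, consistent)
```

Inferring each part separately keeps every subproblem small, which is the motivation for the divide step; the final check against the negative strings is what decides whether the concatenated regex is accepted.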
sh shell_script/debug.sh
- This product includes software (the seq2seq base model) developed at https://github.com/IBM/pytorch-seq2seq
- This product uses the fado module from https://github.com/0xnurl/fado-python3
- This product refers to the set2regex module from https://github.com/woaksths/set2regex