JayLi2018/RuleCleaner

This is the repository for RuleCleaner, submitted to VLDB 2024.

To run the code, follow these steps:

Run python setup.py install to install the rulecleaner package in your environment (a virtualenv is recommended).
Change into the rbbm_src/ folder and run the code using the options below.

usage: main.py [-h] [-U] [-e EXPERIMENT_NAME] [-R REPEATABLE] [-x SEED] [-X SEED_FILE] [-E RETRAIN_EVERY_PERCENT] [-A RETRAIN_ACCURACY_THRESH] [-T] [-l] [-s] [-r] [-G STRATEGY]
               [-D DELETION_FACTOR] [-W DELETION_ABSOLUTE_THRESHOLD] [-b DELETION_TYPE] [-d DBNAME] [-P PORT] [-p PASSWORD] [-u USER] [-f LF_SOURCE] [-O NUMBER_OF_FUNCS] [-i] [-z] [-w] [-o]
               [-k LOAD_FUNCS_FROM_PICKLE] [-K PICKLE_FILE_NAME] [-M TRAINING_MODEL_TYPE] [-n] [-t] [--run-gpt-rules] [--gpt-dataset] [--gpt-pickled-rules-dir]

Running experiments of LFRepair

optional arguments:
  -h, --help            show this help message and exit
  -U, --use_case    use case of the run, is it for dc or lf? (default: dc)
  -e EXPERIMENT_NAME, --experiment_name EXPERIMENT_NAME
                        the name of the experiment, the results will be stored in the directory named with experiment_name_systime (default: test_blah)
  -R REPEATABLE, --repeatable REPEATABLE
                        repeatable? (default: true)
  -x SEED, --seed SEED  if repeatable, specify a seed number here (default: 123)
  -X SEED_FILE, --seed_file SEED_FILE
                        if repeatable, specify a seed number here (default: seeds.txt)
  -E RETRAIN_EVERY_PERCENT, --retrain_every_percent RETRAIN_EVERY_PERCENT
                        retrain over every (default: 1*100), the default order is sorted by treesize ascendingly
  -A RETRAIN_ACCURACY_THRESH, --retrain_accuracy_thresh RETRAIN_ACCURACY_THRESH
                        when retrain over every retrain_every_percent, the algorithm stops when the fix rate is over this threshold (default: 1)
  -T, --pre_filter_thresh 
                        prefilter those rules that have number of tuples involved in violations above this thresh (default: 0)
  -l, --log_level   loglevel: debug/info/warning/error/critical (default: debug)
  -s, --user_input_size 
                        user input size total (the complaint size is decided by user_input_size*complaint_ratio) and confirm size is decided by user_input_size-complaint_size(default: 20)
  -r, --complaint_ratio 
                        user input complaint ratio (default: 0.5)
  -G STRATEGY, --strategy STRATEGY
                        method used to repair the rules (naive, information_gain, optimal) (default: information_gain)
  -D DELETION_FACTOR, --deletion_factor DELETION_FACTOR
                        this is a factor controlling how aggressively the algorithm chooses to delete the rule from the ruleset (default: 0.5)
  -W DELETION_ABSOLUTE_THRESHOLD, --deletion_absolute_threshold DELETION_ABSOLUTE_THRESHOLD
                        this is threshold for absolute tree size increase (default: 10)
  -b DELETION_TYPE, --deletion_type DELETION_TYPE
                        deletion type (ratio/absolute) (default: ratio)
  -d DBNAME, --dbname DBNAME
                        database name which stores the dataset, (default: label)
  -P PORT, --port PORT  database port, (default: 5433)
  -p PASSWORD, --password PASSWORD
                        database password, (default: 5432)
  -u USER, --user USER  database user, (default: postgres)
  -f LF_SOURCE, --lf_source LF_SOURCE
                        the source of labelling function (intro / system generate) (default: undefined)
  -O NUMBER_OF_FUNCS, --number_of_funcs NUMBER_OF_FUNCS
                        if lf_source is selected as system generate, how many do you want (default: 20)
  -i, --run_intro
  -z, --run_amazon
  -w, --run_painter
  -o, --run_professor
  -k LOAD_FUNCS_FROM_PICKLE, --load_funcs_from_pickle LOAD_FUNCS_FROM_PICKLE
                        (flag indicating if we want to load functions from a pickle file default: false)
  -K PICKLE_FILE_NAME, --pickle_file_name PICKLE_FILE_NAME
                        (if load_funcs_from_pickle, then heres the pickle file name : placeholder_name)
  -M TRAINING_MODEL_TYPE, --training_model_type TRAINING_MODEL_TYPE
                        the model used to get the label: majority/snorkel (default: snorkel)
  -n, --dataset_name 
                        dataset used in the use case of labelling functions (default: youtube)
  -t, --table_name  the table name from database cr that you want to work with (default: tax)
  --run-gpt-rules
  --gpt-dataset       youtube/amazon/pt/pa
  --gpt-pickled-rules-dir
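
The help text for -s and -r describes how the total user input is split into complaints and confirmations. A minimal sketch of that arithmetic follows. It assumes the product is truncated to an integer, since the rounding rule is not stated above, and split_user_input is a hypothetical helper, not part of the package:

```python
def split_user_input(user_input_size=20, complaint_ratio=0.5):
    """Split the total user input into complaint and confirmation counts.

    Assumption: the complaint count is truncated to an integer; the
    rounding actually used by main.py may differ.
    """
    complaint_size = int(user_input_size * complaint_ratio)
    confirm_size = user_input_size - complaint_size
    return complaint_size, confirm_size


# With the documented defaults (-s 20, -r 0.5):
print(split_user_input())  # (10, 10)
```

With the defaults, half of the 20 inputs are complaints and the remaining 10 are confirmations.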

For example, to run the experiment with GPT-generated rules on the youtube dataset, run:

python main.py -e test -U lf -r 0.5 -G information_gain -D 0 --run-gpt-rules --gpt-dataset youtube -n youtube
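
To repeat this run under several seeds, the command can be assembled in a small script. The sketch below is only an illustration: build_command and the seed values are hypothetical, and it uses only flags listed in the usage text above (-x sets the seed). It prints the commands rather than running them.

```python
import shlex
from typing import List


def build_command(seed: int, experiment_name: str = "test") -> List[str]:
    """Build the argument list for one run of the GPT-rules example."""
    return [
        "python", "main.py",
        "-e", experiment_name,
        "-U", "lf",
        "-r", "0.5",
        "-G", "information_gain",
        "-D", "0",
        "-x", str(seed),          # seed flag from the usage text
        "--run-gpt-rules",
        "--gpt-dataset", "youtube",
        "-n", "youtube",
    ]


# Hypothetical seed values; print each command as a shell-ready string.
for seed in (123, 456):
    print(shlex.join(build_command(seed)))
```

Each list can be passed to subprocess.run(..., check=True) from inside rbbm_src/ to launch the run.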

The full version of the paper is here
