YevhenKost / SemPrimsDetectionGA

Code to the Paper "Automatic Detection of Semantic Primitives Using Optimization Based on Genetic Algorithm"


Setup

  1. Clone the repository:
git clone https://github.com/YevhenKost/SemPrimsDetectionGA.git
  2. Install the requirements:
pip install -r requirements.txt
  3. Fill in the configs:
    1. PageRank model fitting parameters (conf/params_pagerank.json). The parameters are described at the following link: PageRank (see the hypothetical example below).
    2. Word vectorization paths and save names (conf/vectorization_configs.py). For each vectorizer, provide the required model paths on your local machine.
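
For reference, a hypothetical conf/params_pagerank.json could be created as follows. This is only a sketch: the keys alpha, max_iter and tol are assumptions, and the actual keys must match the PageRank implementation linked above.

import json

# Hypothetical parameter values; replace the keys with the ones
# expected by the PageRank implementation referenced above.
pagerank_params = {"alpha": 0.85, "max_iter": 100, "tol": 1e-06}

with open("conf/params_pagerank.json", "w") as f:
    json.dump(pagerank_params, f, indent=4)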

Usage

  1. Prepare the dictionary in the following format and save it to a json file. It is suggested to create a dedicated directory for the dictionary, where all the results will be stored. For example:
import json, os

# load dictionary
my_dict = {
    "cat": [
        {"definition": "a very cute animal"},
        {"definition": "makes muuuuuur"}
    ],
    "buy": [
        {"definition": "exchange something for a money"}
    ]
}

# save to the dir
SAVE_DIR = "cat_but_directory/"
os.makedirs(SAVE_DIR, exist_ok=True)
with open(os.path.join(SAVE_DIR, "dictionary.json"), "w") as f:
    json.dump(my_dict, f)
  2. Convert the dictionary to a directed graph. This can be done via the following command (paths are taken from the previous example):
python dict2graph.py --word_dictionary_path cat_but_directory/dictionary.json --stanza_dir LOADED_STANZA_MODELS/en --stanza_lang en --stop_words_lang english --save_dir cat_but_directory/ --drop_self_cycles true --lemm_always false

The arguments required:

  • --word_dictionary_path: path to the dictionary saved in json format (see the previous example).
  • --stanza_dir: path to downloaded stanza models; the stanza package is used for lemmatization. If set to "", stanza will download everything it needs based on the language given in --stanza_lang. For model details, see Pipeline.
  • --stanza_lang: language of the dictionary. The list of available languages can be found here.
  • --stop_words_lang: stop words language to use. The list of available languages can be found here.
  • --save_dir: path to a directory where the graph files will be stored: the word encoding dictionary and the graph edges dictionary, both in json format. It is suggested to use the same directory as for the dictionary.
  • --drop_self_cycles: boolean, whether to drop definitions that contain the word they are supposed to define. For example, for the word "bark" the definition "to bark" will not be used during graph building.
  • --lemm_always: boolean, whether to always lemmatize words or only when a word is not in the dictionary vocabulary.
  • --vocabulary_list_path (Optional): str, path to a json file with the vocabulary to use for graph building. If not provided, the keys of the dictionary from --word_dictionary_path are used.
  • --lemm_vocabulary (Optional): boolean, whether to lemmatize the words in the vocabulary list (duplicates will be removed). Ignored if --vocabulary_list_path is empty.
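
To make the idea of this step concrete, here is a simplified sketch of how a definition dictionary can be turned into directed edges. It is not the repository's implementation: dict2graph.py additionally handles lemmatization, stop words, self-cycles and word encoding, and its output schema and edge orientation may differ.

import json

# Simplified illustration: add a directed edge from each headword to every
# vocabulary word that occurs in one of its definitions.
with open("cat_but_directory/dictionary.json") as f:
    dictionary = json.load(f)

vocabulary = set(dictionary)
edges = {word: set() for word in dictionary}
for word, senses in dictionary.items():
    for sense in senses:
        for token in sense["definition"].lower().split():
            if token in vocabulary and token != word:
                edges[word].add(token)

print({word: sorted(targets) for word, targets in edges.items()})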

For more details:

python dict2graph.py -h
  3. Run the generation of permutation-based semantic primitive (SP) sets:
python sp_generation.py --load_dir cat_but_directory/ --N 1000 --n_cores 12 --seed 2

Note that this could take a while: for a WordNet dictionary, generating 1,000 SP lists took around a week with multiprocessing. The command saves the generated lists in --load_dir with the following format and filename:

sp_sets_format = [
   [1,2,3], # sp set
   [10,2,5] # sp set
]

filename = f"candidates_{str(N)}_random{str(seed)}.json" # N and seed are taken from the arguments

The arguments required:

  • --load_dir: path to the directory containing the graph.json file (generated in the previous step). The generated SP lists will be saved here.
  • --N: int, number of SP lists to generate (there is no guarantee that they will all be unique).
  • --n_cores: int, how many cores to use during multiprocessing.
  • --seed: int, random seed to fix for reproducibility.
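
As a quick sanity check, the generated file can be inspected as in the snippet below (a hypothetical example using the paths and filename pattern from above).

import json

# Load the generated SP candidate sets; uniqueness is not guaranteed,
# so count how many distinct sets were actually produced.
with open("cat_but_directory/candidates_1000_random2.json") as f:
    sp_sets = json.load(f)

unique_sets = {tuple(sorted(sp_set)) for sp_set in sp_sets}
print(f"{len(sp_sets)} generated, {len(unique_sets)} unique")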

For more details:
python sp_generation.py -h
  4. Fit the PageRank model:
python page_rank.py --load_dir cat_but_directory/ --fit_params_path conf/params_pagerank.json

The fitted model will be saved to --load_dir.

The arguments required:

  • --load_dir: path to the directory containing the graph.json file (generated in step 2). The trained PageRank model will be saved in this directory.
  • --fit_params_path: path to the json file with the PageRank parameters. See conf/params_pagerank.json.
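
For intuition, fitting PageRank on the definition graph amounts to something like the sketch below. The graph.json schema (assumed here to map each node to its successors), the networkx-based fitting and the pickled object are all assumptions; page_rank.py may differ in these details.

import json
import pickle

import networkx as nx

# Assumed schema: graph.json maps each node to the list of nodes it points to.
with open("cat_but_directory/graph.json") as f:
    edges = json.load(f)

graph = nx.DiGraph()
for source, targets in edges.items():
    graph.add_edges_from((source, target) for target in targets)

# Parameters from the config filled in during setup.
with open("conf/params_pagerank.json") as f:
    params = json.load(f)

scores = nx.pagerank(graph, **params)  # node -> PageRank score

with open("cat_but_directory/pagerank.pickle", "wb") as f:
    pickle.dump(scores, f)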

For more details:
python page_rank.py -h
  5. Run the algorithm:
python run.py --load_dir cat_but_directory/ --sp_gen_lists_path cat_but_directory/candidates_1000_random2.json --n_threads 8 --val_prank_fill -1.0 --pop_size 100 --card_diff 50 --card_upper 2800 --save_dir GA_fit_model

The algorithm results will be saved to --save_dir (see https://pymoo.org/interface/result.html). The decoded results will be stored in save_dir/sp_wordlists/.


The arguments required:

  • --load_dir: str, path to the directory containing the graph.json, encoding_dict.json and pagerank.pickle files (generated in the previous steps).
  • --chp_path (optional): str, path to a .npy checkpoint (if you want to continue training). After training, this checkpoint will be saved in --save_dir.
  • --n_threads: int, number of cores to use for multiprocessing.
  • --sp_gen_lists_path: str, path to the json file with the stored generated SP lists (see step 3).
  • --val_prank_fill: negative float, value returned by the mean PageRank objective function if a cycle is still detected in the graph.
  • --pop_size: int, population size (see https://pymoo.org/algorithms/soo/ga.html#nb-ga).
  • --card_diff: int, maximum allowed cardinality deviation (constraint function: f(X) = (X - card_mean) ** 2 <= card_diff ** 2; see the sketch after this list).
  • --card_mean: int, mean cardinality used in the same constraint.
  • --max_mutate: int, maximum number of elements to mutate per population. Default: 60.
  • --min_mutate: int, minimum number of elements to mutate per population. Default: 0.
  • --n_max_gen: int, maximum number of iterations (generations) to fit the algorithm. Default: 30.
  • --save_dir: path where the training args, checkpoint and results will be stored.
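
For clarity, the cardinality constraint above can be written in the g(x) <= 0 form used by pymoo. This is an illustrative sketch of the formula only, not the problem definition from run.py.

# Illustrative only: the constraint is satisfied when the returned value is <= 0,
# i.e. when the set's cardinality deviates from card_mean by at most card_diff.
def cardinality_constraint(sp_set, card_mean, card_diff):
    cardinality = len(sp_set)
    return (cardinality - card_mean) ** 2 - card_diff ** 2

print(cardinality_constraint(list(range(2810)), card_mean=2800, card_diff=50))  # -2400, feasible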

For more details see:

python run.py -h

Testing

  1. Prepare the word lists
    Create a directory in which each word list is stored in a text file with one word per line, for example:
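
A hypothetical example (the file name wordlists/target_words.txt and its contents are placeholders):

import os

# Placeholder word list: one word per line, one text file per list.
os.makedirs("wordlists", exist_ok=True)
with open("wordlists/target_words.txt", "w") as f:
    f.write("\n".join(["I", "you", "someone", "something", "good", "bad"]))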

  2. Fill in the preprocessing configs
    Fill in the conf/vectorization_configs.py and word_preprocessing_utils.py files. word_preprocessing_utils.py currently supports preprocessing for English, Spanish and Ukrainian, but it is possible to add new classes for other languages. In conf/vectorization_configs.py, fill in the stemming/lemmatization fields with the suitable classes.

  3. Vectorize target word lists

python vectorize_words.py --lists_dir wordlists/ --save_dir wordlists/embeddings/

The arguments required:

  • --lists_dir: path to the directory containing the word lists (see step 1).
  • --save_dir: path where the embeddings should be saved. A sub-directory with the same name as the word list file will be generated for each word list; the embeddings will be saved there in .npy format.
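
Assuming the layout described above, the embeddings produced for one word list could then be loaded as follows (the directory name wordlists/embeddings/target_words is hypothetical):

import os

import numpy as np

# Each word list gets its own sub-directory with the embeddings stored as .npy files.
emb_dir = "wordlists/embeddings/target_words"
embeddings = [
    np.load(os.path.join(emb_dir, name))
    for name in sorted(os.listdir(emb_dir))
    if name.endswith(".npy")
]
print(len(embeddings), "embedding arrays loaded")
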
  4. Vectorize the obtained word lists (see step 5 of Usage):
python vectorize_words.py --lists_dir GA_fit_model/sp_wordlists --save_dir GA_fit_model/sp_embeddings/
  5. Calculate and save the metrics:
python evaluate.py --pred_wordlist_embeddings_dir GA_fit_model/sp_embeddings --target_wordlist_dir wordlists/embeddings/ --save_dir GA_fit_model/ --metric cosine

The arguments required:

  • --pred_wordlist_embeddings_dir: path to the directory where the embeddings for the generated populations are stored (see the previous step).
  • --target_wordlist_dir: path to the directory where the embeddings for the target word lists are stored (see step 3).
  • --metric: metric to use. See the metric argument of https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html.
  • --save_dir: path where the metrics should be saved. A json file named metrics_<metric>.json will be generated, where <metric> is the specified metric.
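
To illustrate the metric computation, the snippet below computes pairwise cosine distances between two toy embedding matrices with scipy's cdist; how evaluate.py aggregates such distances into the saved metrics file is not shown here and may differ.

import numpy as np
from scipy.spatial.distance import cdist

# Toy embeddings: 5 predicted words and 8 target words, 300-dimensional vectors.
pred_embeddings = np.random.rand(5, 300)
target_embeddings = np.random.rand(8, 300)

# Pairwise distances; the --metric argument is passed to cdist in the same way.
distances = cdist(pred_embeddings, target_embeddings, metric="cosine")
print(distances.shape)   # (5, 8)
print(distances.mean())  # one possible aggregate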
