How to Use Basilisk:
====================

To use Basilisk to extract semantic lexicons, first manually generate seeds
for each semantic class, then prepare the pattern extractions, and finally
run Basilisk.

(1) Select Seeds:

    It is best to select at least 10 seeds for each semantic class. A simple
    way to generate seeds is to sort the words in the whole corpus by
    frequency and manually identify the 10 most frequent nouns that belong
    to each semantic category. Seeds that belong to the same semantic class
    are stored in one separate file (one seed word per line), and all seed
    files are put in the same directory.

(2) Prepare Pattern Extractions:

    In our setting, any pattern extractions can be used with Basilisk, but
    each line of the extraction file must look like this:
    'extractedNoun * extractionPattern '.

    Here we give an example of how to generate pattern extractions using
    the Stanford Dependency Parser.

    Raw data file: test.txt
    Use the Stanford Dependency Parser to generate the dependency file: test.parse
    Then use extractionFormatConvertor.py to convert it to an extraction
    file that can be used by Basilisk.

(3) Run Basilisk

    Command line:

    java -jar BASILISK.jar seed_slists extractions_file stopwords_file [(options) (flags)]

    seed_slists:
    An 'slist' file must have a directory path on the first line, followed
    by the names of individual files found in that directory. The
    'seed_slists' file should list the files containing the seed words for
    each semantic category to be learned. When running in single category
    mode, the slist file should list only a single seed file. When running
    in multiple category mode, the slist file should list two or more seed
    files.

    Example content of seed slist:
    -----------------------------
    |seeds/terrorism/           |
    |human.seeds                |
    |vehicles.seeds             |
    |weapon.seeds               |
    -----------------------------

    Options:

    -n [num_iterations]
        Number of iterations to run Basilisk for. Default value is 5.
    -c [0 or 1]
        0: simple conflict resolution; 1: improved conflict resolution.
        Default is improved conflict resolution.

    -o [directory]
        Output directory. Default is the root directory.

    -s
        Runs Basilisk in "Snowball" mode. That is, Basilisk will only select
        the top scorer from each category. Default is to not use this
        feature.

    -t
        Also writes a trace file outlining the words and their scores
        during each iteration of Basilisk.
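
Example: preparing seed candidates and a seed slist
---------------------------------------------------

The seed-selection step in (1) and the slist layout in (3) can be sketched in
Python as follows. This is only a minimal illustration and not part of
Basilisk itself: the function names (top_candidates, slist_text,
write_seed_files) are hypothetical, the regular-expression tokenizer is an
assumption about how the corpus is split into words, and the trailing slash
on the directory line simply mirrors the example slist above. Deciding which
frequent words are nouns of a given semantic class remains a manual step.

```python
from collections import Counter
from pathlib import Path
import re


def top_candidates(corpus_text, k=50):
    """Return the k most frequent word tokens as seed candidates.

    Assumption: words are runs of ASCII letters, compared in lowercase.
    A person still has to pick the nouns for each semantic class.
    """
    tokens = re.findall(r"[a-z]+", corpus_text.lower())
    return [word for word, _ in Counter(tokens).most_common(k)]


def slist_text(seed_dir, file_names):
    """Build slist content: directory path first, then one file name per line."""
    directory = str(seed_dir).rstrip("/") + "/"
    return "\n".join([directory] + list(file_names)) + "\n"


def write_seed_files(seed_dir, seeds_by_class, slist_path):
    """Write one <class>.seeds file per class (one seed per line) and the slist."""
    seed_dir = Path(seed_dir)
    seed_dir.mkdir(parents=True, exist_ok=True)
    names = []
    for cls, seeds in seeds_by_class.items():
        name = f"{cls}.seeds"  # extension follows the example slist
        (seed_dir / name).write_text("\n".join(seeds) + "\n", encoding="utf-8")
        names.append(name)
    Path(slist_path).write_text(
        slist_text(seed_dir.as_posix(), names), encoding="utf-8"
    )
    return names
```

After manually choosing seeds from the candidates, a call such as
write_seed_files("seeds/terrorism", {"human": [...], "weapon": [...]},
"seed_slists") would produce the seed directory and an slist file whose path
can be passed as the first argument on the Basilisk command line.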