koshinryuu / basilisk

BASILISK (Bootstrapping Approach to SemantIc Lexicon Induction using Semantic Knowledge)

How to Use Basilisk:
====================

To extract semantic lexicons with Basilisk, first manually generate seeds for each semantic class, then prepare pattern extractions, and finally run Basilisk.

  (1) Select Seeds:

    We recommend selecting at least 10 seeds for each semantic class. A simple way to generate them is to sort the words of the whole corpus by frequency and manually identify the 10 most frequent nouns that belong to each semantic category.

    Seeds that belong to the same semantic class are stored in one file, one seed word per line. All seed files are placed in the same directory.
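
The frequency-sorting step can be sketched in Python as follows. This is a minimal illustration, not part of the Basilisk distribution: the function names `top_candidates` and `write_seed_file` are hypothetical, and the tokenizer is a plain regular expression that assumes English text. It only ranks words by frequency; deciding which candidates are nouns of the right semantic category remains a manual step.

```python
import re
from collections import Counter


def top_candidates(text, k=10):
    """Return the k most frequent lowercase word tokens in `text`."""
    # Assumption: a token is a run of ASCII letters; adapt for other corpora.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(k)]


def write_seed_file(path, seeds):
    """Write one seed word per line, the seed-file layout described above."""
    with open(path, "w", encoding="utf-8") as handle:
        for seed in seeds:
            handle.write(seed + "\n")
```

A typical use would be to print a longer candidate list (for example `top_candidates(corpus_text, 200)`), pick the seeds by hand, and pass the chosen words to `write_seed_file`.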

  
  (2) Prepare Pattern Extractions:
      Extractions from any pattern extractor can be used with Basilisk, provided that each line of the extraction file has this form:
      'extractedNoun   *  extractionPattern '.

      Here is an example of how to generate pattern extractions using the Stanford Dependency Parser:
      Raw data file:  test.txt
      Use the Stanford Dependency Parser to generate the dependency file:  test.parse
      Then use extractionFormatConvertor.py to convert it to an extraction file that can be used in Basilisk.
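
Before running Basilisk, it can help to check that an extraction file follows the 'extractedNoun * extractionPattern' layout. The sketch below is a hypothetical helper, not the logic of extractionFormatConvertor.py. It assumes that the first `*` on a line is the separator and that surrounding whitespace is not significant; the README does not state either point.

```python
def parse_extraction_line(line):
    """Split 'extractedNoun * extractionPattern' into (noun, pattern).

    Returns None when the line does not match that layout.
    Assumption: the first '*' on the line is the separator.
    """
    noun, separator, pattern = line.partition("*")
    noun, pattern = noun.strip(), pattern.strip()
    if not separator or not noun or not pattern:
        return None
    return noun, pattern


def malformed_line_numbers(lines):
    """Return the 1-based numbers of non-blank lines that fail to parse."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.strip() and parse_extraction_line(line) is None
    ]
```

Passing an open file object to `malformed_line_numbers` reports which lines to inspect before handing the file to Basilisk.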

  (3) Run Basilisk
      
      Command line:

      java -jar BASILISK.jar  seed_slists    extractions_file   stopwords_file     [(options) (flags)]


      seed_slists:
                    An 'slist' file must have a directory path on the first
                    line, followed by the names of individual files found
                    in that directory. The 'seed_slists' file should list
                    the files containing the seed words for each semantic
                    category to be learned. When running in single category
                    mode, the slist file should only list a single seed
                    file. When running in multiple category mode, the slist
                    file should list two or more seed files.
       
                    Example content of seed slist:
                    -----------------------------
                    |seeds/terrorism/           |
                    |human.seeds                |
                    |vehicles.seeds             |
                    |weapon.seeds               |
                    -----------------------------
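
An slist file like the one above can also be generated from a seed directory. The following is a minimal sketch with a hypothetical helper name, `write_slist`. It writes the directory path with a trailing slash, as in the example; whether Basilisk requires the trailing slash is an assumption.

```python
import os


def write_slist(slist_path, seed_dir):
    """Write an slist: the directory path first, then one file name per line.

    Returns the list of file names written, in sorted order.
    """
    names = sorted(
        name
        for name in os.listdir(seed_dir)
        if os.path.isfile(os.path.join(seed_dir, name))
    )
    # Assumption: keep a trailing slash, matching the example slist above.
    directory = seed_dir if seed_dir.endswith("/") else seed_dir + "/"
    with open(slist_path, "w", encoding="utf-8") as handle:
        handle.write(directory + "\n")
        for name in names:
            handle.write(name + "\n")
    return names
```

For single category mode, the seed directory should contain only one seed file, or the resulting slist should be trimmed to a single entry.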


    Options:
          -n  [num_iterations]        Number of iterations to run Basilisk for.
                                      Default value is 5.
    
          -c  [0 or 1]                0: simple conflict resolution; 1: improved conflict resolution
                                      Default is improved conflict resolution
    
          -o  [directory]             Output directory.
                                      Default is the root directory.
    
          -s                          Runs Basilisk in "Snowball" mode. That is, Basilisk will only select
                                      the top scorer from each category.
                                      Default is to not use this feature.
                                                
          -t                          Also writes a trace file outlining the words and their scores during each
                                      iteration of Basilisk.
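
When Basilisk is driven from a script, the command line above can be assembled as an argument list. This sketch uses only the positional arguments and options documented in this README. The function name `build_command` and the file names in the usage note are hypothetical.

```python
def build_command(seed_slist, extractions, stopwords,
                  iterations=None, conflict=None, out_dir=None,
                  snowball=False, trace=False, jar="BASILISK.jar"):
    """Build the argument list for the documented Basilisk command line."""
    cmd = ["java", "-jar", jar, seed_slist, extractions, stopwords]
    if iterations is not None:
        cmd += ["-n", str(iterations)]   # number of iterations
    if conflict is not None:
        if conflict not in (0, 1):
            raise ValueError("-c accepts only 0 (simple) or 1 (improved)")
        cmd += ["-c", str(conflict)]     # conflict resolution mode
    if out_dir is not None:
        cmd += ["-o", out_dir]           # output directory
    if snowball:
        cmd.append("-s")                 # "Snowball" mode
    if trace:
        cmd.append("-t")                 # write a trace file
    return cmd
```

For example, `build_command("seeds.slist", "test.extractions", "stopwords.txt", iterations=10, trace=True)` yields a list that can be passed to `subprocess.run(cmd, check=True)`. Options left unset are omitted, so Basilisk falls back to its defaults.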

About

License:GNU General Public License v2.0


Languages

Language:Python 100.0%