Spectral Word Embedding Learning for Language (SWELL) toolkit

SpectralLearning Toolkit README Ver. 0.0.2
==========================================


The spectral learning toolkit contains implementations of various spectral learning algorithms. All the algorithms learn from large amounts of unlabeled text (e.g. WSJ, NYT, Reuters etc.) and output either a dictionary (context-oblivious mapping) or a context-sensitive mapping from each word in the text to a low-dimensional, typically ~30-50 dimensional, real-valued vector. The dictionary (context-oblivious mapping) maps each word type to a single vector, e.g. "bank" has one vector for all occurrences of "bank" in the text, irrespective of whether it refers to "river bank" or "JPM Chase bank". Context-sensitive mappings, on the other hand, map each word token to a vector, and hence would map the "bank" in "river bank" and the "bank" in "JPM Chase bank" to different vectors based on context.
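To make the distinction concrete, here is a minimal, purely illustrative Java sketch of the two kinds of mappings (type-level dictionary vs. token-level embedder). The class and method names, and the "<OOV>" key, are assumptions made for illustration, not SWELL's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: names and signatures are assumptions, not SWELL's API.
class MappingsSketch {
    // Context-oblivious dictionary: one vector per word *type*.
    static Map<String, double[]> dictionary = new HashMap<>();

    static double[] typeEmbedding(String word) {
        // "bank" always gets the same vector, whatever its context.
        return dictionary.getOrDefault(word, dictionary.get("<OOV>"));
    }

    // Context-sensitive mapping: one vector per word *token*, so the
    // embedding of position i depends on the surrounding words.
    static double[] tokenEmbedding(List<String> sentence, int i) {
        double[] base = typeEmbedding(sentence.get(i));
        // A real implementation would combine the word with its left/right
        // context; here we just return the type vector as a placeholder.
        return base.clone();
    }
}
```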

The goal behind learning these embeddings is that they should provide supplementary information on top of the baseline set of features one would already use for a task. For example, suppose you're doing NER, where the standard train/test sets are sections of the CoNLL '03 data and your classifier is some discriminative model, e.g. a CRF, to which it's easy to add new features. You would have a baseline set of NER features, e.g. word identities and capitalization in a fixed window around the given word. To improve the performance of your NER system, you can then add the spectral embeddings learned above as extra features, with the hope that they provide non-redundant information beyond the baseline features and hence boost classification accuracy for the task at hand, NER in this case.
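As a rough sketch of what "adding embeddings as extra features" could look like in practice (a hypothetical feature extractor; the exact feature encoding depends on your CRF package, not on SWELL):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical NER feature extractor, not part of SWELL.
class NerFeatureSketch {
    static List<String> features(List<String> sent, int i,
                                 Map<String, double[]> eigenDict) {
        List<String> feats = new ArrayList<>();
        String w = sent.get(i);
        // Baseline features: word identity and capitalization.
        feats.add("w=" + w);
        feats.add("isCap=" + Character.isUpperCase(w.charAt(0)));
        // Extra features: the word's spectral embedding, one real-valued
        // feature per dimension (encoded as strings here, as many CRF
        // packages expect).
        double[] emb = eigenDict.getOrDefault(w, eigenDict.get("<OOV>"));
        for (int d = 0; d < emb.length; d++) {
            feats.add("emb" + d + "=" + emb[d]);
        }
        return feats;
    }
}
```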

With that in mind, the processing occurs in two stages. In the first stage we learn a dictionary (context oblivious) from large amounts of data (e.g. WSJ, NYT, Reuters etc.), and in the second stage we "induce" embeddings/mappings for a given corpus, e.g. the CoNLL train/test sets in the NER example. The first-stage embeddings are necessarily context oblivious, since that corpus is not used for an actual task such as NER. The second-stage embeddings, which are for a given task such as NER, can be either context oblivious or context sensitive.

Now, you might ask why we don't also add the CoNLL train/test sets to the unlabeled learning in the first stage. In principle you can do that, and it would certainly improve performance over not doing so, since there would then be no out-of-vocabulary words; however, I prefer not to, due to the risk of overfitting.

The basic experimental framework described above is the one used in our NIPS 11 paper. You might also want to have a look at our papers:
"Multi-View Learning of Word Embeddings via CCA" 
Dhillon, Foster and Ungar, NIPS 11

"Two Step CCA: A new spectral method for estimating vector models of words"
Dhillon, Rodu, Foster and Ungar, ICML 12

Dependencies: SWELL relies on JEIGEN, a Java wrapper for the C++ Eigen library. You need to build it separately.

https://github.com/hughperkins/jeigen

=================================================
The algorithm variants currently implemented in the toolkit are:

1). OSCCA, TSCCA: see the ICML 12 paper.

2). LR-MVL embeddings: two algorithms similar to the one proposed in the NIPS 11 paper.

3). OSCCA and TSCCA for n-grams: these algorithms are exactly the same as above, except that they expect bigrams, trigrams or 5-grams of text followed by a count as input, rather than running text. See "Input_Files" for the exact input format. For bigrams, the algorithm option should be "WvsR". For trigrams and 5-grams, all four options "WvsR", "WvsL", "WvsLR" and "TwoStepLRvsW" work. With "WvsR" or "WvsL" on trigrams/5-grams, "W" is always the first word and the "R"/"L" context is the remaining n-1 words. With "WvsLR", "W" is the middle word and "L"/"R" are its (n-1)/2-sized contexts on either side (see the sketch below).
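Below is a small illustrative Java sketch of how an n-gram-plus-count line could be split into the "W" word view and the "L"/"R" context views under these options. It is not SWELL's own parser; the whitespace-separated line format and the names used are assumptions based on the description above, so check Input_Files/ for the authoritative format.

```java
import java.util.Arrays;

// Illustrative only: splits an n-gram line into the "W" word view and the
// "L"/"R" context views described above.
class NGramViewSketch {
    public static void main(String[] args) {
        // Assumed line format: the n-gram followed by its count.
        String[] fields = "the river bank 42".split("\\s+");
        String[] ngram = Arrays.copyOf(fields, fields.length - 1);
        long count = Long.parseLong(fields[fields.length - 1]);

        // "WvsR" / "WvsL": W is the first word, the context is the rest.
        String wFirst = ngram[0];
        String[] rest = Arrays.copyOfRange(ngram, 1, ngram.length);

        // "WvsLR": W is the middle word, L/R are the (n-1)/2 words on
        // either side of it.
        int mid = ngram.length / 2;
        String wMid = ngram[mid];
        String[] left = Arrays.copyOfRange(ngram, 0, mid);
        String[] right = Arrays.copyOfRange(ngram, mid + 1, ngram.length);

        System.out.println("count=" + count
                + " | WvsR: W=" + wFirst + " R=" + Arrays.toString(rest)
                + " | WvsLR: W=" + wMid
                + " L=" + Arrays.toString(left)
                + " R=" + Arrays.toString(right));
    }
}
```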




The SpectralLearning code can be run via "ant", e.g. "ant CCAVariants"; all the targets that are currently stable and working are listed in the build.xml file. Since the unlabeled data can be huge, you may run into memory issues. The default heap size allotted is 4G, so you might want to increase it by adjusting the "-Xmx" option in the relevant target in "build.xml".


Program Arguments (specified in the build.xml file)
===============================================
All the algorithms share a common set of arguments, and these arguments have the same meaning for each algorithm.


"unlab-train": This is the option to run the 1st stage unsupervised learning part from some big corpora (e.g. WSJ, NYT, Reuters etc.). The location of data is specified by: "unlab-train-file". There are some preprocessed datasets in the correct format available on sobolev e.g. Reuters RCV (~205M tokens).

"train": This is the option to run the 2nd stage unsupervised learning part i.e. inducing embeddings on some smaller problem specific corpora e.g. CoNLL data for the NER problem . The location of data is specified by: "train-file".

You can run "train" and "unlab-train" together or separately. Typically, one runs "unlab-train" once to learn the dictionary from the large corpus, and then uses/re-uses that dictionary to induce embeddings for various train/test sets (smaller corpora) via the "train" option.

"output-dir-prefix": Name of the directory where you want to save the outputs of the algorithms including the embeddings. By default it saves the outputs in the "Output_Files" folder.



"algorithm:" This is a necessary option. It chooses the algorithm to run. Options are 


1). algorithm:CCARunningText:2viewWvsLR       (OSCCA) (Main method in: CCAVariants.java)
2). algorithm:CCARunningText:TwoStepLRvsW     (TSCCA) (Main method in: CCAVariants.java)
3). algorithm:CCARunningText:LRMVL1     (LRMVL1) (Main method in: CCAVariants.java)
4). algorithm:CCARunningText:LRMVL     (LRMVL2) (Main method in: LRMVL2.java)
5). algorithm:CCARunningText:2viewWvsR   (OSCCA with only Right Context) (Main method in: CCAVariants.java)
6). algorithm:CCARunningText:2viewWvsL   (OSCCA with only Left Context) (Main method in: CCAVariants.java)
7). algorithm:CCANGram:2viewWvsLR   (Main method in: CCAVariantsNGrams.java)
8). algorithm:CCANGram:2viewWvsR    (Main method in: CCAVariantsNGrams.java)
9). algorithm:CCANGram:2viewWvsL     (Main method in: CCAVariantsNGrams.java)
10). algorithm:CCANGram:TwoStepLRvsW  (Main method in: CCAVariantsNGrams.java)

"doc-separator:" For some algorithms we care about document boundaries and collect/aggregate statistics across different documents. So, we need to specify the document separator, currently, I am using "DOCSTART-X-0" as separator. Note that 3 of the 4 algorithms i.e. ContextPCA, LSA, LRMVLEmbed expect the input files to be running text with documents separated via the document separator. ContextPCANGrams works with different kind of input file. Its input file contains bigrams separated by space by their counts and needs no document separators. If you're using Google n-gram data then ContextPCANGrams might be the algorithm you might want to consider running. 

**Please look at the sample input files in the Input_Files/ folder and make sure your data has no additional or missing newlines, tabs, etc. compared to those files. The I/O code is not very robust, so it may fail if extra newlines create empty documents and so on.**

"vocab-size": This option specifies the size of dictionary i.e. how many unique words (types) we want to learn the embeddings for, from the 1st stage of unlabeled data learning. Typical sizes that I used were 100k to 300k. However, there are not more than 50k words in typical english, so you might want to play around with this number or choose it based on some linguistic knowledge that you might have. The final dictionary size is "vocab-size +1" i.e. for all the other words less frequent words which don't make it to the "vocab-size" we learn an "Out Of Vocabulary (OOV)" vector, which is a generic embedding to use for an unseen word.

"hidden-state": This is the size of embeddings. Typical numbers vary from 100-200.


"eigen-dict-name": This is the name of the output dictionary after the 1st stage of learning. It is a matrix of size (num-vocab +1)* hidden-state.



Extra Algorithm specific options
================================


"contextSizeEachSide": All algorithms except n-grams based ones and LRMVL2  require this option. It specifies the amount of context on each size to consider while constructing the word by context matrix.

"sqrt-trans": Highly recommended to get good embeddings. It takes square root of word counts before running the algorithm. Works for all algorithms.

"smooths": E.g. 0.1,0.5 Comma separated smooth values to be used for LRMVL2


"num-grams": 2,3 or 5. All the n-gram algorithms require this option




**Lastly, I invite you to look at the code yourself if you are curious about how things are implemented or if something is misbehaving. Please feel free to report potential bugs or other suggestions. We plan to release the code sometime in the next few months, so feedback would be very welcome.**

**Also note that there are dummy input files in the Input_Files folder, so the code is in a "just push play" state. You can run any of the algorithms with the options specified in build.xml and it should work. The generated output will be in the Output_Files folder.**

So, just do a git clone, cd into that directory and type "ant" to build. Then run any target, e.g. "ant CCAVariantsOSCCA". The embeddings will be in the output folder under names of the form eigenDict*.


**This is a continuously evolving document. I have tried my best to ensure everything is factually correct, but there might be some discrepancies, so I will keep updating it. (3/14/14)**

cheers and best of luck!
dhillon@cis.upenn.edu 







 
