yhalpern/anchor-word-recovery

This a Python implementation of the paper 
"A Practical Algorithm for Topic Modeling with Provable Guarantees"

Sanjeev Arora (arora@cs.princeton.edu)
Rong Ge (rongge@cs.princeton.edu)
Yoni Halpern (halpern@cs.nyu.edu)
David Mimno (mimno@cs.princeton.edu)
Ankur Moitra (moitra@ias.edu)
David Sontag (dsontag@cs.nyu.edu)
Yichen Wu (yichenwu@princeton.edu)
Michael Zhu (mhzhu@princeton.edu)

ICML 2013.

####################################################

anchor-word-recover (AWR) is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by
the Free Software Foundation

AWR is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with AWR; see the file gpl-3.0.txt.  If not, write to the Free
Software Foundation, 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.

####################################################

This is a work in progress. Any questions/comments/suggestions can be directed to:
Yoni Halpern: halpern@cs.nyu.edu
 
----------------------------------------------------
Running Environment:
    python 2.x
    numpy (for mathematical calculations)
    scipy (for sparse matrix manipulation)

a tutorial for installing python and the rest on macOS is here:
http://www.lowindata.com/2013/installing-scientific-python-on-mac-os-x/

----------------------------------------------------
DEMO:

a demo which downloads an example file from uci machine learning database and learns different models can be found in demo.sh
Some notes about the demo:
    20 topics is likely too few, so it is possible that the topics will be mixed together
    50+ topics should be good.
    Using KL as a loss is slower than L2 loss.

----------------------------------------------------

Simple Usage:

    input:  1-indexed word-document matrix in the UCI format 
            vocabulary file
            (see http://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/ for examples of matrix and vocab files.) 
            list of stopwords named stopwords.txt
            
    1. transform document to sparse matrix format:
        python uci_to_scipy.py input_file output_file
    
    2. remove rare words and stopwords (script removes words that do not appear in at least N documents).
        python truncate_vocabulary.py sparse_matrix_file vocab_file N
        note: this implementation is fairly sensitive to rare words. In the demo we filter words that appear in fewer than 50 documents and only allow anchors if they appear in over 100 documents.

    3. learn K topics
        python learn_topics.py sparse_matrix_file settings_file vocab_file K loss output_prefix
    
        an example settings file can be found in settings.example
        loss can be L2 or KL
    
    outputs:
        Recovered A matrix
        Top words
        topic likelihoods
yhalpern / anchor-word-recovery

About

Languages