Mammoth

Web-scale topic modelling for the ClueWeb12 dataset using Spark and Glint. This project is a continuation of the Web Scale Data Processing and Mining Project.

Downloading the 1000-topic LDA model

We have used this software to train a 1000-topic LDA model on the full ClueWeb12 dataset using a truncated vocabulary of 100,000 terms. The model is publicly available for download here (gzipped: 609MB, uncompressed: 2GB). The file structure looks like this:

0 0.001 0.002 0.00001 0.3 0.001
1 0.500 0.698 0.99998 0.1 0.899
2 0.499 0.3   0.00001 0.2 0.1

The features are represented as rows and the topics as columns. The first column of each row states the feature number (corresponding to the dictionary, which can be found here). The remaining columns give the probability of that feature for the respective topic. The probabilities in each column add up to 1.
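To work with the downloaded model programmatically, you can parse each line into a feature id and its per-topic probabilities. The following Scala sketch is illustrative only; the file path is a placeholder and the snippet is not part of Mammoth itself:

import scala.io.Source

// Parse the model file: the first column is the feature id, the remaining
// columns are the per-topic probabilities for that feature.
val model: Map[Int, Array[Double]] =
  Source.fromFile("/path/to/model.txt").getLines().map { line =>
    val cols = line.trim.split("\\s+")
    cols.head.toInt -> cols.tail.map(_.toDouble)
  }.toMap

// Sanity check: each column (topic) should sum to approximately 1 across all features.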

Compiling

To compile Mammoth you will need to use sbt, which will take care of all dependencies. You can compile the application from the repository directory by running:

sbt compile assembly
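The assembly task comes from the sbt-assembly plugin, which the repository's build configuration already includes. If you are adapting the build yourself, a typical project/plugins.sbt entry looks roughly like this (the version below is illustrative, not necessarily the one this project pins):

// project/plugins.sbt -- illustrative; check the repository for the actual pinned version
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")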

Typically you will want to execute the application on a Spark cluster. You can copy the compiled jar file to a remote machine using the scp command:

scp target/scala-2.10/Mammoth-assembly-0.1.jar username@server:/path/to/Mammoth-assembly-0.1.jar

Running

Mammoth is a Spark application and therefore needs to be executed on a Spark cluster. Use the spark-submit command to submit the jar file generated by sbt compile assembly. You can specify additional command-line options to customize the behavior of the application:

Mammoth 0.1
Usage:  [options]

  -d <value> | --dataset <value>
        The directory where the dataset is located
  -r <value> | --rdd <value>
        The (optional) RDD vector data file to load (if it does not exist, it will be created based on the dataset)
  --dictionary <value>
        The dictionary file (if it does not exist, a dictionary will be created there)
  -i <value> | --initial <value>
        The file containing the topic model to initialize with (leave empty to start from a random topic model)
  -f <value> | --final <value>
        The file where the final topic model will be stored
  -c <value> | --glintConfig <value>
        The glint configuration file
  -s <value> | --seed <value>
        The random seed to initialize the topic model with (ignored when an initial model is loaded, default: 42)
  -t <value> | --topics <value>
        The number of topics (ignored when an initial model is loaded, default: 30)
  -v <value> | --vocabulary <value>
        The (maximum) size of the vocabulary (ignored when an initial model is loaded, default: 60000)
  -b <value> | --blocksize <value>
        The size of a block of parameters to process at a time (default: 60000)
  -α <value> | --alpha <value>
        The (symmetric) α prior on the topic-document distribution (default: 0.5)
  -β <value> | --beta <value>
        The (symmetric) β prior on the topic-word distribution (default: 0.01)
  -τ <value> | --tau <value>
        The SSP delay bound (default: 1)
  -g <value> | --globalIterations <value>
        The number of global iterations (default: 20)
  -l <value> | --localIterations <value>
        The number of local iterations (default: 5)
  -p <value> | --partitions <value>
        The number of partitions to split the data in (default: 336)

Here is an example that uses 80000 vocabulary terms, 50 topics, 100 global iterations, and 1000 partitions:

spark-submit --master spark://master-url:7077 Mammoth-assembly-0.1.jar -v 80000 -d "/path/to/dataset/*" -p 1000 -t 50 -g 100 --dictionary "/path/to/dictionary.txt" -c "/path/to/glint.conf"

If you want to tweak the amount of executor memory, driver memory, or the number of cores, you can pass these parameters to the spark-submit command. A long list of configurable Spark parameters can be found here. Following is the same example as above, but with the amount of memory and number of cores specified:

spark-submit --master spark://master-url:7077 --executor-memory 100G --driver-memory 100G --total-executor-cores 368 Mammoth-assembly-0.1.jar -v 80000 -d "/path/to/dataset/*" -p 1000 -t 50 -g 100 --dictionary "/path/to/dictionary.txt" -c "/path/to/glint.conf"

Troubleshooting

Check out the wiki for known issues with running this application on a computing cluster.

License

MIT License

