EdwardRaff / JSAT

Java Statistical Analysis Tool, a Java library for Machine Learning

Example from raw text to LDA?

salamanders opened this issue

I saw your old article on topic modeling, http://jsatml.blogspot.com/2014/06/stochastic-lda.html,
and was wondering how easy it would be to take a text file with one document per row and try out your LDA model. I've been trying Mallet and gensim, but both are very particular about how you have to prepare your data. I was hoping for a snippet from your tests.

I saw the update(List<Vec>) method.
I have a List<String> of documents (multiple sentences each, plain text, with punctuation)
that I want to turn into a List<jsat.linear.Vec>.
What is the best way to get from A to B?

It isn't quite the DataSet from the Loading-text-data-and-Spam-Classification wiki example.

Maybe HashedTextDataLoader.java?

I checked out OnlineLDAsviTest.java, but it didn't bridge the two.

Do you have a snippet that would test the topic modeling starting from a list of Strings?

You can base it off the code from the Loading-text example. Once you have the ClassificationDataSet, you can get the list of data point vectors from the getDataVectors method.
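If it helps, a minimal sketch of the bridge (assuming a loader built as in the Loading-text example; `loader` and the topic count are placeholder names/values):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import jsat.classifiers.ClassificationDataSet;
    import jsat.linear.Vec;
    import jsat.text.topicmodel.OnlineLDAsvi;
    import jsat.utils.SystemInfo;

    // `loader` stands in for a ClassificationTextDataLoader set up as in
    // the Loading-text example.
    final ClassificationDataSet cds = loader.getDataSet();
    final List<Vec> docVecs = cds.getDataVectors(); // one sparse Vec per document

    final ExecutorService ex = Executors.newFixedThreadPool(SystemInfo.LogicalCores);
    final OnlineLDAsvi lda = new OnlineLDAsvi();
    lda.model(cds, 10, ex); // fit 10 topics over the whole dataset at once
    ex.shutdown();

model(...) runs over the whole dataset for you; update(List<Vec>) is the lower-level call if you want to feed mini-batches yourself.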

I'm currently working on improving test case reliability and updating / adding some new examples, which will include an LDA example.

Fantastic! I think I'm 90% of the way there from your tip.
Maybe. :) Now I just need a way to map the results back to words, or stemmed words, to see if the topics make sense.

 public static void main(final String[] args) {

    final ClassificationTextDataLoader mydl = new ClassificationTextDataLoader(
        new StemmingTokenizer(new PaiceHuskStemmer(),
            new StopWordTokenizer(new NaiveTokenizer(), StopWordTokenizer.ENGLISH_STOP_SMALL_BASE))
        , new OkapiBM25()) {
      final ImmutableMap<String, Integer> LABELS = ImmutableMap.of("ham", 0, "spam", 1);
      final File textFile = new File("sms_spam_ham.tsv");

      @Override
      protected void setLabelInfo() {
        labelInfo = new CategoricalData(LABELS.size());
        LABELS.entrySet().stream().forEach(ent -> {
          labelInfo.setOptionName(ent.getKey(), ent.getValue());
        });
      }

      @Override
      public void initialLoad() {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(textFile)))) {
          br.lines().map(s -> s.split("\t")).forEach(row -> {
            addOriginalDocument(row[1], LABELS.get(row[0]));
          });
        } catch (final IOException ioe) {
          // no logger was defined in this snippet, so rethrow instead of LOG.log(...)
          throw new RuntimeException(ioe);
        }
      }
    };

    final ClassificationDataSet cds = mydl.getDataSet();
    System.out.println("Data set loaded in");
    System.out.println(cds.getSampleSize() + " data points");
    System.out.println(cds.getNumNumericalVars() + " features");

    System.out.println("TOPC TIME");
    final ExecutorService ex = Executors.newFixedThreadPool(SystemInfo.LogicalCores);
    final OnlineLDAsvi lda = new OnlineLDAsvi();
    lda.model(cds, 10, ex);
    System.out.println("Done modeling.");
    System.out.println(lda.getTopicVec(0));
    ex.shutdown(); // let the JVM exit once modeling finishes
  }

You should use simple counts when using LDA rather than Okapi or some other word weighting scheme. Stemming will also make it harder to interpret the results of LDA. I wouldn't use it in this case unless your corpus is quite small.

The TextDataLoader has a method getWordForIndex that maps the index back to the word. But if you do stemming you will get the stemmed word, which makes the interpretation harder.
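Swapping to simple counts is just a constructor change. A sketch (WordCount is the count-based weighting; the loader body is elided since it's the same as the snippets in this thread):

    import jsat.text.ClassificationTextDataLoader;
    import jsat.text.tokenizer.NaiveTokenizer;
    import jsat.text.tokenizer.StopWordTokenizer;
    import jsat.text.wordweighting.WordCount;

    // Same loader shape as above, but with plain word counts (no Okapi/TF-IDF)
    // and no stemming, which keeps LDA's topics interpretable.
    final ClassificationTextDataLoader loader = new ClassificationTextDataLoader(
        new StopWordTokenizer(new NaiveTokenizer(), StopWordTokenizer.ENGLISH_STOP_SMALL_BASE),
        new WordCount()) {
      @Override
      protected void setLabelInfo() {
        // same label setup as in the snippets above
      }

      @Override
      public void initialLoad() {
        // same document loading as in the snippets above
      }
    };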

That works. But maybe I'm doing something wrong. I grabbed a bunch of news articles from http://mlg.ucd.ie/datasets/bbc.html

and I get really strange duplication in my topics:

TOPIC 0
exel ntpc logistics jarvis walkington mischief pco nozaki yili evacuate khurmala bewlay contrack smfg mukesh slavery britannia aig icbc turkmenistan
TOPIC 1
southampton west tottenham blackburn sunderland spurs forest redknapp aston brentford swansea burnley manchester palace ham park nottingham villa leary bnp
TOPIC 2
exel ntpc logistics jarvis walkington mischief pco nozaki yili evacuate khurmala bewlay contrack smfg mukesh slavery britannia aig icbc turkmenistan
TOPIC 3
exel ntpc logistics jarvis walkington mischief pco nozaki yili evacuate khurmala bewlay contrack smfg mukesh slavery britannia aig icbc turkmenistan
TOPIC 4

final ClassificationTextDataLoader mydl = new ClassificationTextDataLoader(
    new StopWordTokenizer(new NaiveTokenizer(), StopWordTokenizer.ENGLISH_STOP_SMALL_BASE),
    new TfIdf()) {
  final ImmutableMap<String, Integer> LABELS = ImmutableMap.of("business", 0, "entertainment", 1, "politics", 2,
      "sport", 3, "tech", 4);

  @Override
  protected void setLabelInfo() {
    labelInfo = new CategoricalData(LABELS.size());
    LABELS.entrySet().stream().forEach(ent -> {
      labelInfo.setOptionName(ent.getKey(), ent.getValue());
    });
  }

  @Override
  public void initialLoad() {
    LABELS.keySet().forEach(label -> {
      final File[] textFiles = new File(DATA_ROOT_FOLDER + label)
          .listFiles((dir, name) -> name.toLowerCase().endsWith(".txt"));
      Arrays.asList(textFiles).forEach(file -> {
        String content;
        try {
          content = Files.toString(file, Charsets.UTF_8);
          addOriginalDocument(content, LABELS.get(label));
        } catch (IOException e) {
          throw new RuntimeException(e);
        }

      });
    });
  }
};

final ClassificationDataSet cds = mydl.getDataSet();
System.out.println("Data set loaded in");
System.out.println(cds.getSampleSize() + " data points");
System.out.println(cds.getNumNumericalVars() + " features");

System.out.println("TOPC TIME");
final ExecutorService ex = Executors.newFixedThreadPool(SystemInfo.LogicalCores);
final OnlineLDAsvi lda = new OnlineLDAsvi();
lda.setEpochs(200); // GUESSING ON THE 200!  what is a good starter number?  Any way to know when it settles?
final int NUM_TOPICS = 100; // ALSO GUESSING ON THE 100, but I'm ok with this guess.
lda.model(cds, NUM_TOPICS, ex);
System.out.println("Done modeling.");

for (int i = 0; i < NUM_TOPICS; i++) {
  System.out.println("TOPIC " + i);
  final Vec topic = lda.getTopicVec(i);
  StreamSupport.stream(topic.spliterator(), false)
      .map(iv -> new AbstractMap.SimpleEntry<>(mydl.getWordForIndex(iv.getIndex()), iv.getValue()))
      .sorted((e1, e2) -> Double.compare(e2.getValue(), e1.getValue())) // highest weight first
      .limit(20)
      .forEach(e -> System.out.print(e.getKey() + " "));
  System.out.println();
}
1. You should be using the WordCount weighting, not anything fancy like Okapi.
2. This algorithm for LDA works best on large corpora (as the documentation says, 100k+ documents). That corpus isn't even 3k, so you may have to do a lot more parameter tuning to get decent results.
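For the tuning part, the main knobs on OnlineLDAsvi are the priors and the learning-rate schedule from the stochastic variational inference paper (Hoffman et al.). A sketch: the setter names below are assumptions from memory of the API, so verify them against the OnlineLDAsvi javadoc, and treat the values as starting guesses rather than recommendations.

    // Hypothetical tuning sketch for a small corpus; setter names and
    // values are assumptions to check, not confirmed recommendations.
    final OnlineLDAsvi lda = new OnlineLDAsvi();
    lda.setEpochs(50);              // more passes over a small corpus
    lda.setMiniBatchSize(256);      // smaller batches -> more, noisier updates
    lda.setAlpha(1.0 / NUM_TOPICS); // sparser document-topic prior
    lda.setEta(1.0 / NUM_TOPICS);   // sparser topic-word prior
    lda.setKappa(0.7);              // learning-rate decay exponent, in (0.5, 1]
    lda.model(cds, NUM_TOPICS, ex);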

There is now an example for this: https://github.com/EdwardRaff/JSAT/wiki/Basic-Topic-Modeling-with-LDA, so I'm closing this issue.