vgherard / sbo

Utilities for training and evaluating text predictors based on Stupid Back-off N-gram models.

Pruning the dictionary

enzedonline opened this issue

More of a question about the usage guide than an issue:

In the starter guide, under Evaluating next-word predictions, pruning the dictionary to remove seldom-used words is mentioned. Could somebody clarify the ranking (is this in terms of word frequency in the corpus?) and how to go about pruning an SBO dictionary according to rank?

Remarkable package - thanks for your great work on this!

Hey there, thanks for the encouraging words!

Yes, the ranking is in terms of word frequency in the corpus. The two types of pruning supported by the sbo_dictionary() constructor (sketched after the list below) let you:

  • Choose the N most frequent words in the corpus (where max_size = N)
  • Choose the most frequent words which provide a given % coverage of the corpus (set by the target argument)
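
For concreteness, here is a minimal sketch of both modes - assuming the current sbo_dictionary() signature, and using the twitter_train example corpus bundled with sbo purely for illustration:

library(sbo)

# Prune by rank: keep only the 1000 most frequent words in the corpus
dict_top_n <- sbo_dictionary(twitter_train, max_size = 1000,
                             .preprocess = preprocess, EOS = ".?!:;")

# Prune by coverage: keep the most frequent words covering ~75% of the corpus
dict_cover <- sbo_dictionary(twitter_train, target = 0.75,
                             .preprocess = preprocess, EOS = ".?!:;")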

Hope this helps.

Cheers!

Ah, ok, great, thanks. Needs some pre-processing to achieve my route then - I'm using a custom dictionary, created outside of the corpus, to restrict suggestions from the corpus to only those in the custom dictionary. I was wondering if there was some sort of post-processing that could be done on the pred table to achieve this.

Maybe redundant for what I'm doing: I trained my model with 100,000 sentences scraped from blog and news sources and N = 4, with a full English dictionary (48K words), and the pred table was still only 135MB. I wrestled with trying to build a word-level generator (rather than a character-level generator) in TensorFlow, but memory always killed that route when it got to building the feature matrix.

library(sbo)
library(magrittr)  # for %>%

# Custom dictionary: read the hunspell-style .dic word list and strip the
# affix flags after the "/" in each entry
en_US.dic <- sbo_dictionary(readLines("./data/dict/en_US.dic") %>% gsub("/.+", "", .))

sbo.model <- sbo_predtable(object = unlist(data[, 'sentence']),  # training sentences
                           N = 4,                           # 4-gram model
                           dict = en_US.dic,                # custom dictionary
                           .preprocess = sbo::preprocess,   # built-in preprocessing
                           EOS = ".?!:;",                   # end-of-sentence characters
                           lambda = 0.3,                    # back-off penalty
                           L = 30L,                         # predictions per input
                           filtered = c("<UNK>", "<EOS>")   # never predict these tokens
)

Gives me 67% accuracy on the test data, which is impressive. Incidentally, on everything I trained, I got the best accuracy with lambda = 0.3.

I pushed the model into a demo on Shiny - crude interface but does the job ;)

I was also thinking of using this in a non-NLP scenario - sequence prediction in pseudo-chaotic environments, or unusual-behaviour detection (when the observed outcome isn't in the list of n predicted outcomes, etc.).

Thanks again!

The first thing i noticed about this post in may from another part of the problem and a small token of my appreciation of myself of the world and the media ... [Ancient Markoff wisdom]

If I understand correctly, you want to avoid predicting words outside of your custom dictionary (en_US.dic). If so, you are already achieving this through the code you're showing here: filtered = c("<UNK>", "<EOS>") establishes that the unknown-word token, as well as the end-of-sentence token, will never be predicted; hence only actual words from your custom dict are output by the model.
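
As a quick check, you can build a predictor from your prediction table and inspect its output directly (a sketch assuming the sbo.model object from your snippet):

library(sbo)

# Build the in-memory predictor from the prediction table
sbo.pred <- sbo_predictor(sbo.model)

# Returns the L = 30 most likely completions; with the filter above,
# neither <UNK> nor <EOS> can appear among them
predict(sbo.pred, "i love")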

Just out of curiosity, what is the exact metric you are referring to with 67% accuracy? Accuracy at the top 30 predictions (i.e. a word counts as correctly predicted if it appears within the top 30 predictions)? I'm asking because 67% accuracy sounds quite high for this kind of model.

PS, with a bit of self-advertisement: if you're interested in using N-gram models in a more general setting, I suggest taking a look at kgrams, which supports more general (and more powerful) models than Stupid Back-off and has an improved API (at some point I'd love to reconcile the interfaces of sbo and kgrams...). From my side, I can tell you I've been using Markov models on product analytics data with surprisingly nice results :-)

Sure - the original question was about pruning the dictionary to avoid bloating the model size; in the end, 135MB wasn't a problem for the purpose, so pruning wasn't necessary. I added <EOS> to filtered since the purpose was a mock SwiftKey-style app, so end-of-sentence predictions weren't desired here. I used 30 predictions because I wanted to replace the top 3 predictions with any predictions matching the first letters of the next typed word, in order of likelihood (roughly as sketched below).
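
Roughly this, as a sketch - the suggest() helper is hypothetical, not part of sbo:

# Narrow the L = 30 suggestions to those starting with the letters typed so
# far, keeping their likelihood order, and show the top n in the UI
suggest <- function(predictor, context, typed_prefix = "", n = 3) {
  preds <- predict(predictor, context)              # 30 predictions, most likely first
  if (nzchar(typed_prefix)) {
    hits <- preds[startsWith(preds, typed_prefix)]  # keep prefix matches, in order
    if (length(hits) > 0) preds <- hits
  }
  head(preds, n)
}

# e.g. suggest(sbo.pred, "see you", typed_prefix = "t")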

I started with training the dictionary straight from the corpus, but got too much junk from the Twitter samples, so I switched to a regular dictionary. So I figure that if I were going to prune the external dictionary, I'd need to rank the dictionary words by their frequency in the corpus first, then slice the top n words from the dictionary before training the model - something like the sketch below.
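
A rough sketch of that pre-processing route (rank_dict_by_corpus() is a hypothetical helper, dict_words is the plain character vector read from the .dic file, and the tokenisation here is deliberately naive):

# Rank the external dictionary words by their frequency in the training
# sentences, then keep only the top n before building the sbo dictionary
rank_dict_by_corpus <- function(dict_words, sentences, n) {
  tokens <- unlist(strsplit(tolower(sentences), "\\s+"))
  freq <- table(tokens[tokens %in% dict_words])   # counts restricted to dictionary words
  names(sort(freq, decreasing = TRUE))[seq_len(min(n, length(freq)))]
}

# e.g. keep the 20,000 highest-ranked dictionary words (n chosen arbitrarily):
# pruned_words <- rank_dict_by_corpus(dict_words, unlist(data[, 'sentence']), n = 20000)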

The original corpus (3,000,000 samples from blog/news/twitter sources) was split into train/test/validation subsets, so the 67% accuracy came from running the predictor against 100,000 sentences from the validation set. Pre-processing was stripping out numbers, URLs and punctuation other than apostrophes, then splitting everything into sentences and pruning sentences with fewer than 4 words (since I was running with N = 4) - roughly as in the sketch below.
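
An illustrative sketch of that pre-processing (not the exact regexes I used):

# Strip numbers and URLs, split into sentences, keep only letters and
# apostrophes, and drop sentences with fewer than 4 words
clean_corpus <- function(lines) {
  lines <- gsub("https?://\\S+", " ", lines)         # drop URLs
  lines <- gsub("[0-9]+", " ", lines)                # drop numbers
  sentences <- unlist(strsplit(lines, "[.?!:;]+"))   # split into sentences
  sentences <- gsub("[^a-zA-Z' ]", " ", sentences)   # keep letters and apostrophes
  sentences <- trimws(gsub("\\s+", " ", sentences))  # squash whitespace
  sentences[lengths(strsplit(sentences, " ")) >= 4]  # prune sentences under 4 words
}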

The evaluation ignored rows where <EOS> was the true next word, since the filter precluded those. I guess the accuracy is high due to the 30 chances of being right - it was around 55% on the earlier test models with 3 predictions, which I'm still really impressed by.

library(sbo)
library(dplyr)

# sbo.pred is the predictor built from the prediction table above,
# i.e. sbo.pred <- sbo_predictor(sbo.model)
evaluation <- eval_sbo_predictor(sbo.pred, test = unlist(valid_data[, 'sentence']))

evaluation %>%
    filter(true != "<EOS>") %>%    # drop rows where the true next word is <EOS>
    summarise(
        accuracy = sum(correct) / n(),
        uncertainty = sqrt(accuracy * (1 - accuracy) / n())  # binomial standard error
    )

Thanks for the kgrams link! I'll follow up when I get a chance.