NIHOPA / NLPre

Python library for Natural Language Preprocessing (NLPre)

Speed up replace_from_dictionary

thoppe opened this issue

Even though speed isn't our top concern, replace_from_dictionary is orders of magnitude slower than most functions.

                                    time      frac
function                                          
unidecoder                      0.000008  0.000018
token_replacement               0.000010  0.000022
dedash                          0.000535  0.001172
titlecaps                       0.003216  0.007043
decaps_text                     0.003802  0.008327
identify_parenthetical_phrases  0.009862  0.021598
replace_acronyms                0.012591  0.027574
separated_parenthesis           0.013224  0.028960
pos_tokenizer                   0.068994  0.151094
replace_from_dictionary         0.344384  0.754191

Profiling suggests that the loop below takes up a significant fraction of that time and could probably be refactored:

        # Identify which phrases were used and possible replacements
        R = collections.defaultdict(list)
        for key, val in self.rdict.iteritems():
            if key in ldoc:
                R[val].append(key)

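By first matching against a unique subset of the words (with proper breaks for punctuation), we can cut the time needed for this module by about 50%. In the new profile, replace_from_dictionary drops from 0.344384 to 0.127977, and its share of the total from 75% to 60%: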

                                    time      frac
function                                          
token_replacement               0.000007  0.000035
unidecoder                      0.000009  0.000043
dedash                          0.000346  0.001635
titlecaps                       0.001804  0.008525
decaps_text                     0.002472  0.011680
identify_parenthetical_phrases  0.005658  0.026733
replace_acronyms                0.006414  0.030304
separated_parenthesis           0.006859  0.032405
pos_tokenizer                   0.060114  0.284009
replace_from_dictionary         0.127977  0.604630
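
For readers who want to see the idea in code, here is a minimal sketch of one way to match against the document's unique words before running the substring test. It is not the patch that went into NLPre. The names `build_index`, `find_replacements` and `_WORD`, and the example dictionary entries, are hypothetical. It assumes that `ldoc` and the dictionary keys are already lowercased. Unlike the original loop, it only finds phrases whose first word appears as a whole token in the document, so a match that starts in the middle of a word is skipped.

```python
import collections
import re

# Hypothetical tokenizer: runs of word characters, so punctuation acts as a break.
_WORD = re.compile(r"\w+")


def build_index(rdict):
    """Group dictionary phrases by their first word (built once, reused per document)."""
    index = collections.defaultdict(list)
    for key in rdict:
        words = _WORD.findall(key)
        if words:
            index[words[0]].append(key)
    return index


def find_replacements(ldoc, rdict, index):
    """Return {replacement: [phrases found in ldoc]} using the first-word index."""
    R = collections.defaultdict(list)
    # Visit each distinct word of the document once.
    for word in set(_WORD.findall(ldoc)):
        # Only phrases starting with this word are candidates.
        for key in index.get(word, ()):
            # Confirm the full phrase with the original substring test.
            if key in ldoc:
                R[rdict[key]].append(key)
    return R


# Made-up example data, for illustration only.
rdict = {
    "heart attack": "myocardial_infarction",
    "myocardial infarction": "myocardial_infarction",
    "high blood pressure": "hypertension",
}
index = build_index(rdict)
ldoc = "the patient had a heart attack, also called a myocardial infarction."
R = find_replacements(ldoc, rdict, index)
print(sorted(R["myocardial_infarction"]))
# ['heart attack', 'myocardial infarction']
```

With this layout the per-document work scales with the number of distinct words in the document rather than with the size of the dictionary. Whether that helps in practice depends on how large the dictionary is relative to a typical document.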