Speed up replace_from_dictionary
thoppe opened this issue · comments
Travis Hoppe commented
Even though speed isn't our top concern, replace_from_dictionary
is orders of magnitude slower than most functions.
function                            time      frac
unidecoder                      0.000008  0.000018
token_replacement               0.000010  0.000022
dedash                          0.000535  0.001172
titlecaps                       0.003216  0.007043
decaps_text                     0.003802  0.008327
identify_parenthetical_phrases  0.009862  0.021598
replace_acronyms                0.012591  0.027574
separated_parenthesis           0.013224  0.028960
pos_tokenizer                   0.068994  0.151094
replace_from_dictionary         0.344384  0.754191
Travis Hoppe commented
Profiling suggests this takes up a significant fraction of time and could probably be refactored:
    # Identify which phrases were used and possible replacements
    R = collections.defaultdict(list)
    for key, val in self.rdict.iteritems():
        if key in ldoc:
            R[val].append(key)
Travis Hoppe commented
By first matching against a unique subset of the words (with proper breaks for punctuation), we can cut the time needed for this module by about 50%!
function                            time      frac
token_replacement               0.000007  0.000035
unidecoder                      0.000009  0.000043
dedash                          0.000346  0.001635
titlecaps                       0.001804  0.008525
decaps_text                     0.002472  0.011680
identify_parenthetical_phrases  0.005658  0.026733
replace_acronyms                0.006414  0.030304
separated_parenthesis           0.006859  0.032405
pos_tokenizer                   0.060114  0.284009
replace_from_dictionary         0.127977  0.604630
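For anyone following along, here is a minimal sketch of the word-prefilter idea, not the actual patch. It assumes Python 3, a lowercased document string `ldoc`, and a plain dict `rdict` mapping phrase to replacement. The names `TOKEN`, `build_first_word_index`, and `find_replacements_indexed` are hypothetical. Instead of running one substring scan per dictionary key, it indexes the keys by their first word once, then only tests keys whose first word actually appears in the document.

```python
import collections
import re

# Assumption: "words" are runs of word characters, so punctuation acts as a break.
TOKEN = re.compile(r"\w+")


def build_first_word_index(rdict):
    """Map the first word of each phrase to the phrases that start with it.

    Built once per dictionary, not once per document.
    """
    index = collections.defaultdict(list)
    for key in rdict:
        tokens = TOKEN.findall(key)
        if tokens:  # keys with no word characters are skipped in this sketch
            index[tokens[0]].append(key)
    return index


def find_replacements_indexed(rdict, index, ldoc):
    """Return {replacement: [phrases found]} using the word prefilter."""
    R = collections.defaultdict(list)
    # Only the unique words of the document are looked up in the index.
    for word in set(TOKEN.findall(ldoc)):
        for key in index.get(word, ()):
            # The full substring check now runs on a few candidates only.
            if key in ldoc:
                R[rdict[key]].append(key)
    return R


rdict = {
    "heart attack": "myocardial_infarction",
    "myocardial infarction": "myocardial_infarction",
    "high blood pressure": "hypertension",
}
ldoc = "the patient had a heart attack, then high blood pressure."

index = build_first_word_index(rdict)
print(dict(find_replacements_indexed(rdict, index, ldoc)))
```

One behavioral difference from the original loop: a key whose first word only occurs inside a longer word (for example `art` inside `heart`) is no longer a candidate, whereas a bare `key in ldoc` test would match it. The order of phrases within each list is also not guaranteed, since it follows set iteration order.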