nipunsadvilkar / pySBD

🐍💯 pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box.


Performance improvement?

dakinggg opened this issue · comments

I am not certain of this, but I suspect there might be room for performance improvement by using re.compile to precompile all of the needed regexes. Otherwise they will have to be recompiled regularly (once the re cache of 100 entries has been exceeded).
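A minimal sketch of the difference being suggested (the pattern and function names here are hypothetical, not pySBD's actual code): a module-level `re.findall` call looks up or recompiles its pattern on every invocation, while a precompiled pattern object does the compilation exactly once.

```python
import re

# Module-level call: the pattern string is looked up in re's internal
# cache (and recompiled if it was evicted) on every invocation.
def count_digits_uncompiled(texts):
    return sum(len(re.findall(r"\d+", t)) for t in texts)

# Precompiled: compilation happens once, up front, and the pattern
# object's methods are called directly afterwards.
DIGIT_RE = re.compile(r"\d+")

def count_digits_compiled(texts):
    return sum(len(DIGIT_RE.findall(t)) for t in texts)

texts = ["abc 123", "45 def 6"] * 1000
assert count_digits_uncompiled(texts) == count_digits_compiled(texts)
```

Inside a hot loop the precompiled version saves one cache lookup per call at minimum, and a full recompile whenever the pattern has been evicted from the cache.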

I don't think there will be any perceivable difference unless some of the regexes are used inside a loop.

Quote from the Python 3 docs for your perusal:

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.

I think that if there are more than 100 regexes (including any that are in loops, like here:

for abbr in Abbreviation.ABBREVIATIONS:
), the cache will cycle, and the regexes will have to be recompiled. Given that ABBREVIATIONS alone contains 188 entries, I suspect that the cache is cycling.
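A sketch of what precompiling such a loop could look like, using a stand-in abbreviation list (the names, patterns, and helper below are illustrative only, not pySBD's actual implementation):

```python
import re

# Stand-in for pySBD's Abbreviation.ABBREVIATIONS (188 entries in the
# real list); these values are purely illustrative.
ABBREVIATIONS = ["dr", "mr", "mrs", "prof", "etc"]

# Compile every pattern once, at import time, instead of letting the
# module-level re functions recompile them whenever the cache cycles.
ABBREVIATION_RES = [
    re.compile(r"(?:^|\s)" + re.escape(abbr) + r"\.", re.IGNORECASE)
    for abbr in ABBREVIATIONS
]

def count_abbreviations(text):
    # The precompiled pattern objects are reused on every call, so the
    # size of re's internal cache no longer matters here.
    return sum(1 for pat in ABBREVIATION_RES if pat.search(text))
```

Because the list is iterated for every input sentence, each pattern is compiled once here rather than potentially once per sentence.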

Here is another regex in a loop:

for ind, item in enumerate(list_array):

Yes, I agree there will be such a regex within a loop. In that case, would you mind tweaking it with precompiled ones and assessing the performance? I would love to assist with that.
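One way that assessment could be run, sketched with `timeit` and synthetic patterns (everything below is hypothetical; note that current CPython caches up to 512 compiled patterns, so the pattern count is deliberately set above that to force evictions):

```python
import re
import timeit

# 600 distinct patterns: more than CPython's re cache holds (512 in
# current versions), so module-level calls keep missing the cache.
PATTERNS = [rf"\d+x{i}" for i in range(600)]
COMPILED = [re.compile(p) for p in PATTERNS]
TEXT = "foo 12x3 bar 99x42 baz " * 5

def scan_uncompiled():
    # Each findall goes through re's cache; with >512 live patterns
    # cycled in order, most lookups miss and trigger a recompile.
    return sum(len(re.findall(p, TEXT)) for p in PATTERNS)

def scan_compiled():
    # Pattern objects were compiled once, up front.
    return sum(len(c.findall(TEXT)) for c in COMPILED)

assert scan_uncompiled() == scan_compiled()
print("module-level:", timeit.timeit(scan_uncompiled, number=20))
print("precompiled: ", timeit.timeit(scan_compiled, number=20))
```

Swapping the synthetic patterns for pySBD's real ones and timing `Segmenter.segment` on a fixed corpus before and after would give the concrete numbers this issue is asking for.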

Not sure when I will have time to do it, but I can try at some point.

Cool, same here! I will let you know if I happen to do this performance exercise.