Detab function performance and \t matching in regexes
GoogleCodeExporter opened this issue · comments
Google Code Exporter commented
The original Detab function uses a regex (_deTab) to convert tabs to an
appropriate amount of spaces. Looking at this approach, it seems that a lot
of strings are generated rather unnecessary, and a lot of overhead is
involved.
Preliminary performance benchmark was:
input string length: 475
6000 iterations in 4691 ms (0,781833333333333 ms per iteration)
input string length: 2356
1500 iterations in 4489 ms (2,99266666666667 ms per iteration)
input string length: 27737
150 iterations in 5159 ms (34,3933333333333 ms per iteration)
input string length: 11075
300 iterations in 4506 ms (15,02 ms per iteration)
input string length: 88607
35 iterations in 4268 ms (121,942857142857 ms per iteration)
input string length: 354431
10 iterations in 4857 ms (485,7 ms per iteration)
By doing this fairly crucial (many input lines match _deTab at least once)
part "manually" - see attached diff file for exact code - an about 10%
performance gain can be achieved:
input string length: 475
6000 iterations in 4187 ms (0,697833333333333 ms per iteration)
input string length: 2356
1500 iterations in 4011 ms (2,674 ms per iteration)
input string length: 27737
150 iterations in 4464 ms (29,76 ms per iteration)
input string length: 11075
300 iterations in 4087 ms (13,6233333333333 ms per iteration)
input string length: 88607
35 iterations in 3922 ms (112,057142857143 ms per iteration)
input string length: 354431
10 iterations in 4451 ms (445,1 ms per iteration)
Also, it is now fairly obvious that all '\t' are removed from the input
stream at this point, therefore all "[ \t]" (or similar) parts in the
regular expressions can have their tab-matching stripped out. The savings
in this (apart from readability) vary pretty big with respect to the input:
input string length: 475
6000 iterations in 4114 ms (0,685666666666667 ms per iteration)
input string length: 2356
1500 iterations in 3826 ms (2,55066666666667 ms per iteration)
input string length: 27737
150 iterations in 4239 ms (28,26 ms per iteration)
input string length: 11075
300 iterations in 3884 ms (12,9466666666667 ms per iteration)
input string length: 88607
35 iterations in 3719 ms (106,257142857143 ms per iteration)
input string length: 354431
10 iterations in 4264 ms (426,4 ms per iteration)
All unit tests pass the same after these modifications.
Original issue reported on code.google.com by Shio...@gmail.com
on 10 Jan 2010 at 1:43
Attachments:
Google Code Exporter commented
Actually, I just noticed that "CodeBlockEvaluator" calls Detab again. Since no
tabs are
inserted anywhere, this precaution is unnecessary and should be removable. This
does
not have any measurable impact on the Benchmark results on my machine, however.
Original comment by Shio...@gmail.com
on 10 Jan 2010 at 2:30
Google Code Exporter commented
I was thinking that we needed to do this, as PHPMarkdown does the same. Makes
WAY
more sense.
Excellent and very helpful contribution, checked in as r96
(and yes, I removed all the /t from regex and the codeblockevaluator call as
well --
we are TAB-FREE!)
Original comment by wump...@gmail.com
on 11 Jan 2010 at 2:29
- Changed state: Fixed
Google Code Exporter commented
I think we should merge the newline normalization code and the detab code into
the same
loop -- don't you? Can you give it a shot?
Original comment by wump...@gmail.com
on 11 Jan 2010 at 4:53
Google Code Exporter commented
Yep, will do.
Original comment by Shio...@gmail.com
on 11 Jan 2010 at 5:06