Is there any suitable case to prove that google-re2 is faster in python?
betterlch opened this issue · comments
Here is a sample from new bing:
import re
import re2
import timeit
pattern = r"\w+"
text = "Hello, world!"
re_time = timeit.timeit(lambda: re.search(pattern, text), number=100000)
re2_time = timeit.timeit(lambda: re2.search(pattern, text), number=100000)
print(f"re time: {re_time:.6f} seconds")
print(f"re2 time: {re2_time:.6f} seconds")
result:
re time: 0.153945 seconds
re2 time: 2.977748 seconds
And then, I generated a sequence of 500,000+ strings to match:
import random
import string

seq = string.punctuation + string.ascii_letters + string.digits
keys = set()
for i in range(1, len(seq)):
    gen = {''.join(random.sample(seq, i)) for _ in range(0, 10000)}
    keys |= gen
with open('random.txt', 'w') as f:
    f.write('\n'.join(keys))
re2 is also slower.
Maybe my usage is incorrect?
Sure, if you change pattern
to r"(?:.?){12}(?:.){12}"
, then the printed output will look something like this:
re time: 7.812947 seconds
re2 time: 1.826364 seconds
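To reproduce this comparison when `re2` may not be installed, here is a minimal sketch using only the stdlib `re` and `timeit`; the `re2` call is shown commented out, and the absolute timings will vary by machine:

```python
import re
import timeit

# Ambiguous nested repeats like this stress a backtracking engine:
# Python's re must explore many ways to split the text between the
# optional and mandatory groups, while RE2's automaton-based matching
# runs in time linear in the input.
pattern = r"(?:.?){12}(?:.){12}"
text = "Hello, world!"

re_time = timeit.timeit(lambda: re.search(pattern, text), number=1000)

# With google-re2 installed, the comparable measurement would be:
# import re2
# re2_time = timeit.timeit(lambda: re2.search(pattern, text), number=1000)

print(f"re time: {re_time:.6f} seconds")
```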
Does that mean I should translate my regex from Python re syntax to google-re2 syntax to make it faster?
And using more wildcards instead of literal characters seems to suit re2 better?
Please read the WhyRE2 wiki page. I agree that performance is a feature; I disagree that performance is the only feature that should be considered when choosing a regular expression library. Note also that application-specific metrics would be vastly preferable to rando benchmarks when making a decision if performance is indeed critical.
@betterlch Application specific metrics are also important for understanding use patterns. For example, your Python program passes Python strings, which causes an encoding step in every regex search call and offset translation. If you instead do:
pattern = r"\w+".encode('utf-8')
text = "Hello, world!".encode('utf-8')
then your RE2 benchmark gets a little faster. The way you use regexes in your application may or may not make this plausible for you to do.
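A sketch of the bytes-based variant, shown here with the stdlib `re` module (whose API google-re2's Python binding mirrors); passing `bytes` rather than `str` is what lets RE2 skip the per-call UTF-8 encoding and offset translation:

```python
import re
import timeit

# Encode once up front, instead of implicitly on every search call.
pattern = r"\w+".encode('utf-8')
text = "Hello, world!".encode('utf-8')

# A bytes pattern must be searched against bytes text (and vice versa).
compiled = re.compile(pattern)
bytes_time = timeit.timeit(lambda: compiled.search(text), number=100000)
print(f"bytes re time: {bytes_time:.6f} seconds")
```

With google-re2 installed, the same shape applies: compile the bytes pattern with `re2.compile` and search bytes text, so the encoding cost is paid once rather than per call.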
Yes, that is a useful detail; I found it with cProfile. After changing my regex, re2 has reached my desired performance.
I agree with you, but because my existing dataset is very small, I chose to randomly generate the string sequences. Choosing according to local conditions is really what I should do.