google / re2

RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.


Is there any suitable case to prove that google-re2 is faster in python?

betterlch opened this issue · comments

commented

Here is a sample from New Bing:

import re
import re2
import timeit

pattern = r"\w+"
text = "Hello, world!"

re_time = timeit.timeit(lambda: re.search(pattern, text), number=100000)
re2_time = timeit.timeit(lambda: re2.search(pattern, text), number=100000)

print(f"re time: {re_time:.6f} seconds")
print(f"re2 time: {re2_time:.6f} seconds")
result: 
  re time: 0.153945 seconds
  re2 time: 2.977748 seconds
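One thing worth controlling for in this micro-benchmark: the module-level `re.search` repeats a pattern-cache lookup on every call, and on a 13-character haystack any engine's fixed per-call overhead dominates the actual matching work. A sketch that precompiles the pattern (stdlib `re` shown; if the `re2` wrapper exposes a matching `compile`, the same change applies there):

```python
import re
import timeit

pattern = re.compile(r"\w+")   # compile once, outside the timed loop
text = "Hello, world!"

# Timing now reflects matching alone, not the per-call cache lookup.
compiled_time = timeit.timeit(lambda: pattern.search(text), number=100000)
print(f"compiled re time: {compiled_time:.6f} seconds")
```

On inputs this short, the numbers mostly compare wrapper and call overhead rather than the regex engines themselves.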

I then generated a sequence of over 500,000 random strings to match:

import random
import string

seq = string.punctuation + string.ascii_letters + string.digits

keys = set()
for i in range(1, len(seq)):
    # Draw 10,000 random strings of length i; duplicates collapse in the set.
    gen = {''.join(random.sample(seq, i)) for _ in range(10000)}
    keys |= gen
with open('random.txt', 'w') as f:
    f.write('\n'.join(keys))
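(The issue doesn't show the matching code, but a pass over such a corpus might be sketched like this, with the stdlib `re` module standing in; substituting `re2` keeps the same shape. The corpus here is shrunk to keep the sketch fast.)

```python
import random
import re
import string
import timeit

seq = string.punctuation + string.ascii_letters + string.digits

# A small stand-in corpus; the issue uses 500,000+ generated strings.
lines = [''.join(random.sample(seq, random.randint(1, len(seq))))
         for _ in range(1000)]

pattern = re.compile(r"\w+")
# Count how many strings contain at least one word character.
elapsed = timeit.timeit(
    lambda: sum(1 for s in lines if pattern.search(s)), number=10)
print(f"stdlib re over {len(lines)} strings: {elapsed:.6f} seconds")
```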

re2 is also slower.

Maybe my usage is incorrect?

Sure, if you change pattern to r"(?:.?){12}(?:.){12}", then the printed output will look something like this:

re time: 7.812947 seconds
re2 time: 1.826364 seconds
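The reason this pattern flips the result: `(?:.?){12}` can match anywhere from 0 to 12 characters, so a backtracking engine may try combinatorially many ways to split the input between the two groups, while RE2's automaton-based matching runs in time linear in the input. The pattern's semantics can be checked with the stdlib `re` alone (a sketch):

```python
import re

pattern = r"(?:.?){12}(?:.){12}"

# (?:.?){12} can match the empty string, but (?:.){12} requires exactly
# 12 characters, so the pattern needs at least 12 characters to match.
assert re.search(pattern, "x" * 12) is not None
assert re.search(pattern, "x" * 11) is None
```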
commented

Sure, if you change pattern to r"(?:.?){12}(?:.){12}", then the printed output will look something like this:

re time: 7.812947 seconds
re2 time: 1.826364 seconds

Does it mean that I should translate my regex from Python re syntax to google-re2 syntax so that I can make it faster?
And that using more wildcards instead of fixed characters is a better fit for re2?

Please read the WhyRE2 wiki page. I agree that performance is a feature; I disagree that performance is the only feature that should be considered when choosing a regular expression library. Note also that application-specific metrics would be vastly preferable to random benchmarks when making a decision if performance is indeed critical.

@betterlch Application-specific metrics are also important for understanding usage patterns. For example, your Python program passes Python strings, which causes an encoding step and an offset translation in every regex search call. If you instead do:

pattern = r"\w+".encode('utf-8')
text = "Hello, world!".encode('utf-8')

then your RE2 benchmark gets a little faster. The way you use regexes in your application may or may not make this practical for you to do.
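The same bytes-versus-str distinction exists in the stdlib `re` module, so the effect of encoding once up front can be sketched there (the exact savings inside the `re2` wrapper depend on its implementation):

```python
import re

# Encode once, outside the search loop, instead of paying a str->bytes
# conversion (and offset translation) on every call.
pattern = re.compile(rb"\w+")            # bytes pattern
text = "Hello, world!".encode('utf-8')   # bytes haystack

match = pattern.search(text)
print(match.group())  # b'Hello'
```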

commented

@betterlch Application-specific metrics are also important for understanding usage patterns. For example, your Python program passes Python strings, which causes an encoding step and an offset translation in every regex search call. If you instead do:

pattern = r"\w+".encode('utf-8')
text = "Hello, world!".encode('utf-8')

then your RE2 benchmark gets a little faster. The way you use regexes in your application may or may not make this practical for you to do.

Yes, that is a useful detail; I found it with cProfile. After changing my regex, re2 achieved the effect I wanted.

commented

Please read the WhyRE2 wiki page. I agree that performance is a feature; I disagree that performance is the only feature that should be considered when choosing a regular expression library. Note also that application-specific metrics would be vastly preferable to random benchmarks when making a decision if performance is indeed critical.

I agree with you, but because my existing dataset is very small, I chose to randomly generate the string sequences. Adapting my approach to the actual conditions is really what I should do.