google / re2

RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.


Is there any suitable case to prove that google-re2 is faster in python?

betterlch opened this issue · comments

commented

Here is a sample from New Bing:

import re
import re2
import timeit

pattern = r"\w+"
text = "Hello, world!"

re_time = timeit.timeit(lambda: re.search(pattern, text), number=100000)
re2_time = timeit.timeit(lambda: re2.search(pattern, text), number=100000)

print(f"re time: {re_time:.6f} seconds")
print(f"re2 time: {re2_time:.6f} seconds")
result: 
  re time: 0.153945 seconds
  re2 time: 2.977748 seconds
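One thing worth controlling for in this micro-benchmark: the module-level `re.search` repeats a pattern-cache lookup on every call, and on a 13-character haystack any engine's fixed per-call overhead dominates the actual matching work. A sketch that precompiles the pattern (stdlib `re` shown; if the `re2` wrapper exposes a matching `compile`, the same change applies there):

```python
import re
import timeit

pattern = re.compile(r"\w+")   # compile once, outside the timed loop
text = "Hello, world!"

# Timing now reflects matching alone, not the per-call cache lookup.
compiled_time = timeit.timeit(lambda: pattern.search(text), number=100000)
print(f"compiled re time: {compiled_time:.6f} seconds")
```

On inputs this short, the numbers mostly compare wrapper and call overhead rather than the regex engines themselves.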

I then generated a sequence of over 500,000 random strings to match:

import random
import string

seq = string.punctuation + string.ascii_letters + string.digits

keys = set()
for i in range(1, len(seq)):
    # Draw 10,000 random strings of length i; duplicates collapse in the set.
    gen = {''.join(random.sample(seq, i)) for _ in range(10000)}
    keys |= gen
with open('random.txt', 'w') as f:
    f.write('\n'.join(keys))
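(The issue doesn't show the matching code, but a pass over such a corpus might be sketched like this, with the stdlib `re` module standing in; substituting `re2` keeps the same shape. The corpus here is shrunk to keep the sketch fast.)

```python
import random
import re
import string
import timeit

seq = string.punctuation + string.ascii_letters + string.digits

# A small stand-in corpus; the issue uses 500,000+ generated strings.
lines = [''.join(random.sample(seq, random.randint(1, len(seq))))
         for _ in range(1000)]

pattern = re.compile(r"\w+")
# Count how many strings contain at least one word character.
elapsed = timeit.timeit(
    lambda: sum(1 for s in lines if pattern.search(s)), number=10)
print(f"stdlib re over {len(lines)} strings: {elapsed:.6f} seconds")
```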

re2 is also slower.

Maybe my usage is incorrect?

Sure, if you change pattern to r"(?:.?){12}(?:.){12}", then the printed output will look something like this:

re time: 7.812947 seconds
re2 time: 1.826364 seconds
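The reason this pattern flips the result: `(?:.?){12}` can match anywhere from 0 to 12 characters, so a backtracking engine may try combinatorially many ways to split the input between the two groups, while RE2's automaton-based matching runs in time linear in the input. The pattern's semantics can be checked with the stdlib `re` alone (a sketch):

```python
import re

pattern = r"(?:.?){12}(?:.){12}"

# (?:.?){12} can match the empty string, but (?:.){12} requires exactly
# 12 characters, so the pattern needs at least 12 characters to match.
assert re.search(pattern, "x" * 12) is not None
assert re.search(pattern, "x" * 11) is None
```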
commented

Sure, if you change pattern to r"(?:.?){12}(?:.){12}", then the printed output will look something like this:

re time: 7.812947 seconds
re2 time: 1.826364 seconds

Does it mean that I should translate my regex from Python re syntax to google-re2 syntax so that I can make it faster?
And that using more wildcards instead of fixed characters is a better fit for re2?

Please read the WhyRE2 wiki page. I agree that performance is a feature; I disagree that performance is the only feature that should be considered when choosing a regular expression library. Note also that application-specific metrics would be vastly preferable to random benchmarks when making a decision if performance is indeed critical.

@betterlch Application-specific metrics are also important for understanding usage patterns. For example, your Python program passes Python strings, which causes an encoding step and an offset translation in every regex search call. If you instead do:

pattern = r"\w+".encode('utf-8')
text = "Hello, world!".encode('utf-8')

then your RE2 benchmark gets a little faster. The way you use regexes in your application may or may not make this practical for you to do.
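The same bytes-versus-str distinction exists in the stdlib `re` module, so the effect of encoding once up front can be sketched there (the exact savings inside the `re2` wrapper depend on its implementation):

```python
import re

# Encode once, outside the search loop, instead of paying a str->bytes
# conversion (and offset translation) on every call.
pattern = re.compile(rb"\w+")            # bytes pattern
text = "Hello, world!".encode('utf-8')   # bytes haystack

match = pattern.search(text)
print(match.group())  # b'Hello'
```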

commented

@betterlch Application-specific metrics are also important for understanding usage patterns. For example, your Python program passes Python strings, which causes an encoding step and an offset translation in every regex search call. If you instead do:

pattern = r"\w+".encode('utf-8')
text = "Hello, world!".encode('utf-8')

then your RE2 benchmark gets a little faster. The way you use regexes in your application may or may not make this practical for you to do.

Yes, that is a useful detail; I found it with cProfile. After changing my regex, re2 achieved the effect I wanted.

commented

Please read the WhyRE2 wiki page. I agree that performance is a feature; I disagree that performance is the only feature that should be considered when choosing a regular expression library. Note also that application-specific metrics would be vastly preferable to random benchmarks when making a decision if performance is indeed critical.

I agree with you, but because my existing dataset is very small, I chose to randomly generate the string sequences. Adapting my approach to the actual conditions is really what I should do.