google / re2

RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.

Question on RE2::Match latency for patterns containing a large number of match groups

huajosep opened this issue · comments

We recently observed that RE2::Match latency can be quite high when it is executed on a long string with a pattern containing 40+ match groups.

Please consider the example below.

Input string:

input "2024-01-22_09:15:30 example.net 1|_x-columnName1=ExampleValue1 columnName2=ExampleValue2 columnName3=ExampleValue3 columnName4=ExampleValue4 columnName5=ExampleValue5 columnName6=ExampleValue6 columnName7=ExampleValue7 columnName8=ExampleValue8 columnName9=ExampleValue9 columnName10=ExampleValue10 columnName11=ExampleValue11 columnName12=ExampleValue12 columnName13=ExampleValue13 columnName14=ExampleValue14 columnName15=ExampleValue15 columnName16=ExampleValue16 columnName17=ExampleValue17 columnName18=ExampleValue18 columnName19=ExampleValue19 columnName20=ExampleValue20 columnName21=ExampleValue21 columnName22=ExampleValue22 columnName23=ExampleValue23 columnName24=ExampleValue24 columnName25=ExampleValue25 columnName26=ExampleValue26 columnName27=ExampleValue27 columnName28=ExampleValue28 columnName29=ExampleValue29 columnName30=ExampleValue30 columnName31=ExampleValue31 columnName32=ExampleValue32 columnName33=ExampleValue33 columnName34=ExampleValue34 columnName35=ExampleValue35 columnName36=ExampleValue36 columnName37=ExampleValue37 columnName38=ExampleValue38 columnName39=ExampleValue39 columnName40=ExampleValue40 columnName41=ExampleValue41 columnName42=ExampleValue42 columnName43=ExampleValue43 columnName44=ExampleValue44 columnName45=ExampleValue45 columnName46=ExampleValue46 columnName47=ExampleValue47 columnName48=ExampleValue48 columnName49=ExampleValue49 columnName50=ExampleValue50 columnName51=ExampleValue51 columnName52=ExampleValue52 columnName53=ExampleValue53 columnName54=ExampleValue54 columnName55=ExampleValue55 columnName56=ExampleValue56 columnName57=ExampleValue57 columnName58=ExampleValue58 columnName59=ExampleValue59 columnName60=ExampleValue60 columnName61=ExampleValue61 columnName62=ExampleValue62 columnName63=ExampleValue63 columnName64=ExampleValue64 columnName65=ExampleValue65 columnName66=ExampleValue66 columnName67=ExampleValue67 columnName68=ExampleValue68 columnName69=ExampleValue69 columnName70=ExampleValue70 
columnName71=ExampleValue71 columnName72=ExampleValue72 columnName73=ExampleValue73 columnName74=ExampleValue74 columnName75=ExampleValue75 columnName76=ExampleValue76 columnName77=ExampleValue77 columnName78=ExampleValue78 columnName79=ExampleValue79 columnName80=ExampleValue80 columnName81=ExampleValue81 columnName82=ExampleValue82 columnName83=ExampleValue83 columnName84=ExampleValue84 columnName85=ExampleValue85 columnName86=ExampleValue86 columnName87=ExampleValue87 columnName88=ExampleValue88 columnName89=ExampleValue89 columnName90=ExampleValue90 columnName91=ExampleValue91 columnName92=ExampleValue92 columnName93=ExampleValue93 columnName94=ExampleValue94 columnName95=ExampleValue95 columnName96=ExampleValue96 columnName97=ExampleValue97 columnName98=ExampleValue98 columnName99=ExampleValue99 columnName100=ExampleValue100"

When we tested with a single pattern containing 100 match groups, the latency was 10357 ms.

pattern = (.*?)\Q \E(.*?)\Q1|_x-columnName1=\E(.*?)\QcolumnName2=\E(.*?)\QcolumnName3=\E(.*?)\QcolumnName4=\E(.*?)\QcolumnName5=\E(.*?)\QcolumnName6=\E(.*?)\QcolumnName7=\E(.*?)\QcolumnName8=\E(.*?)\QcolumnName9=\E(.*?)\QcolumnName10=\E(.*?)\QcolumnName11=\E(.*?)\QcolumnName12=\E(.*?)\QcolumnName13=\E(.*?)\QcolumnName14=\E(.*?)\QcolumnName15=\E(.*?)\QcolumnName16=\E(.*?)\QcolumnName17=\E(.*?)\QcolumnName18=\E(.*?)\QcolumnName19=\E(.*?)\QcolumnName20=\E(.*?)\QcolumnName21=\E(.*?)\QcolumnName22=\E(.*?)\QcolumnName23=\E(.*?)\QcolumnName24=\E(.*?)\QcolumnName25=\E(.*?)\QcolumnName26=\E(.*?)\QcolumnName27=\E(.*?)\QcolumnName28=\E(.*?)\QcolumnName29=\E(.*?)\QcolumnName30=\E(.*?)\QcolumnName31=\E(.*?)\QcolumnName32=\E(.*?)\QcolumnName33=\E(.*?)\QcolumnName34=\E(.*?)\QcolumnName35=\E(.*?)\QcolumnName36=\E(.*?)\QcolumnName37=\E(.*?)\QcolumnName38=\E(.*?)\QcolumnName39=\E(.*?)\QcolumnName40=\E(.*?)\QcolumnName41=\E(.*?)\QcolumnName42=\E(.*?)\QcolumnName43=\E(.*?)\QcolumnName44=\E(.*?)\QcolumnName45=\E(.*?)\QcolumnName46=\E(.*?)\QcolumnName47=\E(.*?)\QcolumnName48=\E(.*?)\QcolumnName49=\E(.*?)\QcolumnName50=\E(.*?)\QcolumnName51=\E(.*?)\QcolumnName52=\E(.*?)\QcolumnName53=\E(.*?)\QcolumnName54=\E(.*?)\QcolumnName55=\E(.*?)\QcolumnName56=\E(.*?)\QcolumnName57=\E(.*?)\QcolumnName58=\E(.*?)\QcolumnName59=\E(.*?)\QcolumnName60=\E(.*?)\QcolumnName61=\E(.*?)\QcolumnName62=\E(.*?)\QcolumnName63=\E(.*?)\QcolumnName64=\E(.*?)\QcolumnName65=\E(.*?)\QcolumnName66=\E(.*?)\QcolumnName67=\E(.*?)\QcolumnName68=\E(.*?)\QcolumnName69=\E(.*?)\QcolumnName70=\E(.*?)\QcolumnName71=\E(.*?)\QcolumnName72=\E(.*?)\QcolumnName73=\E(.*?)\QcolumnName74=\E(.*?)\QcolumnName75=\E(.*?)\QcolumnName76=\E(.*?)\QcolumnName77=\E(.*?)\QcolumnName78=\E(.*?)\QcolumnName79=\E(.*?)\QcolumnName80=\E(.*?)\QcolumnName81=\E(.*?)\QcolumnName82=\E(.*?)\QcolumnName83=\E(.*?)\QcolumnName84=\E(.*?)\QcolumnName85=\E(.*?)\QcolumnName86=\E(.*?)\QcolumnName87=\E(.*?)\QcolumnName88=\E(.*?)\QcolumnName89=\E(.*?)\QcolumnName90=\E(.*?)\QcolumnName91=\E(.*?)\QcolumnName92=\E(.*?)\QcolumnName93=\E(.*?)\QcolumnName94=\E(.*?)\QcolumnName95=\E(.*?)\QcolumnName96=\E(.*?)\QcolumnName97=\E(.*?)\QcolumnName98=\E(.*?)\QcolumnName99=\E(.*?)\QcolumnName100=\E(.*)

However, after breaking the above pattern into multiple sub-patterns, each with 10 match groups, the latency dropped to 1261 ms.

pattern1 = (.*?)\Q \E(.*?)\Q1|_x-columnName1=\E(.*?)\QcolumnName2=\E(.*?)\QcolumnName3=\E(.*?)\QcolumnName4=\E(.*?)\QcolumnName5=\E(.*?)\QcolumnName6=\E(.*?)\QcolumnName7=\E(.*?)\QcolumnName8=\E(.*?)\QcolumnName9=\E(.*?)\QcolumnName10=\E(.*?)\QcolumnName11\E
pattern2 = \QcolumnName11=\E(.*?)\QcolumnName12=\E(.*?)\QcolumnName13=\E(.*?)\QcolumnName14=\E(.*?)\QcolumnName15=\E(.*?)\QcolumnName16=\E(.*?)\QcolumnName17=\E(.*?)\QcolumnName18=\E(.*?)\QcolumnName19=\E(.*?)\QcolumnName20=\E(.*?)\QcolumnName21\E
pattern3 = \QcolumnName21=\E(.*?)\QcolumnName22=\E(.*?)\QcolumnName23=\E(.*?)\QcolumnName24=\E(.*?)\QcolumnName25=\E(.*?)\QcolumnName26=\E(.*?)\QcolumnName27=\E(.*?)\QcolumnName28=\E(.*?)\QcolumnName29=\E(.*?)\QcolumnName30=\E(.*?)\QcolumnName31\E
pattern4 = \QcolumnName31=\E(.*?)\QcolumnName32=\E(.*?)\QcolumnName33=\E(.*?)\QcolumnName34=\E(.*?)\QcolumnName35=\E(.*?)\QcolumnName36=\E(.*?)\QcolumnName37=\E(.*?)\QcolumnName38=\E(.*?)\QcolumnName39=\E(.*?)\QcolumnName40=\E(.*?)\QcolumnName41\E
pattern5 = \QcolumnName41=\E(.*?)\QcolumnName42=\E(.*?)\QcolumnName43=\E(.*?)\QcolumnName44=\E(.*?)\QcolumnName45=\E(.*?)\QcolumnName46=\E(.*?)\QcolumnName47=\E(.*?)\QcolumnName48=\E(.*?)\QcolumnName49=\E(.*?)\QcolumnName50=\E(.*?)\QcolumnName51\E
pattern6 = \QcolumnName51=\E(.*?)\QcolumnName52=\E(.*?)\QcolumnName53=\E(.*?)\QcolumnName54=\E(.*?)\QcolumnName55=\E(.*?)\QcolumnName56=\E(.*?)\QcolumnName57=\E(.*?)\QcolumnName58=\E(.*?)\QcolumnName59=\E(.*?)\QcolumnName60=\E(.*?)\QcolumnName61\E
pattern7 = \QcolumnName61=\E(.*?)\QcolumnName62=\E(.*?)\QcolumnName63=\E(.*?)\QcolumnName64=\E(.*?)\QcolumnName65=\E(.*?)\QcolumnName66=\E(.*?)\QcolumnName67=\E(.*?)\QcolumnName68=\E(.*?)\QcolumnName69=\E(.*?)\QcolumnName70=\E(.*?)\QcolumnName71\E
pattern8 = \QcolumnName71=\E(.*?)\QcolumnName72=\E(.*?)\QcolumnName73=\E(.*?)\QcolumnName74=\E(.*?)\QcolumnName75=\E(.*?)\QcolumnName76=\E(.*?)\QcolumnName77=\E(.*?)\QcolumnName78=\E(.*?)\QcolumnName79=\E(.*?)\QcolumnName80=\E(.*?)\QcolumnName81\E
pattern9 = \QcolumnName81=\E(.*?)\QcolumnName82=\E(.*?)\QcolumnName83=\E(.*?)\QcolumnName84=\E(.*?)\QcolumnName85=\E(.*?)\QcolumnName86=\E(.*?)\QcolumnName87=\E(.*?)\QcolumnName88=\E(.*?)\QcolumnName89=\E(.*?)\QcolumnName90=\E(.*?)\QcolumnName91\E
pattern10 = \QcolumnName91=\E(.*?)\QcolumnName92=\E(.*?)\QcolumnName93=\E(.*?)\QcolumnName94=\E(.*?)\QcolumnName95=\E(.*?)\QcolumnName96=\E(.*?)\QcolumnName97=\E(.*?)\QcolumnName98=\E(.*?)\QcolumnName99=\E(.*?)\QcolumnName100=\E(.*)

The type of match we use is RE2::Anchor::UNANCHORED.

Is such latency expected? Shouldn't the complexity for both cases be linear? We are using an older version of RE2 (from 2018). Are you aware of any recent changes that might have already addressed this issue?

IMO, this is an egregious misuse of regular expressions. Between that and the ~6yo version of RE2, it wouldn't be reasonable for me to spend time (either professionally or personally) digging into why the linear-time constant seems to be larger for the single pattern. There are better (i.e. more readable, more maintainable, more efficient) ways of parsing data in such a format. (The data would be in a better format, ideally, but that may or may not be within your control.)
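As one illustration of the kind of alternative meant here, below is a minimal C++17 sketch that extracts the same fields by scanning for the literal delimiters in order, with no regular expression involved. `SplitOnDelimiters` is a hypothetical helper name, not part of RE2, and the sketch assumes the pattern always has the shape "lazy group, literal, lazy group, literal, ..., final group", as in the example above.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of RE2). Walks `input` once, looking for each
// literal delimiter in order. Returns the text found before each delimiter,
// followed by the remaining tail, or std::nullopt if a delimiter is missing.
// The returned views point into `input`, so `input` must outlive them.
std::optional<std::vector<std::string_view>> SplitOnDelimiters(
    std::string_view input, const std::vector<std::string>& delimiters) {
  std::vector<std::string_view> fields;
  fields.reserve(delimiters.size() + 1);
  std::size_t pos = 0;
  for (const std::string& delim : delimiters) {
    // First occurrence at or after `pos`, mirroring a lazy (.*?) group
    // followed by a quoted literal.
    const std::size_t hit = input.find(delim, pos);
    if (hit == std::string_view::npos) return std::nullopt;
    fields.push_back(input.substr(pos, hit - pos));
    pos = hit + delim.size();
  }
  // Everything after the last delimiter plays the role of the final (.*).
  fields.push_back(input.substr(pos));
  return fields;
}
```

For the example in this issue, the delimiter list would be `" "`, `"1|_x-columnName1="`, `"columnName2="`, and so on up to `"columnName100="`. Each field keeps any trailing space, just as the lazy groups would capture it. One difference to keep in mind: `.` in RE2 does not match a newline by default, whereas this scan does not treat newlines specially.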

Thank you for the quick response. Unfortunately, we don't have much control over either the data or the pattern (parse query), as both come from our customers. We agree that such a pattern is not ideal. To us, it appears that the latency grows exponentially with the number of match groups in the pattern. If you can confirm that RE2::Match always maintains linear performance in the latest version of RE2, even for the edge case we encountered, it would greatly help us determine whether upgrading RE2 is a viable solution.

If you have a test case you should be able to try the latest RE2 yourself.

Also, it seems unsound to draw conclusions about asymptotic complexity from two (2) data points. For now, my guess is that the DFA and NFA execution engines incur overhead (i.e. increase the linear-time constant) due to the large number of (.*?) subexpressions. (Specifically, combining so much ambiguity with so many capturing groups would amplify cost significantly.) Updating the version of RE2 could be a mitigation, but considering it a solution would be a categorical mistake.
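To collect more than two data points, one option is to keep the input fixed, generate patterns with a varying number of groups, time each match, and look at the slope on a log-log scale. The C++17 sketch below covers only the pattern generation and the slope estimate. `BuildPattern` and `LogLogSlope` are hypothetical helpers, the generated pattern is a simplified version of the one in this issue, and the timing of the actual RE2 call (for example with `std::chrono::steady_clock`) is left to the reader.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of RE2). Builds a simplified pattern of the
// form (.*?)\QcolumnName1=\E(.*?)\QcolumnName2=\E...(.*) with n literals.
std::string BuildPattern(int n) {
  std::string pattern;
  for (int i = 1; i <= n; ++i) {
    pattern += "(.*?)\\QcolumnName" + std::to_string(i) + "=\\E";
  }
  pattern += "(.*)";
  return pattern;
}

// Hypothetical helper. Least-squares slope of log(time) against log(n) over
// samples of (n, time). Requires at least two samples with distinct n, and
// all values strictly positive. A slope near 1 suggests linear growth in n,
// near 2 quadratic growth; a slope that keeps increasing as larger n are
// added suggests growth faster than any fixed polynomial.
double LogLogSlope(const std::vector<std::pair<double, double>>& samples) {
  double sum_x = 0, sum_y = 0, sum_xx = 0, sum_xy = 0;
  for (const auto& [n, time] : samples) {
    const double x = std::log(n);
    const double y = std::log(time);
    sum_x += x;
    sum_y += y;
    sum_xx += x * x;
    sum_xy += x * y;
  }
  const double k = static_cast<double>(samples.size());
  return (k * sum_xy - sum_x * sum_y) / (k * sum_xx - sum_x * sum_x);
}
```

Measuring at, say, 10, 20, 40, 60, 80 and 100 groups against the same input, and repeating each measurement several times, would give a much firmer basis than the two figures quoted above.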