stringr functions not working for strings with \r or \n

Question

Athospd opened this issue a year ago · comments

Athos Petri Damiani commented a year ago

Is it a bug?
PS: str_detect() has this behavior too.

pattern <- "@.*@" 
input <- "NOT WANTED @ WANTED\r @ NOT WANTED"

# stringr
stringr::str_extract_all(input, pattern)
#> [[1]]
#> character(0)

# idiomatic (base R)
regmatches(input, regexpr(pattern, input))
#> [1] "@ WANTED\r @"


pattern <- "@.*@" 
input <- "NOT WANTED @ WANTED\n @ NOT WANTED"

# stringr
stringr::str_extract_all(input, pattern)
#> [[1]]
#> character(0)

# idiomatic (base R)
regmatches(input, regexpr(pattern, input))
#> [1] "@ WANTED\n @"

Created on 2023-07-10 with reprex v2.0.2

Marek Gagolewski · Answer 1 · Tue Jul 11 2023 06:50:24 GMT+0800 (China Standard Time)

The regex dot does not match a newline character by default, but see the 'dotall' flag/setting in the stringi paper https://www.jstatsoft.org/article/view/v103i02
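As a minimal sketch of that setting, assuming the stringi package is installed and reusing the `pattern` and `input` from the question (the variable names `default_match` and `dotall_match` are only illustrative): with ICU's defaults the dot stops at the line terminator, and enabling `dotall` through `stri_opts_regex()` lets it match.

```r
library(stringi)

pattern <- "@.*@"
input <- "NOT WANTED @ WANTED\n @ NOT WANTED"

# Default ICU behaviour: the dot stops at the line terminator,
# so there is no "@...@" span to extract and the result is NA.
default_match <- stri_extract_first_regex(input, pattern)

# With dotall enabled, the dot also matches "\n".
dotall_match <- stri_extract_first_regex(
  input, pattern,
  opts_regex = stri_opts_regex(dotall = TRUE)
)

print(default_match)
print(dotall_match)
```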

Athos Petri Damiani · Answer 2 · Tue Jul 11 2023 16:08:39 GMT+0800 (China Standard Time)

But "@[^(@@@@)]*@" makes it work again.
How is it not a bug? I couldn't find the answer in this link.

Marek Gagolewski · Answer 3 · Tue Jul 11 2023 16:54:52 GMT+0800 (China Standard Time)

The meaning of [^(@@@@)] is: any character except (, @, and ). This includes the newline.
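A short sketch of that point, assuming stringr is installed (the second test string `"@ a (b) @"` is made up for illustration): the negated class behaves like `[^@]` on the original input, and the parentheses inside it are literal members of the class rather than a group.

```r
library(stringr)

input <- "NOT WANTED @ WANTED\n @ NOT WANTED"

# A negated class matches any character not listed, line terminators included.
class_match <- str_extract(input, "@[^(@@@@)]*@")

# The repeated "@" is redundant; for this input "[^@]" extracts the same text.
simple_match <- str_extract(input, "@[^@]*@")

# "(" and ")" are excluded by the class, so a "(" between the two "@"
# signs prevents a match (hypothetical input for illustration).
paren_match <- str_extract("@ a (b) @", "@[^(@@@@)]*@")

print(class_match)
print(simple_match)
print(paren_match)
```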

Another good tutorial on regexes is https://www.regular-expressions.info/

Athos Petri Damiani · Answer 4 · Tue Jul 11 2023 21:06:05 GMT+0800 (China Standard Time)

But they are behaving differently. Could I ask you for a more specific explanation? Those links are too vague.

They should return the same output, but they are not. Don't you agree with that?

With base R, things work differently from stringr. In my mind the logic is: the regex pattern is the same, so it should return the same output. Where am I getting it wrong?

Marek Gagolewski · Answer 5 · Wed Jul 12 2023 09:50:26 GMT+0800 (China Standard Time)

If you pass perl=TRUE to the R regex functions, you will get the same behaviour, i.e., . not matching a newline by default. Regexes are a powerful tool, but the regex engines differ from each other. This is how they are implemented; it is part of their specification.
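To make the comparison concrete, here is a small base-R sketch using the "\n" input from the question (no packages needed; the result names are illustrative). It only covers "\n"; which other characters count as a newline can depend on the engine and how it was built.

```r
pattern <- "@.*@"
input <- "NOT WANTED @ WANTED\n @ NOT WANTED"

# TRE (perl = FALSE, the default): the dot matches "\n".
tre_match <- regmatches(input, regexpr(pattern, input))

# PCRE (perl = TRUE): the dot stops at "\n", so nothing is matched.
pcre_match <- regmatches(input, regexpr(pattern, input, perl = TRUE))

# PCRE with the inline (?s) flag: the dot matches "\n" again.
pcre_dotall <- regmatches(
  input,
  regexpr(paste0("(?s)", pattern), input, perl = TRUE)
)

print(tre_match)
print(pcre_match)
print(pcre_dotall)
```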

For ICU regexes (used in stringi and hence stringr), see https://unicode-org.github.io/icu/userguide/strings/regexp.html

For PCRE regexes (perl=TRUE) in base R, see https://www.pcre.org/current/doc/html/pcre2pattern.html

For TRE regexes (perl=FALSE - default), see https://github.com/laurikari/tre/ -- but these are not particularly well-documented.

I would rather say it is TRE that does things differently, not ICU/PCRE.

Even Python regexes (https://docs.python.org/3/howto/regex.html) have the DOTALL distinction.

HTH

Athos Petri Damiani · Answer 6 · Wed Jul 12 2023 15:39:06 GMT+0800 (China Standard Time)

Hum interesting, thank you for this. I'm closing the issue!

Hadley Wickham · Answer 7 · Mon Aug 07 2023 23:51:29 GMT+0800 (China Standard Time)

Also see the dotall argument to regex().
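For anyone landing here later, a minimal sketch of that suggestion, assuming stringr is installed and reusing the two inputs from the original report (`inputs`, `detected`, and `matches` are illustrative names):

```r
library(stringr)

pattern <- "@.*@"
inputs <- c(
  "NOT WANTED @ WANTED\r @ NOT WANTED",
  "NOT WANTED @ WANTED\n @ NOT WANTED"
)

# Default: the dot does not cross "\r" or "\n", so neither input matches.
detected <- str_detect(inputs, pattern)

# Wrapping the pattern in regex(dotall = TRUE) lets the dot match them.
matches <- str_extract(inputs, regex(pattern, dotall = TRUE))

print(detected)
print(matches)
```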