tidyverse / stringr

A fresh approach to string manipulation in R

Home Page: https://stringr.tidyverse.org

stringr functions not working for strings with \r or \n

Athospd opened this issue

Is it a bug?
PS: str_detect() has this behavior too.

pattern <- "@.*@" 
input <- "NOT WANTED @ WANTED\r @ NOT WANTED"

# stringr
stringr::str_extract_all(input, pattern)
#> [[1]]
#> character(0)

# idiomatic (base R)
regmatches(input, regexpr(pattern, input))
#> [1] "@ WANTED\r @"


pattern <- "@.*@" 
input <- "NOT WANTED @ WANTED\n @ NOT WANTED"

# stringr
stringr::str_extract_all(input, pattern)
#> [[1]]
#> character(0)

# idiomatic (base R)
regmatches(input, regexpr(pattern, input))
#> [1] "@ WANTED\n @"

Created on 2023-07-10 with reprex v2.0.2

The regex dot does not match a newline character by default in stringr; see the 'dotall' flag/setting in the stringi paper: https://www.jstatsoft.org/article/view/v103i02

But "@[^(@@@@)]*@" makes it work again.
How is it not a bug? I couldn't find the answer in this link.

The meaning of [^(@@@@)] is: any character except (, @, and ). This includes the newline.
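
To make the character-class point concrete, here is a minimal sketch in base R. It uses perl = TRUE so that the engine is one where the dot excludes newlines; the variable names are only for illustration. Repeating @ inside the brackets changes nothing, and the negated class accepts "\n" like any other unlisted character.

```r
# Single-character test strings: the three excluded characters,
# a newline, and an ordinary letter.
chars <- c("(", "@", ")", "\n", "a")

# Both classes mean "any character except '(', '@' and ')'".
repeated_class <- grepl("[^(@@@@)]", chars, perl = TRUE)
short_class    <- grepl("[^(@)]", chars, perl = TRUE)
print(repeated_class)
print(short_class)

# Because the class accepts "\n", the match can span the line break.
input <- "NOT WANTED @ WANTED\n @ NOT WANTED"
class_match <- regmatches(input, regexpr("@[^(@@@@)]*@", input, perl = TRUE))
print(class_match)
```

So the pattern works because the character class never relies on the dot, not because the engines agree on how the dot behaves.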

Another good tutorial on regexes is https://www.regular-expressions.info/

But they are behaving differently. Could I ask you for a more specific explanation? Those links are too vague.

They should return the same output, but they do not. Don't you agree with that?

Base R works differently from stringr here. In my mind the logic is: the regex pattern is the same, so it should return the same output. Where am I getting it wrong?

If you pass perl=TRUE to the R regex functions, you will get the same behaviour, i.e., . not matching a newline by default. Regexes are a powerful tool, but the regex engines differ from each other. This is how they are implemented; it is part of their specification.
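
As a rough illustration of the engine difference, the sketch below runs the "\n" input from the original report through base R with both engines. It assumes a standard R build in which PCRE uses its default newline setting; the variable names are hypothetical.

```r
pattern <- "@.*@"
input <- "NOT WANTED @ WANTED\n @ NOT WANTED"

# Default engine (TRE): the dot matches "\n", so the span is found.
tre_match <- regmatches(input, regexpr(pattern, input))

# PCRE (perl = TRUE): the dot stops at "\n", so nothing matches.
pcre_match <- regmatches(input, regexpr(pattern, input, perl = TRUE))

# PCRE with the inline (?s) flag: the dot matches "\n" again.
pcre_dotall <- regmatches(input, regexpr("(?s)@.*@", input, perl = TRUE))

print(tre_match)
print(pcre_match)
print(pcre_dotall)
```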

For ICU regexes (used in stringi and hence stringr), see https://unicode-org.github.io/icu/userguide/strings/regexp.html

For PCRE regexes (perl=TRUE) in base R, see https://www.pcre.org/current/doc/html/pcre2pattern.html

For TRE regexes (perl=FALSE - default), see https://github.com/laurikari/tre/ -- but these are not particularly well-documented.

I would rather say it is TRE that does things differently, not ICU/PCRE

Even Python regexes (https://docs.python.org/3/howto/regex.html) have the DOTALL distinction.

HTH

Hmm, interesting, thank you for this. I'm closing the issue!

Also see the dotall argument to regex().
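
For reference, a minimal sketch of that suggestion applied to the "\n" input from the original report, assuming stringr is installed. The inline (?s) form is an alternative way to set the same ICU flag; the variable names are only for illustration.

```r
library(stringr)

pattern <- "@.*@"
input <- "NOT WANTED @ WANTED\n @ NOT WANTED"

# Default: the dot does not match "\n", so no match is found.
default_match <- str_extract_all(input, pattern)

# dotall = TRUE lets the dot match line terminators as well.
dotall_match <- str_extract_all(input, regex(pattern, dotall = TRUE))

# Equivalent inline flag in the pattern itself.
inline_match <- str_extract_all(input, "(?s)@.*@")

print(default_match)
print(dotall_match)
print(inline_match)
```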