beyondgrep / ack2

**ack 2 is no longer being maintained. ack 3 is the latest version.**

Home Page: https://github.com/beyondgrep/ack3/

-w does not interact well with metacharacters

epa opened this issue

-w is documented as matching whole words only, and as working by wrapping the pattern in \b metacharacters. Given this, you would expect two equivalent regexps to give the same behaviour, e.g.

abc
(?:abc)

These are exactly equivalent perl5 regular expressions, but they behave differently with -w:

% mkdir n; cd n
% echo abcde >foo
% ack -w 'abc'
% ack -w '(?:abc)'
foo
1:abcde

The fix is quite simple: at present the code doesn't add the \b anchor at the start if the regexp doesn't start with a word character, and likewise doesn't add it at the end if the regexp doesn't end with one. That doesn't match the documentation, which says that the regexp is wrapped with \b unconditionally. Patch to make behaviour match docs:

diff --git a/ack b/ack
index 38ba811..978d893 100644
--- a/ack
+++ b/ack
@@ -307,11 +307,9 @@ sub build_regex {

     $str = quotemeta( $str ) if $opt->{Q};
     if ( $opt->{w} ) {
-        my $pristine_str = $str;
-
         $str = "(?:$str)";
-        $str = "\\b$str" if $pristine_str =~ /^\w/;
-        $str = "$str\\b" if $pristine_str =~ /\w$/;
+        $str = "\\b$str";
+        $str = "$str\\b";
     }

     my $regex_is_lc = $str eq lc $str;
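
To make the difference concrete, here is a minimal sketch of the two wrapping strategies in the diff above. It is written in Python rather than Perl and is not ack's code: the function names build_current and build_unconditional are hypothetical, and it assumes Python's re module treats \b and \w the same way Perl does for these ASCII inputs.

```python
import re

def build_current(pattern: str) -> str:
    # Pre-patch logic: add \b only on a side where the raw pattern
    # text begins or ends with a word character.
    wrapped = f"(?:{pattern})"
    if re.search(r"^\w", pattern):
        wrapped = r"\b" + wrapped
    if re.search(r"\w$", pattern):
        wrapped = wrapped + r"\b"
    return wrapped

def build_unconditional(pattern: str) -> str:
    # Patched logic: always wrap in \b, as the documentation describes.
    return r"\b(?:" + pattern + r")\b"

line = "abcde"
for pattern in ("abc", "(?:abc)"):
    current_hit = re.search(build_current(pattern), line) is not None
    patched_hit = re.search(build_unconditional(pattern), line) is not None
    print(f"{pattern!r}: current={current_hit} patched={patched_hit}")
```

Under these assumptions, the current logic reports a match for '(?:abc)' but not for 'abc', because the raw text of '(?:abc)' starts with '(' and ends with ')'. The unconditional version rejects both.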

See also #14

The reason this is difficult is that \b matches word boundaries (\w vs \W) instead of whitespace boundaries (\S vs \s). In issue #14, we refer to a failing test that attempts ack -w mu., which would behave unexpectedly on something like mu(: you'd expect it to match, but it wouldn't. I'm not sure what the right solution is here, and I'm not convinced that removing the check is the right thing to do.
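
The mu. case can be illustrated with a short sketch. It uses Python's re module as a stand-in for Perl, on the assumption that \b, \S and the lookaround assertions behave the same for this ASCII input. The three patterns are written out by hand for illustration, and the whitespace-boundary form is only one possible way to express that idea, not something ack implements.

```python
import re

line = "mu("

# Unconditional \b on both sides, as the documentation describes.
word_bounded = r"\b(?:mu.)\b"

# Whitespace boundaries: no non-space character on either side.
space_bounded = r"(?<!\S)(?:mu.)(?!\S)"

# What the current code builds for 'mu.': a leading \b only, because
# the pattern text ends in '.', which is not a word character.
current = r"\b(?:mu.)"

word_hit = re.search(word_bounded, line)
space_hit = re.search(space_bounded, line)
current_hit = re.search(current, line)

print("word boundaries:      ", bool(word_hit))
print("whitespace boundaries:", bool(space_hit))
print("current behaviour:    ", bool(current_hit))
```

With \b on both sides the match fails, because '(' followed by end of line is not a word boundary. The whitespace-boundary form and the current conditional form both match.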

Right, but currently the documentation explicitly says that \b is used, which implies the usual \b semantics of matching word boundaries rather than whitespace boundaries. So I would say that the test with mu. is incorrect, in that it is expecting something different to what the documentation says.

Let's fix the documentation and the test suite to be consistent with each other, and then we have some hope of making the code consistent with both...

Sounds good to me!

Hi, any update on this? I believe it is still outstanding in 2.14 despite this bug being added to the 2.14 milestone.

Someone needs to fix the differences between the docs and the tests. If you'd like to do that and submit a pull request, I'm interested.

Here's a fix which I think makes the behaviour consistent with the test suite, the documentation consistent with the behaviour, and (IMHO) matches the most common user expectations. Sorry, I am still fiddling with my GitHub account so I wasn't able to make a pull request, but it is a patch to one file:

diff --git a/ack b/ack
index 3dee0b0..89d6e74 100644
--- a/ack
+++ b/ack
@@ -316,11 +316,9 @@ sub build_regex {

     $str = quotemeta( $str ) if $opt->{Q};
     if ( $opt->{w} ) {
-        my $pristine_str = $str;
-
         $str = "(?:$str)";
-        $str = "\\b$str" if $pristine_str =~ /^\w/;
-        $str = "$str\\b" if $pristine_str =~ /\w$/;
+        $str = "(?:\\b|(?!\\w))$str";
+        $str = "$str(?:\\b|(?<!\\w))";
     }

     my $regex_is_lc = $str eq lc $str;
@@ -1524,8 +1522,10 @@ Display version and copyright information.

 =item B<-w>, B<--word-regexp>

-Force PATTERN to match only whole words.  The PATTERN is wrapped with
-C<\b> metacharacters.
+Match a whole word only.  In more detail: if the match begins with a
+word character, then there must not be a word character immediately
+before the match.  If the match ends with a word character, there must
+not be a word character immediately after the match.

 =item B<-x>
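
As a rough check of the wrapping proposed in this patch, the sketch below rebuilds the same regex shape in Python. It is an illustration only, not ack's code: build_word_regex is a hypothetical name, and it assumes Python's re module handles \b, (?!\w) and (?<!\w) the same way Perl does for these ASCII inputs.

```python
import re

def build_word_regex(pattern: str) -> str:
    # Leading guard: either a word boundary, or the match does not
    # start with a word character.
    lead = r"(?:\b|(?!\w))"
    # Trailing guard: either a word boundary, or the match does not
    # end with a word character.
    trail = r"(?:\b|(?<!\w))"
    return f"{lead}(?:{pattern}){trail}"

cases = [
    ("abc", "abcde"),        # partial word
    ("(?:abc)", "abcde"),    # equivalent pattern
    ("mu.", "mu("),          # the case from the test suite
    ("mu.", "emu("),         # word character right before the match
    ("abc", "an abc here"),  # a genuine whole word
]

for pattern, line in cases:
    hit = re.search(build_word_regex(pattern), line)
    print(f"{pattern!r} on {line!r}: {'match' if hit else 'no match'}")
```

Under these assumptions, 'abc' and '(?:abc)' now agree on 'abcde' (neither matches), while 'mu.' still matches 'mu(' as the test suite expects. This lines up with the revised documentation text: the guards only apply on a side where the match itself begins or ends with a word character.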

Hi, have you had a chance to look at the patch?

@epa The patch seems solid to me; I know @petdance wants a sort of code freeze for 2.16, though.

Thanks for reviewing, @hoelzro. I hope the patch can still be applied for 2.16 since it is self-contained and only affects the -w flag.

@hoelzro, thanks for reviewing the patch back in January. I have an updated version of the patch at #558 - do you think you could have a look at that one? Let me know if anything further is needed.

@epa Sorry this has taken so long to review. I'm going to pull @petdance into #558, since he has the final say when it comes to changing behavior that users may be relying on.

This will have to be fixed in ack 3.

This has been fixed in ack3. This behavior won't change in ack2.