-w does not interact well with metacharacters
epa opened this issue
-w is documented as matching whole words only, and as working by wrapping the pattern in \b metacharacters. Given this, you would expect two equivalent regexps to behave the same way, e.g.
abc
(?:abc)
These are exactly equivalent Perl 5 regular expressions, but they behave differently with -w:
% mkdir n; cd n
% echo abcde >foo
% ack -w 'abc'
% ack -w '(?:abc)'
foo
1:abcde
The fix is quite simple. At present the code omits the \b anchor at the start if the regexp doesn't start with a word character, and likewise omits it at the end if the regexp doesn't end with one. That doesn't match the documentation, which says that the regexp is wrapped with \b unconditionally. Here is a patch to make the behaviour match the docs:
diff --git a/ack b/ack
index 38ba811..978d893 100644
--- a/ack
+++ b/ack
@@ -307,11 +307,9 @@ sub build_regex {
$str = quotemeta( $str ) if $opt->{Q};
if ( $opt->{w} ) {
- my $pristine_str = $str;
-
$str = "(?:$str)";
- $str = "\\b$str" if $pristine_str =~ /^\w/;
- $str = "$str\\b" if $pristine_str =~ /\w$/;
+ $str = "\\b$str";
+ $str = "$str\\b";
}
my $regex_is_lc = $str eq lc $str;
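To make the difference concrete, here is a minimal sketch of the two wrapping strategies. It is written in Python rather than Perl, on the assumption that Python's `re` module treats `\b` and `\w` the same way as Perl for plain ASCII input like this. The function names `wrap_conditional` and `wrap_unconditional` are hypothetical and do not appear in ack.

```python
import re


def wrap_conditional(pattern: str) -> str:
    """Pre-patch -w logic: add \\b only next to a leading/trailing word character."""
    wrapped = f"(?:{pattern})"
    if re.search(r"^\w", pattern):
        wrapped = r"\b" + wrapped
    if re.search(r"\w$", pattern):
        wrapped = wrapped + r"\b"
    return wrapped


def wrap_unconditional(pattern: str) -> str:
    """Patched -w logic: always wrap the grouped pattern in \\b."""
    return r"\b(?:" + pattern + r")\b"


line = "abcde"
for pattern in ("abc", "(?:abc)"):
    before = bool(re.search(wrap_conditional(pattern), line))
    after = bool(re.search(wrap_unconditional(pattern), line))
    print(f"{pattern!r}: conditional={before} unconditional={after}")
```

With the conditional logic, `(?:abc)` gets no anchors at all because it starts with `(` and ends with `)`, so it matches inside `abcde`. With unconditional wrapping, both spellings are rejected.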
The reason this is difficult is that \b matches word boundaries (\w vs \W) instead of whitespace boundaries (\S vs \s). In issue #14, we refer to a failing test that attempts ack -w mu., which would produce unexpected behavior on something like mu(, because you'd expect it to match, but it wouldn't. I'm not sure what the right solution is here, but I don't know if removing the check is the right thing to do.
Right, but currently the documentation explicitly says that \b is used, which implies the usual \b semantics of matching word boundaries rather than whitespace boundaries. So I would say that the test with mu. is incorrect, in that it is expecting something different to what the documentation says.
Let's fix the documentation and the test suite to be consistent with each other, and then we have some hope of making the code consistent with both...
Sounds good to me!
Hi, any update on this? I believe it is still outstanding in 2.14 despite this bug being added to the 2.14 milestone.
Someone needs to fix the differences between the docs and the tests. If you'd like to do that and submit a pull request, I'm interested.
Here's a fix which I think makes the behaviour consistent with the test suite, the documentation consistent with the behaviour, and (IMHO) matches the most common user expectations. Sorry, I am still fiddling with my GitHub account so I wasn't able to make a pull request, but it is a patch to one file:
diff --git a/ack b/ack
index 3dee0b0..89d6e74 100644
--- a/ack
+++ b/ack
@@ -316,11 +316,9 @@ sub build_regex {
$str = quotemeta( $str ) if $opt->{Q};
if ( $opt->{w} ) {
- my $pristine_str = $str;
-
$str = "(?:$str)";
- $str = "\\b$str" if $pristine_str =~ /^\w/;
- $str = "$str\\b" if $pristine_str =~ /\w$/;
+ $str = "(?:\\b|(?!\\w))$str";
+ $str = "$str(?:\\b|(?<!\\w))";
}
my $regex_is_lc = $str eq lc $str;
@@ -1524,8 +1522,10 @@ Display version and copyright information.
=item B<-w>, B<--word-regexp>
-Force PATTERN to match only whole words. The PATTERN is wrapped with
-C<\b> metacharacters.
+Match a whole word only. In more detail: if the match begins with a
+word character, then there must not be a word character immediately
+before the match. If the match ends with a word character, there must
+not be a word character immediately after the match.
=item B<-x>
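As a rough check of the lookaround version in this patch, the sketch below builds the same wrapped pattern in Python. This assumes Python's `re` handles `\b`, `(?!\w)` and `(?<!\w)` like Perl does for ASCII text; `wrap_lookaround` and `word_match` are hypothetical helper names, not part of ack.

```python
import re


def wrap_lookaround(pattern: str) -> str:
    """Wrap a pattern the way the second patch does for -w."""
    grouped = f"(?:{pattern})"
    # Start: a word boundary, or the match begins with a non-word character.
    # End: a word boundary, or the match ended with a non-word character.
    return r"(?:\b|(?!\w))" + grouped + r"(?:\b|(?<!\w))"


def word_match(pattern: str, text: str) -> bool:
    """Return True if the wrapped pattern matches anywhere in text."""
    return re.search(wrap_lookaround(pattern), text) is not None


cases = [
    ("abc", "abcde"),
    ("(?:abc)", "abcde"),
    ("abc", "x abc y"),
    ("mu.", "mu("),
    ("mu.", "emu("),
    ("mu.", "muds"),
]
for pattern, text in cases:
    print(f"{pattern!r} in {text!r}: {word_match(pattern, text)}")
```

Under these assumptions the two equivalent spellings `abc` and `(?:abc)` are both rejected inside `abcde`, while `mu.` still matches `mu(` but not `emu(` or `muds`. That is the behaviour the revised documentation text describes.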
Hi, have you had a chance to look at the patch?
Thanks for reviewing, @hoelzro. I hope the patch can still be applied for 2.16 since it is self-contained and only affects the -w flag.
This will have to be fixed in ack 3.
This has been fixed in ack3. This behavior won't change in ack2.