hamcrest / JavaHamcrest

Java (and original) version of Hamcrest

Home Page:http://hamcrest.org/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

IsBlankString matcher is inconsistent with String.isBlank()

Thorn1089 opened this issue · comments

The naming of the matcher implies it would give consistent results with String.isBlank(). However, it does not. The matcher uses a regex with pattern \\s* which will only match ASCII whitespace characters. String.isBlank() eventually delegates to Character.isWhitespace() which checks for Unicode whitespace as well. See https://stackoverflow.com/a/57998178 for a detailed breakdown, summarized below:

But further, there are different definitions of white-space

Character.isWhitespace:

Determines if the specified character (Unicode code point) is white space according to Java. A character is a Java whitespace character if and only if it satisfies one of the following criteria:

It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
It is '\t', U+0009 HORIZONTAL TABULATION.
It is '\n', U+000A LINE FEED.
It is '\u000B', U+000B VERTICAL TABULATION.
It is '\f', U+000C FORM FEED.
It is '\r', U+000D CARRIAGE RETURN.
It is '\u001C', U+001C FILE SEPARATOR.
It is '\u001D', U+001D GROUP SEPARATOR.
It is '\u001E', U+001E RECORD SEPARATOR.
It is '\u001F', U+001F UNIT SEPARATOR.
the \s pattern (by default):

\s A whitespace character: [ \t\n\x0B\f\r]

I discovered this when using JUnit Quickcheck with an assumption to prevent handling whitespace strings of the form:
Assume.assumeThat(string, not(blankString()));
The assumption passed, and my code would check the string via String.isBlank() and fail the test.

For those of us with more experience it doesn't imply anything at all, because "blank" is such an ambiguous concept that any method in a library claiming to detect "blankness" is immediately banned for use by anyone who cares about what they're letting into their project.

My own Hamcrest cheat sheet marks them with skull marks for this reason:

In fact, "blank" does have a precise meaning in the Java ecosystem as I pointed out in my original post -- it should be consistent with the isBlank and isWhitespace methods of String and Character respectively.

I did learn from the associated PR that Hamcrest still supports older versions of Java that predate the String#isBlank convenience method; however, Character#isWhitespace has been present since 1.5 for Unicode code points and 1.1 for char values. I think it entirely appropriate that the IsBlank matcher be consistent with those results.

If you don't want to use the IsBlank matcher in your code, that's your decision, but it is a non-deprecated part of the Hamcrest library and as such I'd consider this a deficiency that can be fixed.

No, in the "Java ecosystem" you speak of, there are multiple definitions of whitespace. Just because you only know of one of them does not mean that the others do not exist.

Kevin Bourrillion compiled this spreadsheet:

https://docs.google.com/spreadsheets/d/1kq4ECwPjHX9B8QUCTPclgsDCXYaj7T-FlT4tB5q3ahk/edit#gid=0

You're right that it's a deficiency that can be fixed, but perhaps it should be fixed by removing the method.

Note that String.isBlank() was only introduced relatively recently in Java 11.
The Hamcrest sources predate that, and still build with Java 1.7 at the time of writing.

@peterdemaeyer understood, and I have updated the linked PR to remain compatible with that JDK.

I do think that since we're currently in rampdown for JDK16, that it's reasonable that devs will start using String#isBlank (I could have sworn this made it in in JDK8 for some reason -- guess JDK11 has been out even longer than I recalled at first) and make the same mistake.

As mentioned previously, Character#isWhitespace (which is what the convenience method ultimately boils down to using) dates back to 1.5, so I think it's reasonable that the implementation use that rather than the regex pattern \s (which I believe is ASCII-only to match PCRE and other older regex engines).