I have a Java regex pattern and a sentence I'd like to match completely, but for some sentences the match erroneously fails. Why is this? (For simplicity, I won't use my complex regex, just ".*".)
System.out.println(Pattern.matches(".*", "asdf"));
System.out.println(Pattern.matches(".*", "[11:04:34] <@Aimbotter> 1 more thing"));
System.out.println(Pattern.matches(".*", "[11:04:35] <@Aimbotter> Dialogue: 0,0:00:00.00,0:00:00.00,Default,{Orginal LV,0000,0000,0000,,[???]??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????} "));
System.out.println(Pattern.matches(".*", "[11:04:35] <@Aimbotter> Dialogue: 0,0:00:00.00,0:00:00.00,Default,{Orginal LV,0000,0000,0000,,[???]????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????} "));
Output:
true
true
true
false
Note that the fourth sentence contains 10 Unicode control characters \u0085 in between the question marks, which aren't shown by normal fonts. The third and fourth sentences actually contain the same number of characters!
Use
Pattern.compile(".*", Pattern.DOTALL)
if you want . to match line terminators such as \u0085. By default, . matches any character except line terminators.
From JavaDoc:
"In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)"
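To make the quoted behaviour concrete, here is a minimal sketch comparing the default mode with dotall mode. The class name DotallDemo, the helper matchesAll, and the sample input are made up for illustration; only Pattern.compile, Pattern.DOTALL, Pattern.matches and the (?s) flag come from the JDK.

```java
import java.util.regex.Pattern;

class DotallDemo {
    // Hypothetical helper: does ".*" match the entire input under the given flags?
    static boolean matchesAll(String input, int flags) {
        return Pattern.compile(".*", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Sample input with a single NEL (U+0085) character in the middle.
        String input = "before\u0085after";

        // Default mode: the dot stops at the line terminator.
        System.out.println(matchesAll(input, 0));              // false
        // Dotall mode via the compile flag.
        System.out.println(matchesAll(input, Pattern.DOTALL)); // true
        // Dotall mode via the embedded flag expression.
        System.out.println(Pattern.matches("(?s).*", input));  // true
    }
}
```

The embedded (?s) form is convenient when you can only pass a pattern string, as with Pattern.matches or String.matches.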
Code in Pattern (there is your \u0085):
/**
 * Implements the Unicode category ALL and the dot metacharacter when
 * in dotall mode.
 */
static final class All extends CharProperty {
    boolean isSatisfiedBy(int ch) {
        return true;
    }
}
/**
 * Node class for the dot metacharacter when dotall is not enabled.
 */
static final class Dot extends CharProperty {
    boolean isSatisfiedBy(int ch) {
        return (ch != '\n' && ch != '\r'
                && (ch|1) != '\u2029'
                && ch != '\u0085');
    }
}
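The Dot class above excludes exactly five code points (the (ch|1) test covers both U+2028 and U+2029). As a rough check through the public API, the sketch below tests a single dot against each of them. The class name DotLineTerminators and the helper dotMatches are hypothetical, and the internal classes may look different in other JDK versions.

```java
import java.util.regex.Pattern;

class DotLineTerminators {
    // Hypothetical helper: does a lone "." (default flags) match this one code point?
    static boolean dotMatches(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches(".", s);
    }

    public static void main(String[] args) {
        // LF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR, then two ordinary characters.
        int[] candidates = {0x0A, 0x0D, 0x85, 0x2028, 0x2029, 'a', '?'};
        for (int cp : candidates) {
            System.out.printf("U+%04X -> %b%n", cp, dotMatches(cp));
        }
    }
}
```

The first five lines should print false and the last two true, which is why the string containing \u0085 fails against ".*" while the one containing only question marks passes.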
The answer is in the question: 10 Unicode control characters \u0085.
\u0085 (NEL, next line) is treated as a line terminator, so by default .* does not match it, just like \n.
Unicode \u0085 is a newline character, so you have to either add (?s) (dot matches all) to the beginning of your regex or add the flag when compiling the regex.
Pattern.matches("(?s).*", "blahDeBlah\u0085Blah")
I believe the problem is that \u0085 represents a newline. If you want . to match across line terminators you need Pattern.DOTALL (Pattern.MULTILINE only changes how ^ and $ behave, so it does not help here). It's not the fact that it is Unicode: '\n' would fail too.
To use it: Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches()
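As a small sketch of how the two flags differ, assuming a made-up input with one \u0085 in the middle (the class name FlagComparison and the helper fullMatch are hypothetical):

```java
import java.util.regex.Pattern;

class FlagComparison {
    // Hypothetical helper: full match of regex against input under the given flags.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Two "lines" separated by NEL (U+0085).
        String input = "first\u0085second";

        // MULTILINE does not change what the dot matches.
        System.out.println(fullMatch(".*", Pattern.MULTILINE, input)); // false
        // DOTALL lets the dot cross the line terminator.
        System.out.println(fullMatch(".*", Pattern.DOTALL, input));    // true

        // What MULTILINE does change: ^ and $ also match at line boundaries.
        System.out.println(Pattern.compile("^second$").matcher(input).find());                    // false
        System.out.println(Pattern.compile("^second$", Pattern.MULTILINE).matcher(input).find()); // true
    }
}
```

So for matching a whole string that may contain line terminators, DOTALL (or the embedded (?s)) is the flag to reach for.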