Java regex always fails

Developer https://www.devze.com 2023-03-05 15:33 Source: Internet
I have a Java regex pattern and a sentence I'd like to match completely, but for some sentences the match erroneously fails. Why is this? (For simplicity, I won't use my complex regex, just ".*".)

System.out.println(Pattern.matches(".*", "asdf"));
System.out.println(Pattern.matches(".*", "[11:04:34] <@Aimbotter> 1 more thing"));
System.out.println(Pattern.matches(".*", "[11:04:35] <@Aimbotter> Dialogue: 0,0:00:00.00,0:00:00.00,Default,{Orginal LV,0000,0000,0000,,[???]??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????} "));
System.out.println(Pattern.matches(".*", "[11:04:35] <@Aimbotter> Dialogue: 0,0:00:00.00,0:00:00.00,Default,{Orginal LV,0000,0000,0000,,[???]????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????} "));

Output:

true
true
true
false

Note that the fourth sentence contains 10 Unicode control characters \u0085 in between the question marks, which aren't shown by normal fonts. The third and fourth sentences actually contain the same number of characters!
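Since the invisible characters get lost when the sentence is pasted, here is a minimal reproduction with the \u0085 written as an explicit escape. The class name DotRepro, the helper matchesAll and the shortened sample strings are made up for illustration; only the Pattern.matches(".*", ...) call comes from the code above.

```java
import java.util.regex.Pattern;

// Hypothetical class name; the strings are shortened stand-ins for the
// sentences in the question.
final class DotRepro {
    // Same call as in the question: the whole input must match ".*".
    static boolean matchesAll(String input) {
        return Pattern.matches(".*", input);
    }

    public static void main(String[] args) {
        String plain = "[???]??????";            // no control characters
        String withNel = "[???]???\u0085???";    // one U+0085 between the question marks

        System.out.println(matchesAll(plain));   // true
        System.out.println(matchesAll(withNel)); // false
    }
}
```

A single \u0085 anywhere in the input is enough to make the full match fail.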


Use

Pattern.compile(".*", Pattern.DOTALL)

if you want . to match line terminators such as \u0085. By default it matches every character except line terminators.

From JavaDoc:

"In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)"

Code in Pattern (there is your \u0085):

/**
 * Implements the Unicode category ALL and the dot metacharacter when
 * in dotall mode.
 */
static final class All extends CharProperty {
    boolean isSatisfiedBy(int ch) {
        return true;
    }
}

/**
 * Node class for the dot metacharacter when dotall is not enabled.
 */
static final class Dot extends CharProperty {
    boolean isSatisfiedBy(int ch) {
        return (ch != '\n' && ch != '\r'
                    && (ch|1) != '\u2029'
                    && ch != '\u0085');
    }
}
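To see the effect of that Dot node from the outside, the following sketch feeds each excluded character to a single-dot pattern. The class and method names are hypothetical. The code points are the ones in the isSatisfiedBy check, where (ch|1) != '\u2029' covers both \u2028 and \u2029.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of java.util.regex.
final class DotTerminators {
    // True if a lone "." (default flags) matches the given code point.
    static boolean dotMatches(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches(".", s);
    }

    public static void main(String[] args) {
        // The characters rejected by the Dot node shown above.
        int[] excluded = {'\n', '\r', 0x0085, 0x2028, 0x2029};
        for (int cp : excluded) {
            System.out.printf("U+%04X -> %b%n", cp, dotMatches(cp));
        }
        // An ordinary letter for comparison.
        System.out.printf("U+%04X -> %b%n", (int) 'a', dotMatches('a'));
    }
}
```

Each of the five excluded code points should print false, and the letter a should print true.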


The answer is in the question: the 10 Unicode control characters \u0085.

\u0085 is a line terminator, so .* does not match it by default, just as it does not match \n.


Unicode \u0085 is a newline (NEL, "next line"), so you have to either add (?s) (dot matches all) to the beginning of your regex or pass the flag when compiling the regex.

Pattern.matches("(?s).*", "blahDeBlah\u0085Blah")


The problem, I believe, is that \u0085 represents a newline. If you want . to match across newlines, you need Pattern.DOTALL; Pattern.MULTILINE only changes where ^ and $ match and does not affect the dot. It's not the fact that it is Unicode: '\n' would fail too.

To use it: Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches()
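A quick way to check which flag matters here, as a sketch with a hypothetical helper and the sample string from the (?s) answer: Pattern.MULTILINE only changes where ^ and $ match, so the dot still stops at \u0085, while Pattern.DOTALL lets the dot cross it.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for comparing flags.
final class FlagCheck {
    // Compiles ".*" with the given flags and requires a full match.
    static boolean fullMatch(int flags, String input) {
        return Pattern.compile(".*", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "blahDeBlah\u0085Blah";
        System.out.println(fullMatch(0, input));                 // false
        System.out.println(fullMatch(Pattern.MULTILINE, input)); // false
        System.out.println(fullMatch(Pattern.DOTALL, input));    // true
    }
}
```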
