开发者

How does this Java regex detect palindromes?

开发者 https://www.devze.com 2023-01-15 00:55 出处:网络
This is the third part in a series of educational regex articles. It follows How does this regex find triangular numbers? (where nested references is first introduced) and How can we match a^n b^n wi

This is the third part in a series of educational regex articles. It follows How does this regex find triangular numbers? (where nested references is first introduced) and How can we match a^n b^n with Java regex? (where the lookahead "counting" mechanism is further elaborat开发者_StackOverflowed upon). This part introduces a specific form of nested assertion, which when combined with nested references allows Java regex to match what most people believe is "impossible": palindromes!!

The language of palindromes is non-regular; it's actually context-free (for a given alphabet). That said, modern regex implementation recognizes more than just regular languages, and Perl/PCRE's recursive patterns and .NET's balancing groups can readily recognize palindromes (see: Related Questions).

However, Java's regex engine supports neither of these "advanced" features. And yet "someone" (*wink*) managed to write the following regex which seems to do the job just fine (see also on ideone.com):

public class Palindrome {
    // asserts that the entirety of the string matches the given pattern
    static String assertEntirety(String pattern) {
        return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
    }

    public static void main(String[] args) {
        final String PALINDROME =
            "(?x) | (?:(.) add)+ chk"
                .replace("add", assertEntirety(".*? (\\1 \\2?)"))
                .replace("chk", assertEntirety("\\2"));

        System.out.println(PALINDROME);
        // (?x) | (?:(.) (?<=(?=^.*? (\1 \2?)$).*))+ (?<=(?=^\2$).*)
        
        String[] tests = {
            "",     // true
            "x",    // true
            "xx",   // true
            "xy",   // false
            "xyx",  // true
            "xxx",  // true
            "xxyx", // false
            "racecar",                // true
            "step on no pets",        // true
            "aManaPlanaCanalPanaMa",  // true
            "this is impossible",     // FALSE!!!
        };
        for (String test : tests) {
            System.out.printf("[%s] %s%n", test, test.matches(PALINDROME));
        }
    }
}

So this seems to work, but how?

References

  • java.util.regex.Pattern
  • regular-expressions.info/Freespacing (?x), Lookarounds (?=…)/(?<=…), etc.

COMMON SENSE ALERT!!!

This is not the best way to detect palindromes; it's O(N^3)at best. Performing this detection in a more general purpose programming language is both more efficient and more straightforward.

You wouldn't want to use regex to detect palindromes for the same reasons you wouldn't want to use regex to find prime numbers. That said, you would study how a non-recursive non-balancing group regex can detect palindromes for the same reasons you would study how a regex can be used for primality testing: it's fun, it's challenging, it's educational.

Related questions

  • How to check that a string is a palindrome using regular expressions? - it's "impossible"! (unless...)
  • How to check if the given string is palindrome? - non-regex solutions in many languages
  • How to determine if a number is a prime with regex?


The Big Picture

We will first look at this regex from the overall big picture algorithm, and then take a closer look at the specific implementation details later. The regex is an almost direct translation of the following Java code:

static boolean isPalindrome(String s) {
   if (s.isEmpty()) {
      return true;
   }
   String g2 = null;
   for (char ch : s.toCharArray()) {
      String g1 = String.valueOf(ch);
      // "add"
      if (g2 != null && s.endsWith(g1 + g2)) {
         g2 = g1 + g2;
      } else if (s.endsWith(g1)) {
         g2 = g1;
      } else {
         break;
      }
   }
   return s.equals(g2); // "chk"
}

This is obviously not the most straightforward/efficient Java code to check for palindromes, but it works, and most fascinatingly, it's almost directly translatable to regex with a near one-to-one mapping. Here's the regex again, reproduced here for convenience, annotated to highlight the striking resemblance:

//  isEmpty  _for-loop_
//       ↓  /          \
    "(?x) | (?:(.) add)+ chk"
//             \_/  ↑
//             g1   loop body                   ___g2___
//                                             /        \
           .replace("add", assertEntirety(".*? (\\1 \\2?)"))
           .replace("chk", assertEntirety("\\2"));
                           // s.equals(g2)

Attachment: annotated and expanded version of the source code on ideone.com

(Feel free to ignore the details of assertEntirety for now: just think of it as a black box regex mechanism that can make an assertion on the entire string regardless of where we currently are.)

So the basic algorithm is that we try to build a suffix, subject to a palindromic constraint, as we scan the string left to right. We then check if we're able to build the complete string in this manner. If we can, then the string is a palindrome. Also, as a special case, the empty string is trivially a palindrome.

Once the big picture algorithm is understood, we can examine how the regex pattern implements it.


What's with all the String.replace?

Regex patterns in Java are ultimately nothing but strings, meaning they can be derived through string manipulations the way any string can. Yes, we can even use regex to generate a regex pattern -- a sort of meta-regexing approach, if you will.

Consider this example of initializing an int constant (which ultimately contains nothing but a number):

final int X = 604800;
final int Y = 60 * 60 * 24 * 7;
// now X == Y

The number assigned to X is a literal integer: we can clearly see what the number is. This is not the case with Y which uses an expression instead, and yet this formula seems to convey an idea of what this number represents. Even without proper naming of these constants, we nonetheless get the idea that Y probably represents the number of seconds in a week, even if we may not immediately know what the numeric value is. On the other hand, with X we know precisely that number, but we get less of an idea of what it represents.

The use of string replacements in the snippet is an analogous situation but for strings regex patterns. Instead of explicitly writing the pattern as one literal string, sometimes systematic and logical derivation (the "formula") of that value from simpler parts can be much more meaningful. This is especially true with regex, where it often matters more that we understand what the pattern does than being able to see what it looks like as a string literal (which isn't much of a looker anyway, what with all the extra backslashes).

A portion of the snippet is reproduced here again for convenience:

// the "formula"
     final String PALINDROME =
        "(?x) | (?:(.) add)+ chk"
           .replace("add", assertEntirety(".*? (\\1 \\2?)"))
           .replace("chk", assertEntirety("\\2"));

// the "value"
     System.out.println(PALINDROME);
     //                       ____add_____             chk_
     //               _______/            \____   _______/ \_____
     // (?x) | (?:(.) (?<=(?=^.*? (\1 \2?)$).*))+ (?<=(?=^\2$).*)
     //        |  \_/             \______/     |
     //        |   1                 2         |
     //        |_______________________________|

Without a doubt the "formula" is a lot more readable than the eventual string "value" in this case.

There are certainly much more sophisticated ways to programmatically generate a regex pattern, and it's certainly possible to write in such a way that obfuscates instead of accentuates its meaning, but mindful usage of even simple string replacements can still do wonder (as hopefully shown in this example).

Lesson: Consider programmatic generation of regex patterns.


How does add work?

The (?:(.) add)+ construct, where add is an assertion that does some sort of "counting", has already been thoroughly discussed in the previous two parts. Two features are worth noting, though:

  • The (.) captures into group 1, allowing backreference later on
  • The assertion is assertEntirety instead of just looking ahead from our current position
    • We'll discuss this in more detail later; just think of it as a way to assert on the entire string

The pattern applied to assertEntirety in add is the following:

# prefix   _suffix_
#    ↓    /        \
    .*?   ( \1 \2? )
#         \________/   i.e. a reluctant "whatever" prefix (as short as possible)
#          group 2          followed by a suffix captured into group 2

Note that group 2 is self-referencing with an optional specifier, a technique already discussed in part 2 of the series. Needless to say group 2 is our "counter" in this pattern: it's a suffix that we will try to grow leftward on every iteration of the "loop". As we iterate on each (.) left to right, we try to prepend that same character (using backreference to \1) to our suffix.

Recall again the Java code translation of the above pattern, reproduced here for convenience:

if (g2 != null && s.endsWith(g1 + g2)) {   // \2? is greedy, we try this first
   g2 = g1 + g2;
} else if (s.endsWith(g1)) {    // since \2? is optional, we may also try this
   g2 = g1;
} else {        // if there's no matching suffix, we "break" out of the "loop"
   break;
}

The fact that \2? is optional means a few things:

  • It provides a "base case" for the self-reference (the main reason we do this!)
  • Since \2? is part of the suffix pattern (and thus appears later in the overall pattern), the prefix part must be reluctant, hence .*? instead of .*. This allows \2? to exercise its greediness.
  • The "counter" may also "reset" and give the "wrong" result
    • In part 2 we showed how backtracking ? may result in the same kind of problematic resetting
      • We solved the problem by using possessive quantifier ?+, but this is not applicable here

The third point is elaborated further in the next section.

Lesson: Carefully analyze the interactions between greedy/reluctant repetitions in parts of a pattern.

Related questions

  • Difference between .*? and .* for regex
  • Regular expression: who's greedier?

Why do we need a chk phase?

As alluded in the previous section, the optional and backtrackable \2? means that our suffix can shrink under some circumstances. We will examine such a scenario step by step for this input:

 x y x y z y x
↑
# Initial state, \2 is "uninitialized"
             _
(x)y x y z y x
  ↑
  # \1 captured x, \2 couldn't match \1\2 (since \2 is "uninitialized")
  #                but it could match \1 so it captured x
           ___
 x(y)x y z y x
    ↑
    # \1 captured y, \2 matched \1\2 and grew to capture yx
             _
 x y(x)y z y x
      ↑
      # \1 captured x, \2 couldn't match \1\2,
      #                but it could match \1 so it shrunk to capture x (!!!)
           ___
 x y x(y)z y x
        ↑
        # \1 captured y, \2 matched \1\2 and grew to capture yx
         _____
 x y x y(z)y x
          ↑
          # \1 captured z, \2 matched \1\2 and grew to capture zyx
       _______
 x y x y z(y)x
            ↑
            # \1 captured y, \2 matched \1\2 and grew to capture yzyx
     _________
 x y x y z y(x)
              ↑
              # \1 captured x, \2 matched \1\2 and grew to capture xyzyx

We can modify our pattern (and the corresponding Java code) to omit the chk phase, and see that indeed this is what happens:

    // modified pattern without a chk phase; yields false positives!
    final String PALINDROME_BROKEN =
        "(?x) | (?:(.) add)+"
            .replace("add", assertEntirety(".*? (\\1 \\2?)"));

    String s = "xyxyzyx"; // NOT a palindrome!!!
    
    Matcher m = Pattern.compile(PALINDROME_BROKEN).matcher(s);
    if (m.matches()) {
        System.out.println(m.group(2)); // prints "xyzyx"
    }

As explained, "xyxyzyx", which is NOT a palindrome, is falsely reported as one, because we didn't check if the growing suffix eventually became the complete string (which it clearly did not in this case). The chk phase (which is an assertEntirety of the pattern \2) is therefore an absolute necessity in our setup. We need to confirm that indeed we managed to grow our suffix all the way. If this is the case, then we have ourselves a palindrome.

Lesson: Carefully analyze any possibly unintended side-effects of optional self-reference matching.


The Main Course: assertEntirety

While it's neat that we can write a Java regex pattern to detect palindromes, everything here except assertEntirety has already been covered in the previous parts of the series. The only new thing here is this mysterious black box, this powerful mechanism that magically allowed us to do what is otherwise "impossible".

The assertEntirety construct is based on the following meta-pattern of nested lookarounds:

(?<=(?=^pattern$).*)

" I can see a place somewhere behind me where I can look ahead and see ^pattern$ "

The name "lookaround" implies the relativity to our current position: we're looking around us, perhaps in front of or behind, from where we're standing. By nesting a lookahead in a lookbehind in this manner, we're able to metaphorically "fly into the skies" and look at the entire picture.

Abstracting this meta-pattern into assertEntirety is a bit like writing preprocessing substitution macros. Having nested lookarounds everywhere probably hurts readability and maintainability, so we encapsulate it into assertEntirety, which not only hides the complexity of its inner workings, but also further emphasizes its semantics by giving it an appropriate name.

Lesson: Consider abstracting meta-patterns to hide complexity and convey semantics.


Appendix: On infinite-length lookbehind in Java

Observant readers would notice that assertEntirety contains a .* in a lookbehind, which makes its theoretical maximum length infinite. No, Java does not officially support infinite-length lookbehind. Yes, as it has been adequatedly demonstrated here, it works anyway. Officially it's categorized as a "bug"; but "someone"(*wink*) could also consider it a "hidden feature".

It's certainly possible that this "bug" will be "fixed" in the future. Removal of this hidden feature will break this particular solution to the Java regex palindrome problem.

Related questions

  • Regex look-behind without obvious maximum length in Java
0

精彩评论

暂无评论...
验证码 换一张
取 消