开发者

How does this regex replacement reverse a string?

开发者 https://www.devze.com 2023-01-16 07:12 出处:网络
This is the fourth part in a series of educational regex articles. It show how the combination of nested reference (see: How does this regex find triangular numbers?) to "count" within asse

This is the fourth part in a series of educational regex articles. It show how the combination of nested reference (see: How does this regex find triangular numbers?) to "count" within assertions (see: How can we match a^n b^n with Java regex?) can be used to reverse a string. The programmatically generated pattern uses meta-pattern abstractions (see: How does this Java regex detect palindromes?). For the first time in the series, these techniques are used for replacement instead of whole string matching.

Complete working Java and C# implementations are provided. Inspirational quotes included.

Reversing a string using regular expressions never seemed like a good idea, nor was it even immediately obvious if it was at all possible, and if so, how one might attempt to do so.

While it's still not a good idea, at least now we know that it's possible, because here's one way to do it:

C# (also on ideone.com)

using System;
using System.Text.RegularExpressions;

public class TwoDollarReversal {    
public static void Main() {
   string REVERSE = 
      @"(?sx) . grab$2"
         .Replace("grab$2",
            ForEachDotBehind(
               AssertSuffix(@"((.) \1?)")
            )
         );
   Console.WriteLine(
      Regex.Replace(
         @"nietsniE treblA --
         hguone llew ti dnatsrednu t'nod uoy ,ylpmis ti nialpxe t'nac uoy fI",

         REVERSE, "$2"
      )
   );
   // If you can't explain it simply, you don't understand it well enough
   // -- Albert Einstein
}      
// performs an assertion for each dot behind current position
static string ForEachDotBehind(string assertion) {
   return "(?<=(?:.assertion)*)".Replace("assertion", assertion);
}
// asserts that the suffix of the string matches a given pattern
static string AssertSuffix(string pattern) {
   return "(?=.*$(?<=pattern))".Replace("pattern", pattern);
}

}

Java (also on ideone.com)

class TwoDollarReversal {

public static void main(String[] args) {
   String REVERSE =
      "(?sx) . grab$2"
         .replace("grab$2",
            forEachDotBehind(
               assertSuffix("((.) \\1?)")
            )
         );

   System.out.println(
      "taerG eht rednaxelA --\nyrt lliw ohw mih ot elbissopmi gnihton si erehT"
         .replaceAll(REVERSE, "$2")
   );
   // There is nothing impossible to him who will try
   // -- Alexander the Great"
}

static String forEachDotBehind(String assertion) {
   return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
}
static String assertSuffix(String pattern) {
   return "(?<=(?=^.*?pattern$).*)".replace("pattern", pattern);
}

}

Both the C# and Java versions seem to use the same overall algorithm, with minor variations only in the abstracted implementation details.

Clearly this is not the best, most straightforward, most efficient way to reverse a 开发者_如何学Cstring. That said, in the interest of learning about regex; how to conceptualize patterns; how the engine works to match them; how to put various parts together to build what we want; how to do so in a way that is readable and maintainable; and just for the sheer joy of learning something new, can we have an explanation of how this works?


Appendix: Cheat sheet!

This is a brief description of the basic regex constructs used:

  • (?sx) is the embedded flag modifiers. s enables the "single-line" mode, allowing the dot to match ANY character (including newlines). x enables the free-spacing mode, where unescaped whitespaces are ignored (and # can be used for comments).
  • ^ and $ are the beginning and end-of-the-line anchors.
  • ? as a repetition specifier denotes optional (i.e. zero-or-one of). As a repetition quantifier in e.g. .*? it denotes that the * (i.e. zero-or-more of) repetition is reluctant/non-greedy.
  • (…) are used for grouping. (?:…) is a non-capturing group. A capturing group saves the string it matches; it allows back/forward/nested references (e.g. \1), replacement substitution (e.g. $2), etc.
  • (?=…) is a positive lookahead; it looks to the right to assert that there's a match of the given pattern. (?<=…) is a positive lookbehind; it looks to the left.

Language references/additional resources

  • MSDN - Regular Expression Language Elements -- System.Text.RegularExpressions
  • Java Tutorials/Essential classes/Regular expressions -- java.util.regex.Pattern


Overview

At a high level, the pattern matches any one character ., but additionally performs a grab$2 action, which captures the reversal "mate" of the character that was matched into group 2. This capture is done by building a suffix of the input string whose length matches the length of the prefix up to the current position. We do this by applying assertSuffix on a pattern that grows the suffix by one character, repeating this once forEachDotBehind. Group 1 captures this suffix. The first character of that suffix, captured in group 2, is the reversal "mate" for the character that was matched.

Thus, replacing each matched character with its "mate" has the effect of reversing a string.


How it works: a simpler example

To better understand how the regex pattern works, let's first apply it on a simpler input. Also, for our replacement pattern, we'll just "dump" out all the captured strings so we get a better idea of what's going on. Here's a Java version:

System.out.println(
    "123456789"
        .replaceAll(REVERSE, "[$0; $1; $2]\n")
);

The above prints (as seen on ideone.com):

[1; 9; 9]
[2; 89; 8]
[3; 789; 7]
[4; 6789; 6]
[5; 56789; 5]
[6; 456789; 4]
[7; 3456789; 3]
[8; 23456789; 2]
[9; 123456789; 1]

Thus, e.g. [3; 789; 7] means that the dot matched 3 (captured in group 0), the corresponding suffix is 789 (group 1), whose first character is 7 (group 2). Note that 7 is 3's "mate".

                   current position after
                      the dot matched 3
                              ↓        ________
                      1  2 [3] 4  5  6 (7) 8  9
                      \______/         \______/
                       3 dots        corresponding
                       behind      suffix of length 3

Note that a character's "mate" may be to its right or left. A character may even be its own "mate".


How the suffix is built: nested reference

The pattern responsible for matching and building the growing suffix is the following:

    ((.) \1?)
    |\_/    |
    | 2     |       "suffix := (.) + suffix
    |_______|                    or just (.) if there's no suffix"
        1

Note that within the definition of group 1 is a reference to itself (with \1), though it is optional (with ?). The optional part provides the "base case", a way for the group to match without the reference to itself. This is required because an attempt to match a group reference always fails when the group hasn't captured anything yet.

Once group 1 captures something, the optional part is never exercised in our setup, since the suffix that we just captured last time will still be there this time, and we can always prepend another character to the beginning of this suffix with (.). This prepended character is captured into group 2.

Thus this pattern attempts to grow the suffix by one dot. Repeating this once forEachDotBehind will therefore results in a suffix whose length is exactly the length of the prefix up to our current position.


How assertSuffix and forEachDotBehind work: meta-pattern abstractions

Note that so far we've treated assertSuffix and forEachDotBehind as blackboxes. In fact, leaving this discussion for last is a deliberate act: the names and the brief documentation suggest WHAT they do, and this was enough information for us to write and read our REVERSE pattern!

Upon closer inspection, we see that the Java and C# implementations of these abstractions slightly differ. This is due to the differences between the two regex engines.

The .NET regex engine allows full regular expression in a lookbehind, so these meta-patterns look a lot more natural in that flavor.

  • AssertSuffix(pattern) := (?=.*$(?<=pattern)), i.e. we use a lookahead to go all the way to the end of the string, then use a nested lookbehind to match the pattern against a suffix.
  • ForEachDotBehind(assertion) := (?<=(?:.assertion)*), i.e. we simply match .* in a lookbehind, tagging the assertion along with the dot inside a non-capturing group.

Since Java's doesn't officially support infinite-length lookbehind (but it works anyway under certain circumstances), its counterpart is a bit more awkward:

  • assertSuffix(pattern) := (?<=(?=^.*?pattern$).*), i.e. we use a lookbehind to go all the way to the beginning of the string, then use a nested lookahead to match the entire string, prepending the suffix pattern with .*? to reluctantly match some irrelevant prefix.
  • forEachDotBehind(assertion) := (?<=^(?:.assertion)*?), i.e. we use an anchored lookbehind with reluctant repetition, i.e. ^.*? (and likewise tagging the assertion along with the dot inside a non-capturing group).

It should be noted that while the C# implementation of these meta-patterns doesn't work in Java, the Java implementation DOES work in C# (see on ideone.com). Thus, there is no actual need to have different implementations for C# and Java, but the C# implementation deliberately took advantage of the more powerful .NET regex engine lookbehind support to express the patterns more naturally.

We have thus shown the benefits of using meta-pattern abstractions:

  • We can independently develop, examine, test, optimize, etc. these meta-patterns implementations, perhaps taking advantage of flavor-specific features for extra performance and/or readability.
  • Once these building blocks are developed and well-tested, we can simply use them as parts of a bigger pattern, which allows us to express ideas at higher levels for more readable, more maintainable, more portable solutions.
  • Meta-patterns promote reuse, and programmatic generation means there's less duplication

While this particular manifestation of the concept is rather primitive, it's also possible to take this further and develop a more robust programmatic pattern generation framework, with a library of well-tested and optimized meta-patterns.

See also

  • Martin Fowler - Composed Regex
  • .NET regular expressions - Balancing group definition - a great example of a meta-pattern!

Closing thoughts

It needs to be reiterated that reversing a string with regex is NOT a good idea in practice. It's way more complicated than necessary, and the performance is quite poor.

That said, this article shows that it CAN in fact be done, and that when expressed at higher levels using meta-pattern abstractions, the solution is in fact quite readable. As a key component of the solution, the nested reference is showcased once again in what is hopefully another engaging example.

Less tangibly, perhaps the article also shows the determination required to solve a problem that may seem difficult (or even "impossible") at first. Perhaps it also shows the clarity of thought that comes with a deeper understanding of a subject matter, a result of numerous studies and hard work.

No doubt regex can be an intimidating subject, and certainly it's not designed to solve all of your problems. This is no excuse for hateful ignorance, however, and this is one surprisingly deep well of knowledge if you're willing to learn.


using regex to reverse is not as complex as some think - just do it by powers of 2, with a few loops that is essentially a rehash of the typical bit-reversal trick applied over arbitrary strings, so # of loop cycles needed is simply the integer log-base-2 of the string length, instead of one character at a time :

 LANG="en_US.UTF-8" echo '1234567890123\nABCDefghi\n코요태이하픽에' | 
 LANG="en_US.UTF-8" gawk -e '

   # gawk profile, created Sun Dec  4 17:22:55 2022

   # Functions, listed alphabetically

 5  function _____(__,___,_,____,______)
    {
 5      if ((___=(_=_<_)<+___ ? +___ : length(__))<=(++_+_)) {

            return substr((__)__,___,___*=_<=___)
        }

16      do { ++____ } while ((_+=_)<___)

 5      if (___<_) { # 5

 5        ______ = (___-(_=(_^!_+_^!_)^—____))<=_^!_ \
                   ?       substr(__,___) \
                   : _____(substr(__,_+!!_),___-_)

 5            __ = substr(__,!!_,___=_)
         }
 5       ___ = "."
11       while (____--) {

11            sub("$", "\3", __)
             gsub("(^|" (___)  (")")___, "\3&\5&", __)
             gsub("\5|(^|" (___) ")\3("(___)"|$)", "",__)

11            ___=(___)___
         }
 5       return (______)__
      }

      BEGIN {
 1       OFS = "\f\r\t"
      }

 3    ($++NF = _____($!_))^_'
 1  1234567890123
    3210987654321

 2  ABCDefghi
    ihgfeDCBA

 3  코요태이하픽에
    에픽하이태요코
0

精彩评论

暂无评论...
验证码 换一张
取 消