codingbat wordEnds using regex_问答_开发者_运维开发者技术经验分享

I'm trying to solve wordEnds from codingbat.com using regex.

Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
wordEnds("XYXY", "XY") → "XY"

This is the simplest as I can make it with my current knowledge of regex:

public String wordEnds(String str, String word) {
  return str.replaceAll(
     ".*?(?=word)(?<开发者_如何学Python=(.|^))word(?=(.|$))|.+"
       .replace("word", java.util.regex.Pattern.quote(word)),
     "$1$2"
  );
}

replace is used to place in the actual word string into the pattern for readability. Pattern.quote isn't necessary to pass their tests, but I think it's required for a proper regex-based solution.

The regex has two major parts:

If after matching as few characters as possible ".*?", word can still be found "(?=word)", then lookbehind to capture any character immediately preceding it "(?<=(.|^))", match "word", and lookforward to capture any character following it "(?=(.|$))".
- The initial "if" test ensures that the atomic lookbehind captures only if there's a word
- Using lookahead to capture the following character doesn't consume it, so it can be used as part of further matching
Otherwise match what's left "|.+"
- Groups 1 and 2 would capture empty strings

I think this works in all cases, but it's obviously quite complex. I'm just wondering if others can suggest a simpler regex to do this.

Note: I'm not looking for a solution using indexOf and a loop. I want a regex-based replaceAll solution. I also need a working regex that passes all codingbat tests.

I managed to reduce the occurrence of word within the pattern to just one.

".+?(?<=(^|.)word)(?=(.?))|.+"

I'm still looking if it's possible to simplify this further, but I also have another question:

With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?

Based on your solution I managed to simplify the code a little bit:

public String wordEnds(String str, String word) {
  return str.replaceAll(".*?(?="+word+")(?<=(.|^))"+word+"(?=(.|$))|.+","$1$2");
}

Another way of writing it would be:

public String wordEnds(String str, String word) {
  return str.replaceAll(
     String.format(".*?(?="+word+")(?<=(.|^))"+word+"(?=(.|$))|.+",word),
     "$1$2");
}

With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?

In Oracle's implementation, the behavior of look-behind is as follow:

By "studying" the regex (with study() method in each node), it knows the maximum length and minimum length of the pattern in look-behind group. (The study() method is what allows for obvious look-behind length)
It verifies the look-behind by starting a match at every position from index (current - min_length) to position (current - max_length) and exits early if the condition is satisfied.

Effectively, it will try to verify the look-behind on the shortest string first.

The implementation multiplies the matching complexity by O(k) factor.

This explains why changing ^|. to .? doesn't work: due to the starting position, it effectively checks for word before .word. The quantifier doesn't have a say here, since the ordering is imposed by the match range.

You can check the code of match method in Pattern.Behind and Pattern.NotBehind inner classes to verify what I said above.

In .NET's flavor, look-behind is likely implemented by the reverse matching feature, which means that no extra factor is incurred on the matching complexity.

My suspicion comes from the fact that the capturing group in (?<=(a+))b matches all a's in aaaaaaaaaaaaaab. The quantifier is shown to have free reign in look-behind group.

I have tested that ^|. can be simplified to .? in .NET and the regex works correctly.

I am working in .NET's regex but I was able to change your pattern to:

.+?(?<=(\w?)word)(?=(\w?))|.+

with the positive results. You know its a word (alphanumeric) type character, why not give a valid hint to the parser of that fact; instead of any character its an optional alpha numeric character.

It may answer why you don't need to specify the anchors of ^ and $, for what exactly is $ - is it \r or \n or other? (.NET has issues with $, and maybe you are not exactly capturing a Null of $, but the null of \r or \n which allowed you to change to .? for $)

Another solution to look at...

public String wordEnds(String str, String word) {
  if(str.equals(word)) return "";
  int i = 0;
  String result = "";
  int stringLen = str.length();
  int wordLen = word.length();
  int diffLen = stringLen - wordLen;
  
  while(i<=diffLen){
    if(i==0 && str.substring(i,i+wordLen).equals(word)){
      result = result + str.charAt(i+wordLen);
    }else if(i==diffLen && str.substring(i,i+wordLen).equals(word)){
      result = result + str.charAt(i-1);
    }else if(str.substring(i,i+wordLen).equals(word)){
      result = result + str.charAt(i-1) + str.charAt(i+wordLen) ;
    }
    
    i++;
  }
  
  if(result.length()==1) result = result + result;
  
  return result;
}

Another possible solution:

public String wordEnds(String str, String word) {
  String result = "";
  
  if (str.contains(word)) {
    for (int i = 0; i < str.length(); i++) {
      if (str.startsWith(word, i)) {
        if (i > 0) {
        result += str.charAt(i - 1);
        }
        if ((i + word.length()) < str.length()) {
        result += str.charAt(i + word.length());
        }
      }
    }
  }
  
  return result;
}