Pattern Parsing Java

Developer https://www.devze.com 2023-02-25 06:30 Source: web
Pretend my goal in a program is to parse as many occurrences of "ab" out of a string as I can. I approach this problem with the following code:

public static void main(String[] args)
{
    final String expression = "^(\\s*ab)";

    Scanner scanner = new Scanner("ab abab  ab");

    while (scanner.hasNext())
    {
        String next = scanner.findWithinHorizon(expression, 0);

        if (next == null)
        {
            System.out.println("FAIL");
            break;
        }
        else
        {
            System.out.println(next);
        }
    }
}

The caret at the beginning of the expression is there to disallow anything but whitespace at the beginning of each read, as mentioned here. It prevents something like "cab" or "c ab" from being accepted. In fact, I would expect null to be returned and FAIL to be printed to the console if either of these two cases occurs. If I remove the caret from the expression, it works perfectly fine on input such as "ab abab ab", but fails to return null for "c ab". On the other hand, if I leave the caret in, then "c ab" returns null as expected but "ab abab ab" fails. How can I make this work?

Edit

My original post may have been a little vague. The example I gave above is a simpler version of my real problem. The pattern ab is a filler pattern that I would replace with something more interesting, say an email address regex or a hexadecimal value.

In my application, the input to the scanner is not a string, but an input stream of which I have no knowledge. My goal in the loop is to read in values one at a time from the input and verify their contents match some pattern. If they do, then I could do something more interesting with them. If not, then the program terminates.

In the above example, I would expect an input of ab abab ab to output:

ab
 ab
ab
  ab

I would expect c ab to output:

FAIL

and I would expect ab cab to output:

ab
FAIL


In the other thread you wanted to match the first occurrence of ab, so the caret was fine. If you want to match every occurrence of ab until another character occurs, try this expression: String expression = "\\G(\\s*ab)";

The \G means that each match must start at the position where the previous match ended (or at the start of the input for the first match).

If I use that with your code I get the following results:

  1. Input = "ab abab ab" , Output = "ab", " ab", "ab", " ab"

  2. Input = "cab abab ab", Output = "FAIL"

  3. Input = "ab c abab ab", Output = "ab", "FAIL"

  4. Input = "ab abab abc", Output = "ab", " ab", "ab", " ab", "FAIL"
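
For illustration, the same \G anchoring can also be tried with java.util.regex.Matcher directly, without a Scanner. This is only a minimal sketch, not the code used to produce the results above: the class name ContiguousMatches and the helper parse are hypothetical, and the final check that emits FAIL for leftover input is an assumption about how unmatched text should be reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContiguousMatches {
    // Hypothetical helper: collects every "\s*ab" token that starts exactly
    // where the previous token ended, then appends "FAIL" if any
    // non-whitespace input is left over.
    static List<String> parse(String input) {
        Matcher m = Pattern.compile("\\G(\\s*ab)").matcher(input);
        List<String> out = new ArrayList<>();
        int end = 0; // end offset of the last successful match
        while (m.find()) {
            out.add(m.group(1));
            end = m.end();
        }
        // Assumption: trailing whitespace is acceptable, anything else is not.
        if (!input.substring(end).trim().isEmpty()) {
            out.add("FAIL");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("ab abab  ab")); // [ab,  ab, ab,   ab]
        System.out.println(parse("c ab"));        // [FAIL]
        System.out.println(parse("ab cab"));      // [ab, FAIL]
    }
}
```

Because \G refuses to skip ahead, the loop stops at the first character that does not belong to a token, which is what lets the leftover check detect inputs such as "ab cab".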


Well... I think you may do this with one call of regex

Try the following pattern:

expression = "^(\\s*ab*)*$";
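
A whole-input check along these lines could be sketched as follows. Note that ab* in the pattern above means an a followed by any number of b characters, so it also accepts strings such as "abbb" or "aaa"; the sketch assumes the intent was to repeat the whole ab token and uses (\s*ab)* instead. The class name WholeInputCheck and both helper methods are hypothetical, and allowing trailing whitespace is an added assumption.

```java
class WholeInputCheck {
    // Pattern as posted: the * after b repeats only the letter b.
    static boolean matchesAsPosted(String input) {
        return input.matches("^(\\s*ab*)*$");
    }

    // Assumed intent: repeat the whole "ab" token, allow trailing whitespace.
    // String#matches already requires the entire input to match, so the
    // ^ and $ anchors are not needed here.
    static boolean isValid(String input) {
        return input.matches("(\\s*ab)*\\s*");
    }

    public static void main(String[] args) {
        System.out.println(isValid("ab abab  ab"));   // true
        System.out.println(isValid("c ab"));          // false
        System.out.println(isValid("ab cab"));        // false
        System.out.println(matchesAsPosted("abbb"));  // true
        System.out.println(isValid("abbb"));          // false
    }
}
```

This only answers yes or no for the complete string; it does not hand back the individual tokens one at a time, and it needs the whole input in memory rather than a stream.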


If I've understood your question correctly, the fault is in the expression. If you always want whitespace at the beginning, you should use ^(\s+) and not ^(\s*), since * allows zero occurrences while + means at least one.
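
The difference between the two quantifiers can be seen in a small sketch. The class name QuantifierDemo and its two helper methods are hypothetical and exist only for illustration.

```java
class QuantifierDemo {
    // \s* allows zero or more whitespace characters before "ab".
    static boolean optionalSpace(String s) {
        return s.matches("\\s*ab");
    }

    // \s+ requires at least one whitespace character before "ab".
    static boolean requiredSpace(String s) {
        return s.matches("\\s+ab");
    }

    public static void main(String[] args) {
        System.out.println(optionalSpace("ab"));  // true
        System.out.println(requiredSpace("ab"));  // false
        System.out.println(optionalSpace(" ab")); // true
        System.out.println(requiredSpace(" ab")); // true
    }
}
```

Keep in mind that requiring whitespace would also reject an "ab" at the very start of the input, such as the first token in "ab abab ab", so this change may not fit the inputs shown in the question.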


Please understand that the findWithinHorizon method in Scanner finds the next occurrence of a pattern constructed from the specified string; it does NOT match the whole input. If you write a regex that matches the whole input, it will just return the input text as is (as per VMykyt's answer here). But that is not what you want, as I understand it.

So you need to make a separate call to the String#matches method to make sure there is nothing but spaces in front of your text, and if it matches, then just find all ab occurrences.

Consider this minor change in your code:

public static void main(String[] args) {
   matchIt("ab abab  ab");
   matchIt("c ab");
   matchIt("cab");
}

private static void matchIt(String str) {
   final String expression = "ab";
   System.out.println("Input: [" + str + ']');
   Scanner scanner = new Scanner(str);

   if(str.matches("^\\s*ab.*$")) {
      while (scanner.hasNext()) {
         String next = scanner.findWithinHorizon(expression, 0);
         if (next == null) {
            System.out.println("FAIL");
            break;
         }
         else {
            System.out.println(next);
         }
      }
   }
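   else
      System.out.println("FAIL");
   // Separator line, as shown between inputs in the output below
   System.out.println("===========================");
}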

OUTPUT:

Input: [ab abab  ab]
ab
ab
ab
ab
===========================
Input: [c ab]
FAIL
===========================
Input: [cab]
FAIL
===========================