Counting words with regular expression "\S+"

Developer https://www.devze.com 2023-03-24 12:38 Source: web
Why does wordCount end up being 1, rather than 5, in the code below?

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCount {

    public static void main(String[] args) {
        final Pattern wordCountRegularExpression = Pattern.compile("\\S+");
        final Matcher matcher = wordCountRegularExpression
                .matcher("one two three four five");
        int wordCount = 0;
        while (matcher.find()) {
            wordCount++;
        }
        System.out.println("wordCount: " + wordCount);
    }
}

Doesn't the pattern "\S+" match a word, since it means one or more non-space characters?

This does work by the way:

    final Pattern wordCountRegularExpression = Pattern.compile("\\b\\w+\\b");

But I still don't understand why the original code doesn't work.


Doesn't the pattern "\S+" match a word, since it means one or more non-space characters?

Yes.
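
A quick way to confirm this is to run the pattern against a string that is guaranteed to contain ordinary spaces (U+0020). The sketch below is only illustrative: the class and method names (PlainSpaceCount, countMatches) are made up for this example and are not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainSpaceCount {

    // Counts how many times the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // \u0020 is the ordinary space, written as an escape so that
        // copy/paste cannot silently replace it with another character.
        String input = "one\u0020two\u0020three\u0020four\u0020five";
        System.out.println("wordCount: " + countMatches("\\S+", input)); // prints "wordCount: 5"
    }
}
```

If this prints 5 while the original program prints 1, the difference is likely in the separator characters of the string literal rather than in the pattern.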


Using

    import java.util.regex.*;

in Java 7, the following pattern:

    Pattern.compile("\\S+");

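matches runs of non-whitespace characters, so it counts words, not spaces.

For the input "one two three four five" it returns 5 when the separators are ordinary spaces. The lowercase pattern "\\s+" is the one that matches the 4 runs of spaces and would return 4.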


It depends on what you're using to separate the words. When I copy the code from your question into my editor, I see plain old spaces (U+0020), but when I view the page source I see non-breaking spaces (U+00A0). By default, Java's \s doesn't treat the NBSP as a whitespace character, so \S+ matches the whole string as a single run, which gives a count of 1.
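
The NBSP explanation can be reproduced with a small test. This is a sketch under stated assumptions: the input is assumed to use U+00A0 between every word, the class and method names (NbspWordCount, countMatches) are hypothetical, and the last two patterns are possible workarounds rather than anything proposed in the question. The (?U) flag (Pattern.UNICODE_CHARACTER_CLASS) needs Java 7 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NbspWordCount {

    // Counts how many times the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumption: the words are separated by non-breaking spaces (U+00A0).
        String nbsp = "one\u00A0two\u00A0three\u00A0four\u00A0five";

        // Default \S treats NBSP as a non-space character: one long match.
        System.out.println(countMatches("\\S+", nbsp));            // prints 1

        // \w excludes NBSP, which is why the \b\w+\b variant still worked.
        System.out.println(countMatches("\\b\\w+\\b", nbsp));      // prints 5

        // Workaround 1: exclude NBSP explicitly.
        System.out.println(countMatches("[^\\s\\u00A0]+", nbsp));  // prints 5

        // Workaround 2: Unicode-aware character classes, where \s
        // covers every character with the White_Space property.
        System.out.println(countMatches("(?U)\\S+", nbsp));        // prints 5
    }
}
```

Excluding U+00A0 explicitly is the narrower fix; (?U) also changes what \w, \d and \b match, so it may affect other parts of a larger pattern.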

Now the question is: why am I seeing NBSPs in the string literal, but nowhere else? And why are they being converted to spaces when I copy/paste? Is anyone else seeing that?
