java regex tricky pattern

Developer https://www.devze.com 2023-04-10 16:30 Source: Internet
I've been stuck for a while on a regex that should do the following:

  • split my sentences with this: "[\W+]"
  • but if it finds a word like this: "aaa-aa" (not "aaa - aa" or "aaa--aaa-aa"), the word isn't split; it is kept as a whole word.

    Basically, I want to split a sentence into words, while also treating "aaa-aa" as a single word. I have successfully done that by creating two separate functions, one for splitting with \w and another to find words like "aaa-aa". Finally, I combine both results and remove the parts of each compound word.

    For example, the sentence:

    "Hello my-name is Richard"

    First I collect {Hello, my, name, is, Richard}, then I collect {my-name}, then I add {my-name} to {Hello, my, name, is, Richard}, then I take {my} and {name} out of {Hello, my, name, is, Richard}. Result: {Hello, my-name, is, Richard}

    This approach does what I need, but for parsing large files it becomes too heavy, because each sentence requires too many copies. So my question is: is there anything I can do to include everything in one pattern? Like:

    "split the text using this pattern "[\W+]", but if you find a word like this "aaa-aa", consider it one word and not two words."


If you want to use a split() rather than explicitly matching the words you are interested in, the following should do what you want:

[\s-]{2,}|\s

To break that down, you first split on two or more whitespaces and/or hyphens. A single '-' won't match, so 'one-two' will be left alone, but something like 'one--two', 'one - two' or even 'one - --- - two' will be split into 'one' and 'two'.

That still leaves the 'normal' case of a single whitespace ('one two') unmatched, so we add an or ('|') followed by a single whitespace (\s).

Note that the order of the alternatives is important: RE subexpressions separated by '|' are evaluated left-to-right, so we need to put the spaces-and-hyphens alternative first. If we did it the other way around, when presented with something like 'one -two' we'd match on the first whitespace and return 'one', '-two'.
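
A minimal sketch of how that split could be tried out in Java. The class name HyphenAwareSplit and the helper splitWords are hypothetical names chosen for illustration; only the pattern itself comes from the answer above.

```java
import java.util.Arrays;

public class HyphenAwareSplit {
    // Pattern from the answer: two or more whitespace/hyphen characters,
    // or otherwise a single whitespace character.
    private static final String DELIMITER = "[\\s-]{2,}|\\s";

    // Hypothetical helper: splits a sentence on the delimiter above.
    static String[] splitWords(String sentence) {
        return sentence.split(DELIMITER);
    }

    public static void main(String[] args) {
        // A single hyphen between word characters is not a delimiter.
        System.out.println(Arrays.toString(splitWords("Hello my-name is Richard")));
        // Runs of spaces and hyphens are treated as one delimiter.
        System.out.println(Arrays.toString(splitWords("one - two")));
        System.out.println(Arrays.toString(splitWords("one--two")));
        System.out.println(Arrays.toString(splitWords("one -two")));
    }
}
```

With these inputs the first call should keep "my-name" as one token, and the other three should each yield "one" and "two".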

If you want to interactively play around with Java REs I can thoroughly recommend http://myregexp.com/signedJar.html which allows you to edit the RE and see it matching against a sample string as you edit the RE.


Why not use the pattern \\s+? This does exactly what you want without any tricks: it splits the text into words separated by whitespace.
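
A small sketch for checking this suggestion against the inputs from the question. The class name WhitespaceSplit and the helper splitWords are hypothetical. Note that a whitespace-only split leaves a standalone "-" as its own token and does not strip punctuation.

```java
import java.util.Arrays;

public class WhitespaceSplit {
    // Hypothetical helper: splits on runs of whitespace only.
    static String[] splitWords(String sentence) {
        return sentence.split("\\s+");
    }

    public static void main(String[] args) {
        // "my-name" stays together because it contains no whitespace.
        System.out.println(Arrays.toString(splitWords("Hello my-name is Richard")));
        // The spaced hyphen becomes a separate "-" token.
        System.out.println(Arrays.toString(splitWords("aaa - aa")));
    }
}
```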


Your description isn't clear enough, but why not just split it up by spaces?


I am not sure whether this pattern will work, because I don't have developer tools for Java, but you might try it. It uses character class subtraction, which is supported only in Java regex as far as I know:

[\W&&[^-]]+

It means: match characters that are in [\W] and also in [^-], that is, non-word characters other than the hyphen.
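
Since the answer was untested, here is a rough sketch for trying it out. The class name IntersectionSplit and the helper splitWords are hypothetical. Because the hyphen is excluded from the delimiter class entirely, inputs such as "aaa - aa" and "aaa--aaa-aa" are not handled the way the question asks.

```java
import java.util.Arrays;

public class IntersectionSplit {
    // Non-word characters intersected with "anything but a hyphen":
    // hyphens are never treated as delimiters by this pattern.
    private static final String DELIMITER = "[\\W&&[^-]]+";

    // Hypothetical helper: splits a sentence on the delimiter above.
    static String[] splitWords(String sentence) {
        return sentence.split(DELIMITER);
    }

    public static void main(String[] args) {
        // Punctuation and spaces are delimiters; "my-name" is kept whole.
        System.out.println(Arrays.toString(splitWords("Hello, my-name is Richard!")));
        // A standalone hyphen survives as its own token.
        System.out.println(Arrays.toString(splitWords("aaa - aa")));
        // Consecutive hyphens are not split at all.
        System.out.println(Arrays.toString(splitWords("aaa--aaa-aa")));
    }
}
```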


Almost the same regular expression as in your previous question:

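import java.util.regex.Matcher;
import java.util.regex.Pattern;

String sentence = "Hello my-name is Richard";
// A run of word characters, optionally followed by one hyphen and another run,
// with no word character directly before or after the match.
Pattern pattern = Pattern.compile("(?<!\\w)\\w+(-\\w+)?(?!\\w)");
Matcher matcher = pattern.matcher(sentence);
while (matcher.find()) {
    System.out.println(matcher.group()); // one word per line
}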

I just added the optional group (...)? so that it also matches non-hyphenated words.
