
How to split text with a regex so that the resulting words keep the separator?

开发者 https://www.devze.com 2023-03-28 23:20 Source: the web
I have some text and am using this simple regex to split it into words: [ \n]. It splits the text into words on spaces and line breaks.

I want to know whether there is a way to keep the whitespace or line break in each split word, because I will use it for simple sentence detection after some processing.

I'm using the String#split method.


You can use lookbehind as @Piotr Findeisen suggested (+1):

public class RegexExample {
    public static void main(String[] args) {
        String s = "firstWordWithSpaceAfter secondWordWithSpaceAfter wordWithLineBreakAfter\nlastWord";
        String[] sa = s.split("(?<=[ \\n])");
        for (String saa : sa) {
            System.out.println("[" + saa + "]");
        }
    }
}

Output:

[firstWordWithSpaceAfter ]
[secondWordWithSpaceAfter ]
[wordWithLineBreakAfter
]
[lastWord]

Short explanation:

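(?<=...) is a positive lookbehind: it matches at a position only if the text immediately before that position matches the regex that follows ?<= (in this case [ \\n]). The lookbehind itself consumes no characters.

[ \\n] is a character class that matches any one of the characters listed inside the brackets: a space or a line break. The backslash is doubled because the regex is written as a Java string literal.

So the whole regex says: split at every position where the preceding character is either a space or \n.

Since the match is a zero-width position and the space or \n is never part of it, split does not remove them.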
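To make the difference concrete, here is a minimal sketch that runs both patterns on the same input. The class name SplitComparison and its method names are made up for illustration. Because the lookbehind version consumes nothing, joining its pieces back together should give the original text.

```java
import java.util.Arrays;

class SplitComparison {
    // Plain character class: each space or line break is consumed by the match and dropped.
    static String[] splitDropping(String text) {
        return text.split("[ \\n]");
    }

    // Lookbehind: matches the empty position after a space or line break, so nothing is consumed.
    static String[] splitKeeping(String text) {
        return text.split("(?<=[ \\n])");
    }

    public static void main(String[] args) {
        String text = "one two\nthree";
        System.out.println(Arrays.toString(splitDropping(text)));
        // Concatenating the lookbehind pieces reproduces the input unchanged.
        System.out.println(String.join("", splitKeeping(text)).equals(text));
    }
}
```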


Consider using a zero-width positive lookbehind or lookahead. See the Pattern Javadoc, in the section on special constructs (non-capturing).
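As a rough illustration of the two constructs (the class and method names are hypothetical), a lookbehind leaves each separator at the end of the preceding piece, while a lookahead leaves it at the start of the following piece:

```java
class LookaroundSplit {
    // (?<=X): split after each separator, so it stays attached to the word before it.
    static String[] separatorAtEnd(String text) {
        return text.split("(?<=[ \\n])");
    }

    // (?=X): split before each separator, so it moves to the front of the word after it.
    static String[] separatorAtStart(String text) {
        return text.split("(?=[ \\n])");
    }

    public static void main(String[] args) {
        String text = "one two\nthree";
        for (String piece : separatorAtEnd(text)) {
            System.out.println("[" + piece + "]");
        }
        for (String piece : separatorAtStart(text)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

For sentence detection the lookbehind form is probably the more convenient one, since each word then carries the separator that ended it.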


I think your only option is to do something like this:

// Requires java.util.LinkedHashSet and java.util.Set
String myString = "Joe Blow\n1234 Fake Road\nHere, There, 12345";
String[] lines = myString.split("\\n");
Set<String[]> wordsByLine = new LinkedHashSet<String[]>();
for (String line : lines) {
  wordsByLine.add(line.split(" "));
}


Off the top of my head: if the regex always matches a single character, you can use the lengths of the split words to work out where each one sat in the original String. Then you can take a substring to recover the delimiting character.

It is a bit dirty, but it should do the trick.
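A possible sketch of that idea, assuming every separator is exactly one character (a space or a line break). The class OffsetSplit and its method are invented for this example, and charAt is used in place of a one-character substring:

```java
import java.util.ArrayList;
import java.util.List;

class OffsetSplit {
    // Assumption: each separator is a single character, so word lengths give exact offsets.
    static List<String> splitKeepingSeparator(String text) {
        // Limit -1 keeps trailing empty strings, so the running offset stays aligned.
        String[] words = text.split("[ \\n]", -1);
        List<String> result = new ArrayList<>();
        int offset = 0;
        for (String word : words) {
            offset += word.length(); // now the index of the separator after this word, if any
            if (offset < text.length()) {
                result.add(word + text.charAt(offset)); // re-attach the dropped separator
                offset++;                               // step over the one-character separator
            } else if (!word.isEmpty()) {
                result.add(word); // the last word has no separator after it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String piece : splitKeepingSeparator("one two\nthree")) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

This breaks down as soon as the regex can match more than one character, which is why the lookbehind approach is usually simpler.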


I'm still not sure what you are trying to do, but if \n has a different meaning than " ", you should deal with them separately.

String[] sentences = text.split("\\n");
...
for (String sentence : sentences) {
    ...
    String[] words = sentence.split(" ");
    ...
}