Pattern, matcher in Java, REGEX help

Developer https://www.devze.com 2023-01-10 07:28 Source: Internet

I'm trying to just get rid of duplicate consecutive words from a text file, and someone mentioned that I could do something like this:

Pattern p = Pattern.compile("(\\w+) \\1");
StringBuilder sb = new StringBuilder(1000);
int i = 0;
for (String s : lineOfWords) { // lineOfWords is a List<String> holding each line read from the text file
    Matcher m = p.matcher(s.toUpperCase());
    // and then do something like
    while (m.find()) {
        // do something here
    }
}

I tried looking at m.end() to see if I could build a new string, or remove the item(s) where the matches are, but I wasn't sure how it works after reading the documentation. For example, as a test case to see how it worked, I did:

if (m.find()) {
    System.out.println(s.substring(i, m.end()));
}

I ran it on a text file that contains: This is an example example test test test.

Why is my output This is?

Edit:

Suppose I have an ArrayList lineOfWords that holds each line read from a .txt file, and I create a new ArrayList to hold the modified strings. For example:

List<String> newString = new ArrayList<String>();
for (String s : lineOfWords) {
    s = s.replaceAll( code from Kobi here);
    newString.add(s);
}

But then it doesn't give me the new s, only the original s. Is it because of shallow vs. deep copy?

Try something like:

s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");

That regex is a bit stricter than yours: it matches whole words only (no partial matches) and removes any number of consecutive repetitions.
The regex captures a first word: \b(\w+)\b, and then attempts to match spaces and repetitions of that word: (\s+\1)+. The final \b is to avoid partial matching of \1, as in "for formatting".
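
To see the regex above in action, here is a minimal sketch that applies it to each line and collects the results in a new list. The class name DedupWords and the helper collapse are made up for illustration; only the regex itself comes from the answer. Note that the match is case-sensitive as written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DedupWords {
    // Hypothetical helper: collapses a run of the same whole word into one occurrence.
    static String collapse(String line) {
        return line.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    public static void main(String[] args) {
        List<String> lineOfWords = Arrays.asList(
                "This is an example example test test test.",
                "for formatting");

        // Strings are immutable, so replaceAll returns a new String;
        // the result has to be stored somewhere to be kept.
        List<String> cleaned = new ArrayList<>();
        for (String s : lineOfWords) {
            cleaned.add(collapse(s));
        }

        for (String s : cleaned) {
            System.out.println(s);
        }
    }
}
```

The second input line is there to check the final \b: "for formatting" should come back unchanged, because "for" is only a prefix of "formatting".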

The first match is "ThIS IS an example...", so m.end() points to the end of the second "is". I'm not sure why you use i for the start index; try m.start() instead.
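
A small sketch of what m.start() and m.end() report for the pattern from the question, assuming the same sample line. The class name MatchSpans and the helper firstMatch are invented for this example. The indices come from the upper-cased copy, which lines up with the original string here only because the text is plain ASCII.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpans {
    // Hypothetical helper: describes the first match as "start-end:text", or null if there is none.
    static String firstMatch(String line) {
        Matcher m = Pattern.compile("(\\w+) \\1").matcher(line.toUpperCase());
        if (!m.find()) {
            return null;
        }
        // start() is the index of the first matched character, end() is one past the last.
        return m.start() + "-" + m.end() + ":" + line.substring(m.start(), m.end());
    }

    public static void main(String[] args) {
        String s = "This is an example example test test test.";
        System.out.println(firstMatch(s));      // 2-7:is is
        System.out.println(s.substring(0, 7));  // This is
    }
}
```

Without word boundaries, the first hit is the "is" inside "This" followed by the word "is", so substring(0, m.end()) yields "This is".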

To improve your regex, use \b before and after the word to indicate that there should be word boundaries: (\\b\\w+\\b). Otherwise, as you're seeing, you'll get matches inside of words.
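
Here is a rough comparison of the pattern with and without word boundaries. BoundaryCheck and hasDuplicate are names made up for this sketch. The last pattern, with an extra \b after \1, is a variation along the lines of the other answer's final \b, not something this answer spelled out.

```java
import java.util.regex.Pattern;

class BoundaryCheck {
    // Hypothetical helper: true if the regex finds a match anywhere in the line.
    static boolean hasDuplicate(String regex, String line) {
        return Pattern.compile(regex).matcher(line).find();
    }

    public static void main(String[] args) {
        String plain = "(\\w+) \\1";            // pattern from the question
        String bounded = "(\\b\\w+\\b) \\1";    // boundaries around the captured word
        String strict = "(\\b\\w+\\b) \\1\\b";  // assumed variation: boundary after \1 as well

        System.out.println(hasDuplicate(plain, "this is"));           // true: matches inside "this"
        System.out.println(hasDuplicate(bounded, "this is"));         // false
        System.out.println(hasDuplicate(bounded, "example example")); // true
        System.out.println(hasDuplicate(bounded, "for formatting"));  // true: \1 matches a prefix
        System.out.println(hasDuplicate(strict, "for formatting"));   // false
    }
}
```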