How do you use the Java word boundary with apostrophes?

Developer https://www.devze.com 2023-02-06 17:51 Source: web
I am trying to delete all the occurrences of a word in a list, but I am having trouble when there are apostrophes in the words.

String phrase="bob has a bike and bob's bike is red";
String word="bob";
phrase=phrase.replaceAll("\\b"+word+"\\b","");
System.out.println(phrase);

output:

has a bike and 's bike is red

What I want is

has a bike and bob's bike is red

I have a limited understanding of regex, so I'm guessing there is a solution, but I do not know enough to create a regex that handles apostrophes. I would also like it to work with dashes, so that in the phrase "the new mail is e-mail" only the first occurrence of mail is replaced.


It all depends on what you understand a "word" to be. Perhaps you'd better define what you consider a word delimiter (for example blanks, commas, ...) and write something like

phrase=phrase.replaceAll("([ \\s,.;])" + Pattern.quote(word)+ "([ \\s,.;])","$1$2");

But you'll additionally have to check for occurrences at the start and the end of the string. For example:

  String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
  String word="bob";
  phrase=phrase.replaceAll("([\\s,.;])" + Pattern.quote(word) + "([\\s,.;])","$1$2");
  System.out.println(phrase);

prints this

bob has a bike ,  and boba bob's bike is red and "bob" stuff.
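Note that the leading bob survives in that output, and that two occurrences separated by a single delimiter would compete for it. A minimal sketch of one way around both issues, assuming that "part of a word" means ASCII word characters plus the apostrophe and the hyphen (that set, the class name WordRemover and the method name removeWord are illustrative choices, not part of the answer above): use lookarounds, which check the neighbouring characters without consuming them.

```java
import java.util.regex.Pattern;

class WordRemover {
    // Assumption: a word may contain ASCII word characters, apostrophes and hyphens.
    private static final String WORD_CHAR = "[\\w'-]";

    // Removes every occurrence of word that is not glued to another word character.
    static String removeWord(String phrase, String word) {
        String regex = "(?<!" + WORD_CHAR + ")"   // not preceded by a word character
                + Pattern.quote(word)              // the literal word
                + "(?!" + WORD_CHAR + ")";         // not followed by a word character
        return phrase.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // prints " has a bike and bob's bike is red"
        System.out.println(removeWord("bob has a bike and bob's bike is red", "bob"));
        // prints "the new  is e-mail"
        System.out.println(removeWord("the new mail is e-mail", "mail"));
    }
}
```

Because the lookarounds are zero-width, the start and end of the string need no special handling, and the delimiters stay in place without $1$2.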

Update: If you insist on using \b, and given that Java's "word boundary" understands Unicode, you can also use this dirty trick: replace all occurrences of ' with some Unicode letter that you are sure will not appear in your text, and do the reverse replacement afterwards. Example:

  String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
  String word="bob";
  phrase= phrase.replace("'","ñ").replace('"','ö');
  phrase=phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b","");
  phrase= phrase.replace('ö','"').replace("ñ","'");
  System.out.println(phrase);

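UPDATE: To summarize some comments below: one would expect \w and \b to share the same notion of what a "word character" is, as almost every regular-expression dialect does. Well, Java does not: \w considers ASCII only, while \b considers Unicode. It's an ugly inconsistency, I agree. (As far as I know, JDK 19 and later changed \b to match ASCII only by default, which would also break the trick above, so check your Java version.)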

Update 2: Since Java 7 (as pointed out in the comments), the UNICODE_CHARACTER_CLASS flag lets you request consistent, Unicode-aware behaviour for these classes.
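A small sketch of what the flag changes, assuming Java 7 or later (the class and method names are made up for illustration). The flag makes \w and \b agree on Unicode letters, but it does not turn the apostrophe into a word character, so on its own it does not solve the original bob's problem.

```java
import java.util.regex.Pattern;

class UnicodeFlagDemo {
    // Tests whether s is a single \w character, with or without the Unicode flag.
    static boolean isWordChar(String s, boolean unicode) {
        Pattern p = unicode
                ? Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS)
                : Pattern.compile("\\w");
        return p.matcher(s).matches();
    }

    // Removes word using \b boundaries evaluated with Unicode-aware classes.
    static String removeWordUnicode(String phrase, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.UNICODE_CHARACTER_CLASS);
        return p.matcher(phrase).replaceAll("");
    }

    public static void main(String[] args) {
        String nTilde = "\u00f1"; // n with tilde, written as an escape to avoid encoding issues
        System.out.println(isWordChar(nTilde, false)); // prints false
        System.out.println(isWordChar(nTilde, true));  // prints true
        // The first token is "bob" followed by a letter, so only the second bob is removed.
        System.out.println(removeWordUnicode("bob" + nTilde + " bob", "bob"));
    }
}
```

The same behaviour can be requested inside the pattern itself with the embedded flag (?U).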


\b\S*(bob|mail)\S*\b

Be careful with false positives: this could match more than you want. If you need "prefixes" or "suffixes" of no more than 2 characters (things like "'s" or "e-"), use \S{0,2} instead of \S*.

The regex says:

\b           # a word boundary
\S*          # any number of non-spaces
(            # match group 1 (to enable a choice) 
  bob|mail   #   "bob" or "mail"
)            # end match group 1
\S*          # any number of non-spaces
\b           # a word boundary

So, in Java:

phrase = phrase.replaceAll("\\b\\S*(bob|mail)\\S*\\b", "");
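To make the false-positive warning concrete, here is a rough comparison of the \S* and \S{0,2} variants (the class name, method names and sample sentence are invented for illustration). Both patterns remove the whole token that contains the word, so bob's disappears entirely; the bounded variant merely stops longer words such as bobsled from being swallowed.

```java
class TokenRemovalDemo {
    // Unbounded variant: removes any token that contains "bob" or "mail".
    static String removeGreedy(String phrase) {
        return phrase.replaceAll("\\b\\S*(bob|mail)\\S*\\b", "");
    }

    // Bounded variant: allows at most 2 extra non-space characters on each side.
    static String removeBounded(String phrase) {
        return phrase.replaceAll("\\b\\S{0,2}(bob|mail)\\S{0,2}\\b", "");
    }

    public static void main(String[] args) {
        String phrase = "bobsled and bob's bike";
        // prints " and  bike" (both "bobsled" and "bob's" are removed)
        System.out.println(removeGreedy(phrase));
        // prints "bobsled and  bike" (only "bob's" is removed)
        System.out.println(removeBounded(phrase));
    }
}
```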

Be careful with things like

phrase = phrase.replaceAll("\\b" + word + "\\b", "");

That should be

phrase = phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b", "");

since whenever word contains regex metacharacters, your regex will break (or silently match something else) unless you properly escape the string beforehand using Pattern.quote().
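A short illustration of the difference, using a hypothetical word "a.b" whose dot is a regex metacharacter (the class and method names are only for this sketch):

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Unquoted: the dot in word is interpreted as "any character".
    static String removeUnquoted(String phrase, String word) {
        return phrase.replaceAll("\\b" + word + "\\b", "");
    }

    // Quoted: word is matched literally, character by character.
    static String removeQuoted(String phrase, String word) {
        return phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b", "");
    }

    public static void main(String[] args) {
        String phrase = "axb a.b";
        // prints " " because "axb" also matches the unquoted pattern
        System.out.println(removeUnquoted(phrase, "a.b"));
        // prints "axb " because only the literal "a.b" is removed
        System.out.println(removeQuoted(phrase, "a.b"));
    }
}
```

With a word such as "bob(" the unquoted version would not even compile as a pattern and would throw a PatternSyntaxException.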
