Regex help required

Developer https://www.devze.com 2023-01-19 02:46 Source: Internet
I am trying to replace two or more consecutive occurrences of the <br/> tag (like <br/><br/><br/>) with exactly two (<br/><br/>), using the following pattern:

Pattern brTagPattern = Pattern.compile("(<\\s*br\\s*/\\s*>\\s*){2,}", 
     Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

But there are some cases where the tags are separated by a space ('<br/> <br/>'), and they get replaced with four <br/> tags when they should have been replaced with just two.

What can I do to ignore the two or three (a few) spaces that come between the tags?


Probably not the answer you want to hear, but the general wisdom is that you should not attempt to parse XML/HTML with regular expressions. So many things can go wrong; it is a much better idea to use a parsing library built for such data, which will also completely bypass the issue you're having.

Take a look at JAXB if you are certain your HTML is well-formed XML. If the HTML is likely to be messy and non-compliant (like most real-world HTML), you should try something like TagSoup.
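
The parser-based idea can be sketched as follows, assuming the markup is well-formed XML. This sketch uses the DOM parser that ships with the JDK rather than JAXB or TagSoup, and the class and method names (BrDomCollapser, parseFragment, collapseBrRuns) are hypothetical, chosen only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class BrDomCollapser {
    /** Parses a well-formed fragment by wrapping it in a synthetic root element. */
    static Document parseFragment(String fragment) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
        } catch (Exception e) {
            throw new IllegalArgumentException("fragment is not well-formed XML", e);
        }
    }

    /**
     * Removes every br element beyond the first maxRun in a run of sibling br
     * elements. Whitespace-only text between the tags does not end a run.
     * Returns the number of elements removed.
     */
    static int collapseBrRuns(Node parent, int maxRun) {
        int removed = 0;
        int run = 0;
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before a possible removal
            boolean isBr = child.getNodeType() == Node.ELEMENT_NODE
                    && "br".equalsIgnoreCase(child.getNodeName());
            boolean isBlankText = child.getNodeType() == Node.TEXT_NODE
                    && child.getTextContent().trim().isEmpty();
            if (isBr) {
                run++;
                if (run > maxRun) {
                    parent.removeChild(child);
                    removed++;
                }
            } else if (!isBlankText) {
                run = 0; // any other node ends the current run
                removed += collapseBrRuns(child, maxRun);
            }
            child = next;
        }
        return removed;
    }

    public static void main(String[] args) {
        Document doc = parseFragment("<br/> <br/>\n<br/><br/><b>w</b><br/>");
        int removed = collapseBrRuns(doc.getDocumentElement(), 2);
        int remaining = doc.getElementsByTagName("br").getLength();
        System.out.println(removed + " removed, " + remaining + " remaining");
    }
}
```

Because the parser has already separated tags from text, spacing inside or between the tags stops being a pattern-matching concern. Input that is not well-formed (for example '< br/>') would be rejected by this parser, which is where a lenient parser such as TagSoup would come in.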


Here's some Groovy code to test your Pattern:

import java.util.regex.*

Pattern brTagPattern = Pattern.compile( "(<\\s*br\\s*/\\s*>\\s*){2,}", Pattern.CASE_INSENSITIVE | Pattern.DOTALL )
def testData = [
  ['',                            ''],
  ['<br/>',                       '<br/>'],
  ['< br/> <br />',               '<br/><br/>'],
  ['<br/> <br/><br/>',            '<br/><br/>'],
  ['<br/>   < br/ > <br/>',       '<br/><br/>'],
  ['<br/> <br/>   <br/>',         '<br/><br/>'],
  ['<br/><br/><br/> <br/><br/>',  '<br/><br/>'],
  ['<br/><br/><br/><b>w</b><br/>','<br/><br/><b>w</b><br/>'],
 ]

testData.each { inputStr, expected ->
  Matcher matcher = brTagPattern.matcher( inputStr )
  assert expected == matcher.replaceAll( '<br/><br/>' )
}

And every assertion passes, so the pattern already seems to handle whitespace between the tags.
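
The same check can be sketched in Java, the language of the question. The first method uses the question's pattern unchanged. The second is a guess at what might be going wrong, not something confirmed in the thread: by default Java's \s does not match a non-breaking space (U+00A0), so two pairs of tags separated by one would be treated as two runs and yield four tags. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class BrPatternCheck {
    // The pattern from the question, unchanged.
    static final Pattern BR_RUN = Pattern.compile(
            "(<\\s*br\\s*/\\s*>\\s*){2,}",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumption: the separator may be a non-breaking space (U+00A0),
    // so the character class between tags is widened to include it.
    static final Pattern BR_RUN_NBSP = Pattern.compile(
            "(<\\s*br\\s*/\\s*>[\\s\\u00A0]*){2,}",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String collapse(String html) {
        return BR_RUN.matcher(html).replaceAll("<br/><br/>");
    }

    static String collapseNbspAware(String html) {
        return BR_RUN_NBSP.matcher(html).replaceAll("<br/><br/>");
    }

    public static void main(String[] args) {
        // Ordinary spaces: the whole run collapses to two tags.
        System.out.println(collapse("<br/> <br/>   <br/>"));

        // A non-breaking space splits the input into two separate runs,
        // so the original pattern leaves four tags in place.
        String nbspInput = "<br/><br/>\u00A0<br/><br/>";
        System.out.println(collapse(nbspInput).equals(nbspInput));

        // The widened pattern treats the same input as one run.
        System.out.println(collapseNbspAware(nbspInput));
    }
}
```

If the separator in the real data turns out to be something else (for instance the literal text &nbsp;), the widened character class would need to change accordingly.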


You can do that with a small change to your regex:

Pattern brTagPattern = Pattern.compile("<\\s*br\\s*/\\s*>\\s*<\\s*br\\s*/\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

This will ignore any amount of whitespace between two <br/> tags. If you want to allow exactly two or three whitespace characters between them, you can use:

Pattern brTagPattern = Pattern.compile("<\\s*br\\s*/\\s*>(\\s){2,3}<\\s*br\\s*/\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
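
One thing to watch with the two patterns above is that each matches exactly one pair of tags, so a run of three tags keeps its third tag after replacement. A hedged sketch of a variant that keeps the "two or more" behaviour from the question while allowing at most three whitespace characters between tags is shown below. The bound of three and the names BrBoundedGap, pairOnly and collapse are assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class BrBoundedGap {
    // The pairwise pattern from the answer: matches exactly two tags.
    static final Pattern PAIR = Pattern.compile(
            "<\\s*br\\s*/\\s*>\\s*<\\s*br\\s*/\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical variant: one tag followed by one or more further tags,
    // each preceded by zero to three whitespace characters.
    static final Pattern BOUNDED_RUN = Pattern.compile(
            "<\\s*br\\s*/\\s*>(?:\\s{0,3}<\\s*br\\s*/\\s*>)+",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String pairOnly(String html) {
        return PAIR.matcher(html).replaceAll("<br/><br/>");
    }

    static String collapse(String html) {
        return BOUNDED_RUN.matcher(html).replaceAll("<br/><br/>");
    }

    public static void main(String[] args) {
        // Pairwise pattern: the first two tags are replaced, the third remains.
        System.out.println(pairOnly("<br/><br/><br/>"));

        // Bounded variant: the whole run collapses, and the space before
        // "text" is preserved because trailing whitespace is not consumed.
        System.out.println(collapse("<br/>  <br/> <br/> text"));

        // A gap of four spaces exceeds the bound, so nothing is replaced.
        System.out.println(collapse("<br/>    <br/>"));
    }
}
```

Unlike the pattern in the question, this variant does not consume whitespace after the last tag, which may or may not be what you want.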