Using regex to match HTML

Developer https://www.devze.com 2023-03-08 03:50 Source: Internet

Here's an input HTML string:

<p>Johnny: My favorite color is pink<br />

Sarah: My favorite color is blue<br />

Johnny: Let's swap genders?<br />

Sarah: OK!<br />

</p>

I want to regex-match the bolded part above (the speaker names before each colon). Put simply, find any text between ">" (or the beginning of a line) and ":".

I wrote the regex (?>)[^>](.+): but it doesn't work correctly. It matches the parts below, including the <p> tag, and I don't want to match any HTML tags:

<p>Johnny: My favorite color is pink<br />

Sarah: My favorite color is blue<br />

Johnny: Let's swap genders?<br />

Sarah: OK!<br />

</p>

I am using Java, with code like this:

Matcher m = Pattern.compile("(?>)[^>](.+):", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL).matcher(string);
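
For reference, here is a minimal sketch of why this pattern probably over-matches. In Java, (?>) is an empty atomic group rather than a check for ">", so [^>] happily consumes the "<" of the tag, and with DOTALL the greedy .+ runs across line breaks up to the last colon in the input. The class name OriginalPatternDemo, the helper firstMatch, and the shortened input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OriginalPatternDemo {
    // Returns the first full match of the question's pattern, or null if there is none.
    // (Hypothetical helper, not part of the original post.)
    static String firstMatch(String html) {
        Matcher m = Pattern.compile("(?>)[^>](.+):",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shortened version of the input, with a line break after each <br />.
        String html = "<p>Johnny: My favorite color is pink<br />\n"
            + "Sarah: OK!<br />\n"
            + "</p>";
        // The single match starts at "<p>" and ends at the colon after the last "Sarah".
        System.out.println(firstMatch(html));
    }
}
```

So the whole input up to the final colon comes back as one match, which would explain why the <p> tag ends up included.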


The following code should work:

String str = "<p>Johnny Smith: My favorite color abc: is pink<br />" +
"Sarah: My favorite color is dark: blue<br />" +
"Johnny: Let's swap: genders?<br />" +
"Sarah: OK: sure!<br />" +
"</p>";

Pattern p = Pattern.compile("(?:>|^)([\\w\\s]+)(?=:)", Pattern.MULTILINE);
Matcher m = p.matcher(str); 
while(m.find()){
    System.out.println(m.group(1));
}

OUTPUT

Johnny Smith
Sarah
Johnny
Sarah
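
One caveat: the string above has no line breaks, while the input in the question has one after each <br />. Since \s also matches a newline, the captured group would then likely start with that newline. Below is a minimal sketch of one way to handle this, skipping whitespace before the capture group. It assumes names consist only of word characters and spaces; the class name SpeakerNames and the method speakers are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpeakerNames {
    // After '>' or a line start, skip any whitespace (including a newline),
    // then capture word characters and spaces that are directly followed by ':'.
    private static final Pattern SPEAKER =
        Pattern.compile("(?:>|^)\\s*(\\w[\\w ]*)(?=:)", Pattern.MULTILINE);

    // Hypothetical helper: returns the captured name of every match, in order.
    static List<String> speakers(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = SPEAKER.matcher(html);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // The input from the question, with a line break after each <br />.
        String html = "<p>Johnny: My favorite color is pink<br />\n"
            + "Sarah: My favorite color is blue<br />\n"
            + "Johnny: Let's swap genders?<br />\n"
            + "Sarah: OK!<br />\n"
            + "</p>";
        System.out.println(speakers(html)); // prints [Johnny, Sarah, Johnny, Sarah]
    }
}
```

Keeping the whitespace outside the group means the names come back clean without a separate trim() call.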


If you only need to match a word followed by ':', then "\w+:" should be enough. If you also want to allow for a preceding '>', you can try:

String s = "<p>Johnny: My favorite color is pink<br />" +
    "Sarah: My favorite color is blue<br />" +
    "Johnny: Let's swap genders?<br />" +
    "Sarah: OK!<br />" +
    "</p>";

Pattern p = Pattern.compile("[>]?(\\w+):");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.start() + " : " + m.group(1));
}
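
Note that because the '>' is optional here, this pattern matches any word that is followed by a colon, not only a name at the start of a line. The sketch below compares it with the anchored pattern from the other answer on an input where the messages themselves contain colons. The class name ColonMatchComparison, the helper groupOne, and the variable names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColonMatchComparison {
    // Hypothetical helper: collects capture group 1 of every match of the pattern.
    static List<String> groupOne(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Messages that contain extra colons, as in the other answer's test string.
        String s = "<p>Johnny Smith: My favorite color abc: is pink<br />"
            + "Sarah: My favorite color is dark: blue<br />"
            + "</p>";
        Pattern loose = Pattern.compile("[>]?(\\w+):");
        Pattern anchored = Pattern.compile("(?:>|^)([\\w\\s]+)(?=:)", Pattern.MULTILINE);
        System.out.println(groupOne(loose, s));    // prints [Smith, abc, Sarah, dark]
        System.out.println(groupOne(anchored, s)); // prints [Johnny Smith, Sarah]
    }
}
```

If the messages never contain a colon, the two patterns find the same names and the simpler one is fine.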