Removing HTML code in R using gsub

开发者 https://www.devze.com 2023-03-28 13:35 Source: web

I have a portion of HTML code in R like the one below:

"</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">开发者_开发问答XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA"

I want to use gsub to remove the unwanted HTML code so that the output will be:

XXXX YYYY ZZZZ AAAA

I tried <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> as shown here, but it fails. Why?

How can I do it in R? Thanks.
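
On the "why": two things about that pattern are likely to bite here. It is written with uppercase classes ([A-Z]), so it presumably expects a case-insensitive flag, and it only matches elements that have a closing tag, which <img> never has. Inside an R string the \b and \1 also have to be written as \\b and \\1. The sketch below is only an illustration under those assumptions: html is a shortened, made-up copy of the input, and perl = TRUE is used so that [A-Z] means ASCII uppercase only.

```r
# Shortened copy of the input, for illustration only
html <- "</a> <img src=\"images/arrow_orange.gif\"> <a href=\"group.php?g=1\">XXXX</a>"

# The pattern from the question, with backslashes doubled for an R string
pat <- "<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>"

# Case-sensitive: [A-Z] cannot match the lowercase tag names, so nothing matches
grepl(pat, html, perl = TRUE)
# [1] FALSE

# Case-insensitive: the <a ...>...</a> pair now matches
grepl(pat, html, perl = TRUE, ignore.case = TRUE)
# [1] TRUE

# Even then, <img ...> has no </img> and the leading </a> has no opening tag,
# so both are left in place
partial <- gsub(pat, "\\2", html, perl = TRUE, ignore.case = TRUE)
partial
# [1] "</a> <img src=\"images/arrow_orange.gif\"> XXXX"
```

So even with the flag fixed, this pattern alone cannot produce the desired output for this input.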


I suggest you heed the warnings of @Ramnath and @Iterator and use a parser instead, but here is the best I can do with your string and regex:

(First, add the missing </a> to the end of your input string.)

x <- "</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA</a>"

The code:

x1 <- gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([.^<]*)", "\\3", x)
x1
[1] "</a>  XXXX</a>  YYYY</a>  ZZZ</a> AAAA</a>"

gsub("</a>", "", x1)
[1] "  XXXX  YYYY  ZZZ AAAA"
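
If only the visible text is needed, a single pass that deletes every tag may be enough for a fragment this simple. The sketch below is an alternative to the two gsub calls above, not the method used in this answer. It assumes no attribute value contains a literal > character, and strip_tags is a hypothetical helper name. For anything beyond a small, known fragment, a real HTML parser is still the safer choice.

```r
x <- "</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA</a>"

# Hypothetical helper: remove tags, then tidy the whitespace left behind
strip_tags <- function(s) {
  no_tags  <- gsub("<[^>]+>", "", s)              # drop anything shaped like <...>
  squeezed <- gsub("[[:space:]]+", " ", no_tags)  # collapse runs of whitespace
  trimws(squeezed)                                # trim leading/trailing spaces
}

strip_tags(x)
# [1] "XXXX YYYY ZZZ AAAA"
```

Because opening and closing tags are removed alike, this version gives the same result with or without the extra </a> appended to the input.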