What are smart tags and how do I remove them from HTML?

Developer https://www.devze.com 2023-02-09 08:06 Source: Internet
So I am still working on this parser. Today I found a document with the tag <st1:place w:st="on">. Google tells me it is a Microsoft Office Smart Tag.

I would like to get rid of these things, but I cannot find a list of what they are or how many of them there are.

If they all follow the <...:...> pattern it would be easy to remove with regex.

The document has no doctype and a .jsp extension, but all the content is between two <html> tags, and however non-standard the beast is, I still need to parse it.

OK it is actually not a big issue but it throws off my formatting & bugs me.


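This regexp should do the trick:

/<\/?[[:alnum:]]+:[^>]*>/

It matches any opening or closing tag that starts with < (optionally followed by /), then an alphanumeric namespace prefix, then a ':' colon. [^>]* stops at the first >, so the match ends with the tag itself rather than running on to a later >. The [[:alnum:]] class is POSIX syntax; in flavors without POSIX classes, write [A-Za-z0-9] instead.

Alternatively:

/<\s*\/?\s*[[:alnum:]]+:[^>]*>/

would allow looser formatting of the tag (whitespace between the opening < and the namespace prefix).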
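
For anyone doing this from Python, here is a minimal sketch of the same idea. It assumes Python's re module, which has no POSIX classes such as [:alnum:], so the prefix is spelled [A-Za-z0-9]. The names PREFIXED_TAG and strip_prefixed_tags and the sample string are made up for illustration. It removes the tags only and keeps the text between them.

```python
import re

# Matches opening, closing, and self-closing tags whose name carries a
# namespace prefix, e.g. <st1:place w:st="on">, </st1:place>, <o:p/>.
# [^>]* stops at the first '>', so a match never runs past the tag's end.
PREFIXED_TAG = re.compile(r"<\s*/?\s*[A-Za-z0-9]+:[^>]*>")


def strip_prefixed_tags(html: str) -> str:
    """Remove namespace-prefixed tags but keep the text between them."""
    return PREFIXED_TAG.sub("", html)


# Illustrative input, not taken from the original document.
sample = '<p>Visit <st1:place w:st="on">Paris</st1:place> soon.</p>'
print(strip_prefixed_tags(sample))  # <p>Visit Paris soon.</p>
```

Ordinary tags such as <p> or <a href="http://..."> are left alone, because the colon has to come straight after the tag name. An attribute value that itself contains a > would cut the match short, so treat this as a cleanup pass rather than a real parser.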


We wanted to remove the <w:smartTag> tags, and the pattern below did the job for us.

/<w:smartTag[^>]*>/
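
Note that this pattern matches the opening tag only, so the closing </w:smartTag> stays in the document. Below is a hedged Python sketch of a variant that also removes the closing tag. The \b guard is an addition of mine so that longer element names that merely begin with smartTag are not caught. The names SMART_TAG and remove_smart_tags and the sample markup are hypothetical.

```python
import re

# '/?' also catches the closing tag </w:smartTag>.
# '\b' requires the name to end right after "smartTag", so a longer name
# such as <w:smartTagPr> is left untouched.
SMART_TAG = re.compile(r"</?w:smartTag\b[^>]*>")


def remove_smart_tags(text: str) -> str:
    """Drop <w:smartTag ...> and </w:smartTag>, keeping what they wrap."""
    return SMART_TAG.sub("", text)


# Illustrative input, not taken from a real document.
doc = '<w:smartTag w:element="place"><w:t>Boston</w:t></w:smartTag>'
print(remove_smart_tags(doc))  # <w:t>Boston</w:t>
```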