So I am still working on this parser. Today I found a document with the tag <st1:place w:st="on">
Google tells me it is a Microsoft Office Smart Tag.
I would like to get rid of these things, but I cannot find a list of what they are or how many of them there are.
If they all follow the <...:...> pattern it would be easy to remove with regex.
The document has no doctype and a .jsp extension, but all the content is between two <html> tags, and however non-standard the beast is, I still need to parse it.
OK it is actually not a big issue but it throws off my formatting & bugs me.
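This regexp should do the trick (in an engine that supports POSIX classes, such as PCRE or Ruby):
/<\/?[[:alnum:]]+:[^>]*>/
It triggers on any tag that opens with <, optionally followed by / for a closing tag, then an alphanumeric prefix, then a ':' colon. The [^>]* part stops the match at the first >, so only the tag itself is removed and not everything up to the last > in the document.
Alternatively:
/<\s*\/?\s*[[:alnum:]]+:[^>]*>/
This allows looser formatting of the tag (whitespace between the opening < and the namespace prefix).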
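For anyone doing this from Python, here is a minimal sketch of the same idea. It assumes Python's re module, which has no POSIX classes, so [A-Za-z0-9] stands in for the alphanumeric class. The names NAMESPACED_TAG and strip_namespaced_tags are made up for this example. It uses [^>]* so each match ends at the first >, and it will misfire on attribute values that contain a literal >.

```python
import re

# Matches opening or closing tags that carry a namespace prefix,
# e.g. <st1:place w:st="on"> or </st1:place>.
# [A-Za-z0-9] replaces the POSIX alnum class, which Python's re lacks.
NAMESPACED_TAG = re.compile(r"<\s*/?\s*[A-Za-z0-9]+:[^>]*>")


def strip_namespaced_tags(html: str) -> str:
    """Remove namespaced tags but keep the text they wrap."""
    return NAMESPACED_TAG.sub("", html)


sample = 'Visit <st1:place w:st="on">Paris</st1:place> in <b>spring</b>.'
print(strip_namespaced_tags(sample))
```

Ordinary tags such as <b> are left alone because they have no prefix followed by a colon right after the <.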
We wanted to remove the <w:smartTag> tags, and the pattern below helped us.
/<w:smartTag[^>]*>/
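Note that this pattern matches only the opening tags. If you want to target one specific tag and drop its closing tags too, a rough Python sketch could look like the following. The helper name strip_tag and its qualified_name parameter are hypothetical, not part of any library.

```python
import re


def strip_tag(html: str, qualified_name: str) -> str:
    """Remove opening and closing tags with exactly this qualified name."""
    # re.escape guards against regex metacharacters in the name;
    # \b keeps w:smartTag from also matching w:smartTagType.
    pattern = re.compile(r"</?" + re.escape(qualified_name) + r"\b[^>]*>")
    return pattern.sub("", html)


doc = '<w:smartTag w:element="place">Paris</w:smartTag> <w:smartTagType w:name="place"/>'
print(strip_tag(doc, "w:smartTag"))
```

The text inside the tags is kept; only the markup is removed.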