I'd like to hear if anyone can help me strip some HTML markup from a large XML file.
The XML file uses my own schema, and that part is fine. But I need to remove the <span>, <style>, and <div> tags, as well as the attributes on <p> tags.
For example, I need to keep all <ul>, <ol>, <li>, <strong>, <a>, <img> and other tags, but remove <div> (with attributes), <span> (with attributes), and the attributes on <p> tags.
I have tried many examples from this site and many other sites, but most of them didn't work.
Quoting from an answer I posted yesterday:
I've heard some very good things about Beautiful Soup, HTML Purifier, and the HTML Agility Pack, which use Python, PHP, and .NET, respectively. Trust me--save yourself some pain and use those instead.
I strongly advise you not to use regex for this. No sane regex is going to work, or probably even come close to working. However, a decent XML parser can do this fairly easily. I'm not sure what programming languages you have access to, but if you can use PHP, .NET or another programming language, you can use the above parsers to find each span, style, div, and p and remove either the attributes or the entire tag.
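To make the parser approach concrete, here is a minimal sketch in Python. It uses the standard library's xml.etree.ElementTree rather than any of the libraries named above, simply because it needs no installation. The names strip_markup, _clean and _append_text are hypothetical helpers of mine. The sketch assumes the tags are not namespaced, that the root element is not itself a span, div or style, and that the file fits in memory. It also makes two choices the question leaves open: span and div are unwrapped (their contents are kept), while style is dropped together with its contents.

```python
import xml.etree.ElementTree as ET

UNWRAP = {"span", "div"}  # remove the tag, keep what is inside it
DROP = {"style"}          # remove the tag and everything inside it
STRIP_ATTRS = {"p"}       # keep the tag, remove its attributes


def _append_text(parent, index, text):
    """Attach text just before child position `index` of `parent`."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def _clean(elem):
    """Clean all descendants of `elem` in place."""
    i = 0
    while i < len(elem):
        child = elem[i]
        if child.tag in DROP:
            # Keep the text that followed the dropped element.
            _append_text(elem, i, child.tail)
            del elem[i]
            continue
        if child.tag in UNWRAP:
            text, tail = child.text, child.tail
            grandchildren = list(child)
            del elem[i]
            # Splice text, children and tail into the parent, in order.
            _append_text(elem, i, text)
            for offset, grandchild in enumerate(grandchildren):
                elem.insert(i + offset, grandchild)
            _append_text(elem, i + len(grandchildren), tail)
            continue  # position i now holds a spliced-in node; re-check it
        if child.tag in STRIP_ATTRS:
            child.attrib.clear()
        _clean(child)
        i += 1


def strip_markup(xml_text):
    """Return `xml_text` with the unwanted markup removed."""
    root = ET.fromstring(xml_text)
    _clean(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<doc><div class="a"><p style="x">Hi <span id="s">there</span>!</p>'
        "<style>p{}</style><ul><li>one</li></ul></div></doc>"
    )
    print(strip_markup(sample))
```

For a file on disk, ET.parse(path) followed by _clean(tree.getroot()) and tree.write(out_path) should work the same way. If you would rather delete span and div along with their contents, move them from UNWRAP to DROP.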
jQuery has some good functionality for DOM-manipulation like you're describing, and you can use it to generate HTML which you then cut and paste.
If you absolutely must use regex, you could try this:
- Pattern:
<\s*/?\s*(span|style|div)\b[^>]*?>
- Replacement: (nothing)
- Pattern:
<\s*p\b[^>]*?>
- Replacement:
<p>
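If you do go the regex route, the two substitutions above could be applied like this in Python. This is only a sketch: the function name regex_strip is hypothetical, and the patterns are copied as written, so they are case-sensitive. Note that the first pattern removes only the tags themselves, so whatever sat inside a style element stays behind as plain text.

```python
import re

# Opening and closing span/style/div tags, with any attributes.
TAGS = re.compile(r"<\s*/?\s*(span|style|div)\b[^>]*?>")
# Opening p tags, with any attributes.
P_OPEN = re.compile(r"<\s*p\b[^>]*?>")


def regex_strip(text):
    """Apply both substitutions to `text` and return the result."""
    text = TAGS.sub("", text)        # replacement: nothing
    return P_OPEN.sub("<p>", text)   # replacement: a bare <p>


if __name__ == "__main__":
    print(regex_strip('<div class="a"><p style="x">Hi <span>there</span>!</p></div>'))
    # The CSS inside <style> is left behind as text:
    print(regex_strip("<style>p{}</style>"))
```

A self-closing p tag with attributes would also be rewritten to an opening <p> by the second pattern, which is one more reason to prefer a parser.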