Match "without this"

开发者 https://www.devze.com 2023-04-10 21:47 Source: web
I need to remove every <p></p> pair that is the only <p> in its <td>.

How can this be done?

import re
text = """
    <td><p>111</p></td>
    <td><p>111</p><p>222</p></td>
    """
text = re.sub(r'<td><p>(??no</p>inside??)</p></td>', r'<td>\1</td>', text)

How can I match "without </p> inside"?


I would use minidom. I took the following snippet from here; you should be able to adapt it to your case:

from xml.dom import minidom

# myXmlFile is the path to a well-formed XML file
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttrName1', 'AttrName2']:
        parentNode = element.parentNode
        # Leave a commented-out copy of the element in place, then drop it
        parentNode.insertBefore(doc.createComment(element.toxml()), element)
        parentNode.removeChild(element)
# Write the modified document back over the original file
with open(myXmlFile, "w") as f:
    f.write(doc.toxml())

Thanks @Ivo Bosticky
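
To connect the snippet above to the question, here is a minimal sketch of how the same minidom calls might unwrap a <p> that is the only element inside its <td>. The function name unwrap_single_p is hypothetical, and the sketch assumes the markup is well-formed XML with a single root element, which real-world HTML often is not.

```python
from xml.dom import minidom


def unwrap_single_p(markup):
    """Replace <td><p>X</p></td> with <td>X</td> when <p> is the td's only element.

    Hypothetical helper: assumes `markup` is well-formed XML with one root.
    """
    doc = minidom.parseString(markup)
    for td in doc.getElementsByTagName("td"):
        # Look only at element children; text nodes are ignored here
        elements = [n for n in td.childNodes if n.nodeType == n.ELEMENT_NODE]
        if len(elements) == 1 and elements[0].tagName == "p":
            p = elements[0]
            # Move every child of <p> up into <td>, just before <p>
            while p.firstChild:
                td.insertBefore(p.firstChild, p)
            td.removeChild(p)
    return doc.documentElement.toxml()


row = "<tr><td><p>111</p></td><td><p>111</p><p>222</p></td></tr>"
print(unwrap_single_p(row))
```

The second cell is left alone because it holds two <p> elements. A cell that mixes bare text with a single <p> would also be unwrapped under this rule, so tighten the check if that is not what you want.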


While using regexps with HTML is bad, matching a string that does not contain a given pattern is an interesting question in itself.

Let's assume that we want to match a string beginning with an a and ending with a z, and capture whatever is in between only when the string bar does not occur inside.

Here's my take: "a((?:(?<!ba)r|[^r])+)z"

It basically says: find an a, then find either an r that is not preceded by ba or any character other than r (repeated at least once), then find a z. So a bar cannot sneak into the capture group.

Note that this approach uses a 'negative lookbehind' pattern, and in Python's re module it only works with lookbehind patterns of fixed length (like ba).
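
As a rough illustration, the pattern above can be tried on a few short strings. The second pattern is not from this answer: it is an assumed variant that swaps the lookbehind for a negative lookahead to approximate the original <td>/<p> question. It only handles bare <p> tags with no attributes and no whitespace between tags. The helper names between and unwrap_lone_p are hypothetical.

```python
import re

# Pattern from the answer: "bar" can never end up inside the group.
no_bar = re.compile(r"a((?:(?<!ba)r|[^r])+)z")


def between(s):
    """Return the captured middle of s, or None when nothing matches."""
    m = no_bar.search(s)
    return m.group(1) if m else None


# Assumed variant for the question: the negative lookahead refuses to step
# over any <p> or </p>, so only a lone <p>...</p> inside a <td> can match.
lone_p = re.compile(r"<td><p>((?:(?!</?p>).)*)</p></td>")


def unwrap_lone_p(text):
    """Strip the <p> wrapper from cells that contain exactly one <p>."""
    return lone_p.sub(r"<td>\1</td>", text)


print(between("afooz"))   # captures "foo"
print(between("acarz"))   # captures "car": this r is not preceded by "ba"
print(between("abarz"))   # no match, because "bar" sits in the middle
print(unwrap_lone_p("<td><p>111</p></td>"))
print(unwrap_lone_p("<td><p>111</p><p>222</p></td>"))
```

The lookahead form has no fixed-length restriction, which is why it is used for the tag case. It is still a regex applied to HTML, so a parser remains the safer choice for anything beyond simple, regular markup.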


I would definitely recommend using BeautifulSoup for this. It's a Python HTML/XML parser.

http://www.crummy.com/software/BeautifulSoup/


I'm not quite sure why you want to remove the <p> tags that don't have closing tags. However, if this is an attempt to clean up the markup, an advantage of BeautifulSoup is that it can clean HTML for you:

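# Requires the beautifulsoup4 package (bs4), the successor to the old BeautifulSoup 3 module
from bs4 import BeautifulSoup

html = """
<td><p>111</td>
<td><p>111<p>222</p></td>
"""
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
print(soup.prettify())  # every opened <p> comes out closed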

This doesn't remove the unmatched tags, but it does add the missing closing tags.
