I have HTML source code from which I want to filter out one or more links while keeping the others.
I have set up my filter with "*" as the wildcard:
<a*>Link1</a>, <a*>Link2</a>, or <a*>Link3</a>
<a*>A bad link*</a>
some text* <a*>update*</a>
other text right before link <a*>click here</a>
I would like to filter every instance of these links out of the HTML source code using Python. I'm OK with loading the list of filters into an array; what I need help with is the filtering itself. Each line break marks a separate filter, and I only want to remove the link(s), not the surrounding text.
I am still very new to Python and regex/BeautifulSoup. Even a pointer in the right direction would be greatly appreciated.
To remove <a> tags and keep only the text not contained within those tags:
>>> from BeautifulSoup import BeautifulSoup as bs
>>> markup = """<a*>Link1</a> <a*>Link2</a> or <a*>Link3</a>
... <a*>A bad link*</a>
... some text* <a*>update*</a>
... other text right before link <a*>click here</a>"""
>>> soup = bs(markup)
>>> TAGS_TO_EXTRACT = ('a',)
>>> for tag in soup.findAll():
...     if tag.name in TAGS_TO_EXTRACT:
...         tag.extract()
...
>>> soup
or
some text*
other text right before link
It's not clear to me whether you want to keep the text within the tags. If you do, replace each tag with its own text instead:
>>> for tag in soup.findAll():
...     if tag.name in TAGS_TO_EXTRACT:
...         tag.replaceWith(tag.text)
...
>>> soup
Link1 Link2 or Link3
A bad link*
some text* update*
other text right before link click here
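If BeautifulSoup isn't installed, the same two behaviours can be roughly sketched with the standard library's html.parser instead. This is a swapped-in technique, not the BeautifulSoup approach above: LinkStripper and strip_links are hypothetical names, the sketch assumes real <a href="..."> tags rather than the <a*> placeholders, and it silently drops comments and doctype declarations.

```python
from html.parser import HTMLParser


class LinkStripper(HTMLParser):
    """Re-emit a document without its <a> tags (hypothetical helper)."""

    def __init__(self, keep_text=True):
        super().__init__(convert_charrefs=False)
        self.keep_text = keep_text  # False: drop the link text as well
        self.depth = 0              # number of <a> elements we are inside
        self.parts = []

    def _emit(self, text):
        # Outside a link everything is kept; inside, only if keep_text is set.
        if self.keep_text or self.depth == 0:
            self.parts.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag != "a":
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a":
            self.depth = max(0, self.depth - 1)
        else:
            self._emit("</%s>" % tag)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)


def strip_links(html, keep_text=True):
    parser = LinkStripper(keep_text=keep_text)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = 'some text <a href="/u">update</a> and <b>bold</b>'
print(strip_links(html))                   # some text update and <b>bold</b>
print(strip_links(html, keep_text=False))  # link text removed too
```

With keep_text=True this behaves like the replaceWith version; with keep_text=False it behaves like the extract version.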
Parsing the document only to reassemble it with one piece of information discarded would require a lot of unneeded code.
So I think this is a better job for regular expressions. Python's re.sub accepts a callback function in place of the replacement string, which lets you customize the substitution. In this case, it is a simple matter of writing a regexp that matches the opening tag of the "bad link", the text in between, and the closing tag, and keeps only the text in between.
import re
markup = """<a*>Link1</a>, <a*>Link2</a>, or <a*>Link3</a>
<a*>A bad link*</a>
some text* <a*>update*</a>
other text right before link <a*>click here</a>"""
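# Match an opening <a ...> tag (but not <abbr> etc.), capture the text, match </a>; keep only the text.
filtered = re.sub(r"<a\b[^>]*>(.*?)</a\s*>", lambda match: match.group(1), markup, flags=re.DOTALL)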
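The substitution above strips every link. To remove only the links named in the filter list, one possible sketch is to turn each filter line into its own regular expression. The helper names (compile_filter, remove_links) are hypothetical, and two assumptions are mine rather than the question's: "*" never spans a tag boundary, and the text around a link in a filter is context to keep while the link itself, including its text, is removed.

```python
import re

LINK_IN_FILTER = re.compile(r"(<a\*>.*?</a>)")  # a link as written in a filter line
WILDCARD = r"[^<>]*"  # assumption: "*" matches anything except "<" and ">"


def _to_regex(fragment):
    # Escape the literal text, then turn every "*" into the wildcard pattern.
    pattern = WILDCARD.join(re.escape(piece) for piece in fragment.split("*"))
    # Make "<a*>" match only <a ...> tags, not <abbr> and the like.
    return pattern.replace("<a" + WILDCARD + ">", r"<a\b" + WILDCARD + ">")


def compile_filter(line):
    # Splitting on a capturing pattern puts links at odd indexes, context at even ones.
    parts = LINK_IN_FILTER.split(line)
    pattern = "".join(
        _to_regex(part) if i % 2 else "(" + _to_regex(part) + ")"
        for i, part in enumerate(parts)
    )
    return re.compile(pattern)


def remove_links(html, filter_lines):
    for line in filter_lines:
        line = line.strip()
        if line:
            # Only the context is captured, so joining the groups drops the links.
            html = compile_filter(line).sub(lambda m: "".join(m.groups()), html)
    return html


filters = ["<a*>A bad link*</a>", "some text* <a*>update*</a>"]
html = ('intro <a href="/x">A bad link (old)</a> and <a href="/y">a good link</a>; '
        'some text here <a href="/u">update now</a>')
print(remove_links(html, filters))
```

Here the link to /y survives because no filter matches it. To keep the link text and drop only the tags, the link portion would need a capture group of its own, as in the re.sub call above.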