Python filter list to remove certain links from html source code

Developer (开发者) https://www.devze.com 2023-01-31 07:46 Source: web
I have HTML source code from which I want to filter out one or more links while keeping the others.

I have set up my filter with "*" as the wildcard:

<a*>Link1</a>, <a*>Link2</a>, or <a*>Link3</a>
<a*>A bad link*</a>
some text* <a*>update*</a>
other text right before link <a*>click here</a>

I would like to filter out every instance of these links from the HTML source code using Python. I'm OK with loading the list into an array, but I need some help with the filter. Each line break signifies a separate filter, and I only want to remove the link(s), not the text.

I am still very new to Python and regex/BeautifulSoup. Even if you could point me in the right direction, it would be greatly appreciated.


To remove <a> tags and keep only the text not contained within those tags:

>>> from BeautifulSoup import BeautifulSoup as bs
>>> markup = """<a*>Link1</a> <a*>Link2</a> or <a*>Link3</a>
... <a*>A bad link*</a>
... some text* <a*>update*</a>
... other text right before link <a*>click here</a>"""
>>> soup = bs(markup)
>>> TAGS_TO_EXTRACT = ('a',)
>>> for tag in soup.findAll():
...   if tag.name in TAGS_TO_EXTRACT:
...     tag.extract()
...
>>> soup
  or

some text*
other text right before link

It's not clear to me if you want the text within the tags or not. If you want the text contained within the tags do something like this instead:

>>> for tag in soup.findAll():
...   if tag.name in TAGS_TO_EXTRACT:
...     tag.replaceWith(tag.text)
...
>>> soup
Link1 Link2 or Link3
A bad link*
some text* update*
other text right before link click here
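
The session above relies on the BeautifulSoup package. If installing a parser library is not an option, roughly the same parse-and-rebuild idea can be sketched with the standard library's html.parser instead (a swapped-in technique, not what this answer uses). The names AnchorStripper and strip_links are made up for this example. The sketch assumes ordinary markup such as <a href="...">, and it simply drops comments and doctype declarations.

```python
from html.parser import HTMLParser


class AnchorStripper(HTMLParser):
    """Rebuild the input while dropping <a> tags (hypothetical helper)."""

    def __init__(self, keep_text=True):
        # convert_charrefs=False leaves entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.keep_text = keep_text
        self.depth = 0    # greater than 0 while inside an <a> element
        self.parts = []   # output fragments, joined at the end

    def _emit(self, text):
        # Content inside a link is kept only when keep_text is True.
        if self.depth == 0 or self.keep_text:
            self.parts.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing <a/> is dropped.
        if tag != "a":
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a":
            self.depth = max(0, self.depth - 1)
        else:
            self._emit("</%s>" % tag)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)


def strip_links(html, keep_text=True):
    """Return html without <a> tags; keep_text=False also drops link text."""
    parser = AnchorStripper(keep_text)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<p>see <a href="/x">click <b>here</b></a> &amp; go</p>'
print(strip_links(sample))                   # link text kept
print(strip_links(sample, keep_text=False))  # link text removed too
```

The keep_text flag mirrors the two variants shown above: True behaves like replaceWith(tag.text), False like extract().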


Parsing it for the sole purpose of reassembling the whole document, discarding just a part of the information, would yield a lot of unneeded code.

So I think this is a better job for regular expressions. Python's re.sub accepts a callback function in place of the replacement string, which lets you customize each substitution. In this case, it is a simple matter of writing a regexp that matches the opening tag of the "bad link", the text in between, and the closing link markup, and keeps only the text in between.

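import re

markup = """<a*>Link1</a>, <a*>Link2</a>, or <a*>Link3</a>
<a*>A bad link*</a>
some text* <a*>update*</a>
other text right before link <a*>click here</a>"""

# Groups: (1) opening <a ...> tag, (2) link text, (3) closing </a> tag.
# The callback returns only group 2, so the tags go and the text stays.
# \b stops the pattern from also matching tags like <abbr>.
filtered = re.sub(r"(<a\b.*?>)(.*?)(</a\s*>)", lambda match: match.group(2), markup)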
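
The snippet above strips every link. To apply the per-line filters from the question, one possible sketch is to turn each filter line into its own regexp and strip the <a> tags only inside the regions that filter matches. The helper names filter_to_regex and apply_filters are hypothetical, and the sketch assumes that "*" stands for any run of characters that does not cross a tag boundary (no "<" or ">"), which is one reading of the question rather than something it states.

```python
import re

# One anchor element; group 1 is the link text.
ANCHOR = re.compile(r"<a\b[^>]*>(.*?)</a\s*>", re.IGNORECASE | re.DOTALL)


def filter_to_regex(filter_line):
    """Compile one filter line, treating '*' as a wildcard (assumed meaning:
    any characters except '<' and '>')."""
    pieces = [re.escape(part) for part in filter_line.split("*")]
    return re.compile("[^<>]*?".join(pieces))


def apply_filters(html, filter_lines, keep_text=True):
    """Strip <a> tags from every region of html matched by a filter line."""
    replacement = r"\1" if keep_text else ""
    for line in filter_lines:
        line = line.strip()
        if not line:
            continue  # blank lines separate filters and carry no pattern
        pattern = filter_to_regex(line)
        # The callback rewrites only the matched region, so links that no
        # filter matches are left alone.
        html = pattern.sub(
            lambda m: ANCHOR.sub(replacement, m.group(0)), html
        )
    return html


filters = """<a*>A bad link*</a>
some text* <a*>update*</a>""".splitlines()

page = ('<p><a href="/ok">Good</a> and <a href="/x">A bad link here</a></p>\n'
        'some text here <a href="/u">update now</a> end')
print(apply_filters(page, filters))
```

With keep_text=True the matched links are reduced to their text, and with keep_text=False the link text is removed as well. The filter list could just as well be read from a file with one filter per line.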