Python: How do you use re to ignore links in parentheses?

Source: 开发者, https://www.devze.com, 2023-04-04 19:58 (reposted from the web)
The relevant part of the code is:

import re
reargs = r'<a\s+href=[\'"](.*?)[\'"].*?>'  # raw string; no stray | inside the quote class
link = re.search(reargs, content, flags=re.IGNORECASE)

I'm building a crawler, and the web pages I'm working with have links in parentheses that I don't want. The text looks like this:

Foo foo foo foo (link) foo foo foo foo link foo foo foo foo (foo link foo) foo foo link foo foo link ... and so on


If there can be multiple sets of nested parentheses like "((foo) link)", I don't think this is possible with regular expressions. In particular, note that parentheses can be used inside URLs (such as on Wikipedia), so there may still be nested parentheses even if the text itself doesn't contain any. So, in the general case, I don't think this can be done with a regex.
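
For the nested case, one non-regex option is to scan the text and keep a running count of open parentheses. The sketch below is only an illustration under assumptions: the function name links_at_depth_zero is made up, the link pattern is a lightly adjusted copy of the one in the question, and parentheses inside a tag (for example in a URL) are skipped by jumping to the end of each matched tag.

```python
import re

# Link pattern adapted from the question (an assumption about the markup).
LINK = re.compile(r'''<a\s+href=['"](.*?)['"].*?>''', re.IGNORECASE)


def links_at_depth_zero(content):
    """Return hrefs of <a> tags that start outside every pair of parentheses."""
    hrefs = []
    depth = 0  # number of currently open parentheses in the text
    pos = 0
    while pos < len(content):
        ch = content[pos]
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(depth - 1, 0)  # tolerate a stray closing parenthesis
        elif ch == '<':
            m = LINK.match(content, pos)
            if m:
                if depth == 0:
                    hrefs.append(m.group(1))
                # Jump past the tag so parentheses in the URL are not counted.
                pos = m.end()
                continue
        pos += 1
    return hrefs


sample = ('((foo) <a href="http://example.com/skip">link</a>) '
          '<a href="http://en.wikipedia.org/wiki/Foo_(bar)">link</a>')
print(links_at_depth_zero(sample))
```

Here the first link sits inside nested parentheses and is dropped, while the second is kept even though its URL contains parentheses.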

To solve it with a regex anyway, I will assume parentheses are nested at most one level deep, and that no URLs contain parentheses.

The regex you're looking for is something like the following:

(\([^\)]*\)|[^\(<])*_link_

Where _link_ is a regular expression matching a link (which you describe in the problem statement, though it might need some tweaking). To summarize the first part of my regex: it matches zero or more of either a parenthetical statement or a character that is neither an opening parenthesis nor the start of a tag (<). Now use the captured groups (link.group(2) in your example, because the leading group becomes group 1) to find your URL.
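
A minimal sketch of how this could look in Python, under the same two assumptions (one level of parentheses, no parentheses in URLs). The function names and sample text are made up for illustration. Note that the pattern has to be anchored at the start with re.match: with re.search the prefix could match zero times and the link inside a parenthetical would be found anyway. The second function is a variation that is not part of this answer: it matches either a parenthetical or a link and keeps only the matches where the link group took part, which makes it easier to collect every link.

```python
import re

# Link pattern adapted from the question (an assumption about the markup).
LINK = r'''<a\s+href=['"](.*?)['"].*?>'''

# The regex from this answer: skip parentheticals and ordinary characters,
# then require a link. Group 1 is the skipped prefix, group 2 is the URL.
FIRST_LINK = re.compile(r'(\([^\)]*\)|[^\(<])*' + LINK, re.IGNORECASE)

# Variation (not from the answer): match a parenthetical OR a link.
# Group 1 is set only when the link alternative matched.
SKIP_OR_LINK = re.compile(r'\([^)]*\)|' + LINK, re.IGNORECASE)


def first_link_outside_parens(content):
    """First URL outside parentheses, or None.

    The prefix cannot step over other tags, so this only works when no
    other markup precedes the first wanted link.
    """
    m = FIRST_LINK.match(content)  # match, not search: anchor at the start
    return m.group(2) if m else None


def all_links_outside_parens(content):
    """Every URL whose <a> tag is not inside a one-level parenthetical."""
    return [m.group(1) for m in SKIP_OR_LINK.finditer(content)
            if m.group(1) is not None]


sample = (
    'Foo (<a href="http://example.com/skip1">link</a>) foo '
    '<a href="http://example.com/keep1">link</a> foo '
    '(foo <a href="http://example.com/skip2">link</a> foo) foo '
    '<a href="http://example.com/keep2">link</a>'
)
print(first_link_outside_parens(sample))
print(all_links_outside_parens(sample))
```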


In general, parsing HTML with regex is a bad idea. But because you asked, and the general question has merit (how to ignore cases where your match is surrounded by parentheses), I'll tell you what I think.

Now, because I don't know what your page looks like, I'll just say that, in general, you can exclude matches by adding [^x], where x is the character you don't want. The brackets define a character class that matches any single character from a set, and the leading ^ negates the set, so the class matches any character except the ones that follow.

So you can exclude parentheses by surrounding your match string with [^(]foo[^)]. If there are other characters between the parentheses, you'll have to account for that separately.
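
A small illustration of that idea, using the literal word "link" as a stand-in for the real link pattern (the sample text is made up). The negated classes each consume one neighbouring character, so a match at the very start or end of the string is missed; the lookaround form shown next to it is a variation that only inspects the neighbours. Both still accept "(foo link foo)", which is the case that has to be handled separately.

```python
import re

text = "foo (link) foo link foo (foo link foo) link"

# Negated character classes: one extra character is consumed on each side,
# so the final "link" (nothing after it) is not found.
consuming = re.findall(r'[^(]link[^)]', text)

# Lookarounds: the neighbours are checked but not consumed.
lookaround = [m.start() for m in re.finditer(r'(?<!\()link(?!\))', text)]

print(consuming)   # [' link ', ' link ']
print(lookaround)  # [15, 29, 39]
```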


With lxml you could do something like this:

import lxml.html
import re

tree = lxml.html.parse("http://pastehtml.com/view/b7604in99.html")
links = tree.xpath("//a")

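for link in links:
    # link.text is None when the <a> element has no leading text
    text = (link.text or "").strip()
    # True for links whose own text is wrapped in parentheses, e.g. "(link)";
    # negate the test to keep only the links that are not.
    if re.match(r'^\(.*\)$', text):
        print(link.get('href'))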