Optimizing a Python link-matching regular expression

Developer https://www.devze.com 2023-01-01 22:47 Source: Internet
I have a regular expression:

links = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I).findall(data)

I use it to find links in some HTML, but it takes a long time on certain pages. Any optimization advice?

One page that it chokes on is http://freeyourmindonline.net/Blog/


Is there any reason you aren't using an HTML parser? With something like BeautifulSoup, you can get all the links without an ugly regex like that.


I'd suggest using BeautifulSoup for this task.
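
To illustrate the parser-based approach, here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup, so that it runs without extra installs. The names LinkCollector and extract_links are made up for this example. With BeautifulSoup itself, the equivalent would be roughly BeautifulSoup(data, 'html.parser').find_all('a', href=True).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase,
        # so <A HREF=...> is handled the same way as <a href=...>.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Calling extract_links(data) on the page source returns a plain list of href strings. Unlike the original regex, it does not capture the link text or the other attributes, so it is only a starting point if those are needed.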


How about handling the href attributes more directly?

re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)

That takes about 0.007 seconds, compared with your findall, which takes 38.694 seconds on my computer.
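
As a rough usage sketch for the pattern above: it has several capture groups, so findall would return tuples, and finditer with group(1) is a simpler way to get just the href value. The helper name find_hrefs and the sample string are assumptions made for illustration. Note that group(1) still includes the surrounding quotes when the attribute is quoted, so the sketch strips them.

```python
import re

# The pattern from the answer above, unchanged.
re_href = re.compile(
    r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""",
    re.I,
)


def find_hrefs(html_text):
    """Return the href value of each opening <a ...> tag matched by re_href."""
    # group(1) is the raw attribute value, quotes included when present,
    # so strip any leading or trailing single or double quotes.
    return [m.group(1).strip("\"'") for m in re_href.finditer(html_text)]


# Hypothetical sample covering double-quoted, single-quoted and unquoted hrefs.
sample = (
    '<a class="x" href="http://example.com/a">A</a> '
    "<a href='/b'>B</a> "
    "<a href=/c>C</a>"
)
print(find_hrefs(sample))
```

Because every part of the pattern is bounded by [^>] or by a quote character, a failed match cannot scan far beyond the current tag, which is a plausible reason it runs so much faster than the (.+?) and (.*?) groups in the original expression.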
