Python - Parsing a string for URLs and extracting them

Developer https://www.devze.com 2023-02-18 22:59 Source: Internet
I know that with urllib you can parse a string and check if it's a valid URL. But how would one go about checking whether a sentence contains a URL, and then extracting that URL? I've seen some huge regular expressions out there, but I would rather not use something that I really can't comprehend.

So basically I have an input string, and I need to find and extract all the URLs within that string.

What's a clean way of going about this?


You can search for "words" containing : and then pass them to urlparse (renamed to urllib.parse in Python 3.0 and newer) to check if they are valid URLs.

Example:

import re

possible_urls = re.findall(r'\S+:\S+', text)
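A minimal sketch of the second half of that idea, passing each candidate to urlparse, might look like the following. The function name find_urls is made up for illustration, and treating "has both a scheme and a network location" as the test for a URL is an assumption: it deliberately rejects things like mailto: addresses and times such as 12:30.

```python
import re
from urllib.parse import urlparse


def find_urls(text):
    """Return whitespace-delimited tokens that urlparse splits into a scheme and a host.

    Hypothetical helper: requiring both parts is an assumption, not a full validity check.
    """
    urls = []
    for candidate in re.findall(r'\S+:\S+', text):
        try:
            parts = urlparse(candidate)
        except ValueError:
            # urlparse raises ValueError on some malformed input (e.g. a bad IPv6 host)
            continue
        if parts.scheme and parts.netloc:
            urls.append(candidate)
    return urls


sample = "Docs at http://example.com/page and ftp://files.example.org at 12:30 or mailto:bob@example.com"
print(find_urls(sample))
```

Note that urlparse is quite permissive, so this filters out obvious non-URLs rather than proving that a candidate is a well-formed URL.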

If you want to restrict yourself only to URLs starting with http:// or https:// (or anything else you want to allow) you can also do that with regular expressions, for example:

possible_urls = re.findall(r'https?://\S+', text)

You may also want to use some heuristics to determine where the URL starts and stops because sometimes people add punctuation to the URLs, giving new valid but unintentionally incorrect URLs, for example:

Have you seen the new look for http://example.com/? It's a total ripoff of http://example.org/!

Here the punctuation after each URL is not intended to be part of the URL. You can see from the automatically added links in the above text that Stack Overflow implements such heuristics.
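One simple heuristic of this kind, sketched below, is to strip common sentence punctuation from the end of each match and to drop a closing parenthesis that has no opening partner inside the URL. The name extract_urls and the exact set of characters in TRAILING_PUNCTUATION are assumptions chosen for illustration, not a description of what Stack Overflow actually does.

```python
import re

# Characters assumed to be sentence punctuation rather than part of the URL
TRAILING_PUNCTUATION = '.,;:!?\'"'


def extract_urls(text):
    """Find http(s) URLs and trim punctuation that most likely belongs to the sentence."""
    urls = []
    for match in re.findall(r'https?://\S+', text):
        url = match.rstrip(TRAILING_PUNCTUATION)
        # Drop a trailing ')' only if it is unbalanced, so URLs like .../a_(b) survive
        while url.endswith(')') and url.count(')') > url.count('('):
            url = url[:-1].rstrip(TRAILING_PUNCTUATION)
        urls.append(url)
    return urls


sentence = ("Have you seen the new look for http://example.com/? "
            "It's a total ripoff of http://example.org/!")
print(extract_urls(sentence))
```

The trade-off is that a URL which really does end in one of these characters, such as a trailing ?, gets truncated, so the character set is worth tuning to the kind of text being processed.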


Plucking a URL out of "the wild" is a tricky endeavor to do correctly. Jeff Atwood wrote a blog post on this subject: The Problem With URLs. John Gruber has addressed the issue as well: An Improved Liberal, Accurate Regex Pattern for Matching URLs. I have also written some code that attempts to tackle this problem: URL Linkification (HTTP/FTP) (for PHP/JavaScript). Note that my regex is particularly complex because it is designed to be applied to HTML markup and attempts to skip URLs that are already linkified (e.g. <a href="http://example.com">Link!</a>).

Second, when it comes to validating a URI/URL, the document you want to look at is RFC 3986. I've been working on an article dealing with this very subject: Regular Expression URI Validation. You may want to take a look at this as well.

But when you get down to it, this is not a trivial task!
