I'm making an app that parses HTML and extracts images from it. Parsing is easy with Beautiful Soup, and downloading the HTML and the images works too with urllib2.
I do have a problem using urlparse to make absolute URLs out of relative ones. The problem is best explained with an example:
>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'
As you can see, urlparse doesn't strip the ../ from the result. This causes a problem when I try to download the image:
HTTPError: HTTP Error 400: Bad Request
Is there a way to fix this problem in urllib?
".." takes you up one directory ("." is the current directory), so combining it with a URL that consists of only a domain name doesn't make much sense. Maybe what you need is:
>>> urlparse.urljoin("http://www.example.com","./test.png")
'http://www.example.com/test.png'
I think the best you can do is to pre-parse the original URL, and check the path component. A simple test is
if len(urlparse.urlparse(baseurl).path) > 1:
Then you can combine it with the indexing suggested by demas. For example:
# Strip the leading ".." only when the base URL has no path beyond "/"
start_offset = 2 if len(urlparse.urlparse(baseurl).path) <= 1 else 0
img_url = urlparse.urljoin(baseurl, "../test.png"[start_offset:])
This way, you will not attempt to go to the parent of the root URL.
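The check described above could be wrapped in a small helper. This is only a sketch: the name join_from_root_safe is made up for illustration, and unlike the slicing shown above it strips every leading "../" (not just the first two characters) when the base URL has no path beyond "/". The import fallback is there because the urlparse module is called urllib.parse in Python 3.

```python
try:
    import urlparse  # Python 2
except ImportError:
    import urllib.parse as urlparse  # Python 3 name of the same module


def join_from_root_safe(base_url, relative_url):
    """Join two URLs without climbing above the site root (hypothetical helper)."""
    base_path = urlparse.urlparse(base_url).path
    # A path of "" or "/" means the base URL is already at the root,
    # so leading "../" segments have nowhere to go and are dropped.
    if len(base_path) <= 1:
        while relative_url.startswith("../"):
            relative_url = relative_url[3:]
    return urlparse.urljoin(base_url, relative_url)


print(join_from_root_safe("http://www.example.com/", "../test.png"))
```

When the base URL does have a deeper path, the relative URL is left alone and urljoin resolves the ".." as usual.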
If you'd like /../test to mean the same as /test, like paths in a file system, then you could use posixpath.normpath():
>>> import posixpath
>>> url = urlparse.urljoin("http://example.com/", "../test")
>>> p = urlparse.urlparse(url)
>>> path = posixpath.normpath(p.path)
>>> urlparse.urlunparse((p.scheme, p.netloc, path, p.params, p.query, p.fragment))
'http://example.com/test'
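The same steps can be collected into one function. This is a sketch rather than a definitive fix: the name join_and_normalize is invented here, and the trailing-slash handling is an addition of mine, because normpath() drops a trailing "/" that may matter for directory-style URLs. Newer Python 3 versions may already remove the extra ".." inside urljoin, in which case the normpath() step simply changes nothing.

```python
import posixpath

try:
    import urlparse  # Python 2
except ImportError:
    import urllib.parse as urlparse  # Python 3 name of the same module


def join_and_normalize(base_url, relative_url):
    """Join two URLs, then collapse "." and ".." in the path (hypothetical helper)."""
    parts = urlparse.urlparse(urlparse.urljoin(base_url, relative_url))
    path = parts.path
    if path:  # normpath("") would return ".", so leave an empty path alone
        normalized = posixpath.normpath(path)
        # normpath strips a trailing slash; restore it if the URL had one
        if path.endswith("/") and not normalized.endswith("/"):
            normalized += "/"
        path = normalized
    # ParseResult is a namedtuple, so _replace swaps in the cleaned path
    return urlparse.urlunparse(parts._replace(path=path))


print(join_and_normalize("http://www.example.com/", "../test.png"))
```

The query string and fragment are carried through untouched, since only the path component is normalized.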
urlparse.urljoin("http://www.example.com/", "../test.png"[2:])
Is this what you need?