Upload images from a web page

开发者 https://www.devze.com 2023-02-23 01:08 Source: Internet
I want to implement a feature similar to http://www.tineye.com/parse?url=yahoo.com: let a user upload images from any web page.

The main problem is that it takes too long for web pages with a large number of images.

I'm doing this in Django (using curl or urllib) according to the following scheme:

  1. Grab the HTML of the page (takes about 1 sec for big pages):

    file = urllib.urlopen(requested_url)
    html_string = file.read()
    
  2. Parse it with an HTML parser (BeautifulSoup), looking for img tags and writing the src of every image to a list (also takes about 1 sec for big pages).

  3. Check the sizes of all images in my list and, if they are big enough, return them in a JSON response (this takes very long, about 15 sec when there are about 80 images on a web page). Here's the code of the function:


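    import urllib
    from PIL import ImageFile

    def get_image_size(uri):
        # Read only the first 1 KB; the image header usually fits in it.
        file = urllib.urlopen(uri)
        try:
            data = file.read(1024)
            if not data:
                return None
            p = ImageFile.Parser()
            p.feed(data)
            if p.image:
                return p.image.size
            # not an image (or the header did not fit in 1 KB)
            return None
        finally:
            # always close the connection, even on an early return
            file.close()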

As you can see, I'm not loading the full image to get its size, only the first 1 KB of it. But it still takes too much time when there are a lot of images (I'm calling this function once for each image found).

So how can I make it work faster?

Maybe there is a way to avoid making a request for every single image?

Any help will be highly appreciated.

Thanks!


I can think of a few optimisations:

  1. parse as you are reading the file from the stream
  2. use a SAX parser (which works well with the point above)
  3. use HEAD to get the size of the images (the Content-Length in bytes, not the pixel dimensions)
  4. put your images in a queue, then use a few threads to connect and get the file sizes
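
A minimal sketch of points 1, 2 and 4, written for Python 3 (the question itself uses Python 2). It swaps in the standard library's event-driven `html.parser.HTMLParser` for a true SAX parser and `concurrent.futures.ThreadPoolExecutor` for a hand-written queue with threads. The names `ImgSrcCollector`, `collect_img_sources` and `sizes_in_parallel` are hypothetical, and `get_size` stands for whatever per-image check you already have, such as `get_image_size` from the question.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Event-driven (SAX-style) parser that records the src of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def collect_img_sources(chunks):
    """Feed decoded HTML text piece by piece, e.g. as it arrives from the network."""
    parser = ImgSrcCollector()
    for chunk in chunks:
        parser.feed(chunk)  # incomplete tags are buffered until the next chunk
    parser.close()
    return parser.sources


def sizes_in_parallel(uris, get_size, max_workers=16):
    """Call get_size(uri) for each URI on a pool of threads; return {uri: result}."""
    uris = list(uris)

    def safe_get_size(uri):
        try:
            return get_size(uri)
        except Exception:
            # One unreachable image should not abort the whole batch.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(uris, pool.map(safe_get_size, uris)))
```

Something like `sizes_in_parallel(collect_img_sources(chunks), get_image_size)` would then tie the steps together. Relative `src` values would first need resolving against the page URL, for example with `urllib.parse.urljoin`. Because the work is dominated by network waits, threads should help here despite the GIL, though the value of `max_workers` is a guess to tune.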

Example of a HEAD request:

$ telnet m.onet.pl 80
Trying 213.180.150.45...
Connected to m.onet.pl.
Escape character is '^]'.
HEAD /_m/33fb7563935e11c0cba62f504d91675f,59,29,134-68-525-303-0.jpg HTTP/1.1
host: m.onet.pl

HTTP/1.0 200 OK
Server: nginx/0.8.53
Date: Sat, 09 Apr 2011 18:32:44 GMT
Content-Type: image/jpeg
Content-Length: 37545
Last-Modified: Sat, 09 Apr 2011 18:29:22 GMT
Expires: Sat, 16 Apr 2011 18:32:44 GMT
Cache-Control: max-age=604800
Accept-Ranges: bytes
Age: 6575
X-Cache: HIT from emka1.m10r2.onet
Via: 1.1 emka1.m10r2.onet:80 (squid)
Connection: close

Connection closed by foreign host.
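
The same HEAD request can be sketched in Python 3 with `urllib.request`. The function name `head_content_length` and the `opener` parameter are assumptions added for illustration (the latter only so the function can be exercised without a network). Note that this returns the file size in bytes, not the pixel dimensions, and that some servers omit Content-Length or reject HEAD altogether.

```python
import urllib.request


def head_content_length(uri, timeout=5, opener=urllib.request.urlopen):
    """Send a HEAD request and return Content-Length as an int, or None if absent."""
    request = urllib.request.Request(uri, method="HEAD")
    with opener(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    if value is None or not value.strip().isdigit():
        return None
    return int(value.strip())
```

A byte-size threshold is only a rough filter for "big enough" images, so it may be worth running the dimension check from the question only on the images that pass it.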


You can use the headers attribute of the file-like object returned by urllib2.urlopen (I don't know about urllib).

Here's a test I wrote for it. As you can see, it is rather fast, though I imagine some websites would block too many repeated requests.

|milo|laurie|¥ cat test.py
import urllib2
uri = "http://download.thinkbroadband.com/1GB.zip"

def get_file_size(uri):
    file = urllib2.urlopen(uri)
    content_header, = [header for header in file.headers.headers if header.startswith("Content-Length")]
    _, str_length = content_header.split(':')
    length = int(str_length.strip())
    return length

if __name__ == "__main__":
    get_file_size(uri)
|milo|laurie|¥ time python2 test.py
python2 test.py  0.06s user 0.01s system 35% cpu 0.196 total
0
