Techniques for extracting the 'best' image from a webpage

Developer (https://www.devze.com), 2022-12-22 18:43. Source: the web

I'm trying to build something akin to Facebook's "Share" functionality for my website.

I've gotten to the point where I can accept a URL, scrape it for meta keywords and suitably get titles/descriptions, but I'm a bit stuck as to the best way to determine 'likely' photos the user may want to share.

I currently use SimpleXMLElement to turn the page into a traversable DOM, find all the <img> tags, and turn their sources into absolute URLs. After that, I'm not sure how to go about finding a suitable thumbnail.
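
For reference, the extraction step described above could look roughly like the sketch below. It is written in Python with the standard library rather than with SimpleXMLElement, and the names `ImgCollector` and `extract_image_urls` are made up for illustration. It ignores any `<base href>` on the page, which a real scraper would probably want to honor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgCollector(HTMLParser):
    """Collects the src of every <img> tag, resolved against the page URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # urljoin turns relative and root-relative paths into absolute URLs.
            self.images.append(urljoin(self.page_url, src))


def extract_image_urls(html, page_url):
    """Return absolute URLs for all <img> tags in document order."""
    collector = ImgCollector(page_url)
    collector.feed(html)
    return collector.images
```

Calling `extract_image_urls(html, "http://example.com/blog/post.html")` returns the image URLs in the order they appear, duplicates included, which is useful later for spotting repeated design elements.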

Do I download them all, and go by file size? Do I use some sort of heuristic like, "was encountered in the middle of the page"?

Does anyone else have any recommendations, suggestions, or tips?

I wrote something similar a while ago to get images from scraped blog posts. My approach was to get a list of all images on the page and then assign 'priority points' using criteria along these lines:

Ignore images hosted on a blacklisted host (blacklist taken from AdBlocker's list)
Ignore indirect images, e.g. those linked to from stylesheets or inside an IFRAME
Ignore images under 50 pixels wide or high
Ignore images which appear more than once
Assign priority points to images hosted on a whitelist of hosts (e.g. photobucket, imageshack.us)
Assign priority points to the largest 3 images on the page
Assign priority points to images on the same host as the page
Assign priority points to images with an ALT attribute defined
Assign priority points to images appearing in a P tag

Then pick the one with the most priority points. It certainly wasn't foolproof or overly scientific but it got something useful far more often than not.
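
A minimal sketch of that filter-then-score idea, in Python, might look like this. The `Candidate` fields, the host lists, and the one-point-per-rule values are all assumptions made for illustration, not the values I actually used. Hosts are matched exactly, so subdomains would need extra handling.

```python
from collections import Counter
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical host lists; a real blacklist would come from an ad-blocking list.
BLOCKED_HOSTS = {"ads.example.net"}
PREFERRED_HOSTS = {"photobucket.com", "imageshack.us"}
MIN_SIDE = 50  # pixels


@dataclass
class Candidate:
    url: str
    width: int
    height: int
    has_alt: bool = False
    in_paragraph: bool = False
    indirect: bool = False  # e.g. found in a stylesheet or an IFRAME


def pick_best(candidates, page_url):
    """Drop ignored images, then return the candidate with the most points."""
    page_host = urlparse(page_url).hostname
    counts = Counter(c.url for c in candidates)

    # The "ignore" rules remove candidates outright.
    kept = [
        c for c in candidates
        if urlparse(c.url).hostname not in BLOCKED_HOSTS
        and not c.indirect
        and c.width >= MIN_SIDE and c.height >= MIN_SIDE
        and counts[c.url] == 1
    ]
    if not kept:
        return None

    by_area = sorted(kept, key=lambda c: c.width * c.height, reverse=True)
    largest_urls = {c.url for c in by_area[:3]}

    def points(c):
        host = urlparse(c.url).hostname
        # Each satisfied rule is worth one point (an assumed weighting).
        return (
            (host in PREFERRED_HOSTS)
            + (c.url in largest_urls)
            + (host == page_host)
            + c.has_alt
            + c.in_paragraph
        )

    return max(kept, key=points)
```

Ties go to whichever candidate appears first on the page, since `max` returns the first maximal element.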

I don't have any direct experience doing this so I'm not sure that there is any specific best practice, but in general I think a heuristic approach looking at several factors would make sense because of the variability found in website implementations.

I would look at two sets of items: image properties, and the context of where and how the images are placed.

Image Properties:

Width and height meet minimum thresholds
Aspect ratio is reasonable (background images that tile may have extreme aspect ratios, which provides a good indication that the image may not be suitable)
More than one color exists in the image (harder to detect, but may help avoid various background images)
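
As a rough illustration of the property checks above, here is a small Python sketch. The thresholds (50 pixels, a 3:1 aspect ratio) are assumed values, and `pixels` stands for whatever decoded pixel data your imaging library gives you; no particular library is implied.

```python
def has_reasonable_shape(width, height, min_side=50, max_aspect=3.0):
    """Minimum-size and aspect-ratio checks; both thresholds are assumptions."""
    if width < min_side or height < min_side:
        return False
    # Long side over short side, so the check is orientation-independent.
    aspect = max(width, height) / min(width, height)
    return aspect <= max_aspect


def has_multiple_colors(pixels):
    """True as soon as a second distinct pixel value is seen.

    `pixels` is any iterable of comparable pixel values, such as RGB tuples.
    """
    iterator = iter(pixels)
    try:
        first = next(iterator)
    except StopIteration:
        return False  # no pixel data at all
    return any(px != first for px in iterator)
```

The color check stops at the first differing pixel, so it is cheap for real photos and only scans the whole image for solid-color fills.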

Image Context:

Image does not repeat on page (this avoids using icons and other design elements that may repeat)
Occurs after h1, h2, etc. tags on the page (this gets to your point about images coming from the middle of the page, again avoiding design elements)
Has an alt attribute (though this is not consistently used, so perhaps does not provide much useful information)
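
The context signals above can be gathered in a single pass over the markup. The following Python sketch uses the standard library's `HTMLParser`; the class and function names and the shape of the returned dictionaries are assumptions for illustration. An empty `alt=""` is treated as having no alt text.

```python
from collections import Counter
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ImageContextParser(HTMLParser):
    """Records, per <img>, whether a heading preceded it and whether it has alt text."""

    def __init__(self):
        super().__init__()
        self.seen_heading = False
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.seen_heading = True
        elif tag == "img":
            attr = dict(attrs)
            if attr.get("src"):
                self.images.append({
                    "src": attr["src"],
                    "after_heading": self.seen_heading,
                    "has_alt": bool(attr.get("alt")),
                })


def image_context(html):
    """Return one dict per <img>, with a 'repeats' flag added after parsing."""
    parser = ImageContextParser()
    parser.feed(html)
    counts = Counter(img["src"] for img in parser.images)
    for img in parser.images:
        img["repeats"] = counts[img["src"]] > 1
    return parser.images
```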

I would assign weights to the previous items and then rank the images you find according to how well each image satisfies the rules.
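
The weighting step might be sketched like this in Python. The rule names and weight values are hypothetical and would need tuning against real pages; each image is described only by which rules it satisfies.

```python
# Hypothetical weights; the rule names mirror the two lists above.
WEIGHTS = {
    "min_size": 3.0,
    "reasonable_aspect": 2.0,
    "multiple_colors": 2.0,
    "not_repeated": 2.0,
    "after_heading": 1.0,
    "has_alt": 0.5,
}


def score(rules):
    """Sum the weights of the rules this image satisfies."""
    return sum(weight for name, weight in WEIGHTS.items() if rules.get(name))


def rank_images(images):
    """`images` maps URL -> {rule name: bool}; returns URLs, best first."""
    return sorted(images, key=lambda url: score(images[url]), reverse=True)
```

Keeping the weights in one table makes it easy to adjust them as you see which images the ranking gets wrong.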

Also, note that some pages may use CSS (or Flash, etc.) to display images. These are outside the set of images your algorithm would find; perhaps not a big deal, but something to consider.