Let's say you're given http://nytimes.com. How would you pull out the "main" image?
I'm asking because Flipboard is able to grab the main image from a website using just the URL.
You could parse out all the image tags. But then what?
I don't believe there's a standard method. You could start by looking for an Open Graph Protocol image tag. Facebook uses these to select images for URLs posted in status updates and comments.
<meta property="og:image" content="http://ia.media-imdb.com/rock.jpg"/>
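As a rough illustration, a tag like the one above can be read with nothing but Python's standard library. This is only a minimal sketch: `OpenGraphImageParser` and `find_og_image` are names made up for this example, and the page is passed in as a string rather than downloaded.

```python
from html.parser import HTMLParser


class OpenGraphImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) also end up here, because the
        # default handle_startendtag() calls handle_starttag().
        if tag != "meta" or self.image_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.image_url = attributes.get("content")


def find_og_image(html):
    """Return the og:image URL declared in the HTML, or None if there is none."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.image_url


page = (
    '<html><head>'
    '<meta property="og:image" content="http://ia.media-imdb.com/rock.jpg"/>'
    '</head></html>'
)
print(find_og_image(page))
```

If the function returns None, the page does not declare an Open Graph image and you would need to fall back to some other strategy.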
If you're prepared to use a third party, Embedly offer this as a chargeable service.
Embedly provides a powerful API to convert standard URLs into embedded videos, images, and rich article previews from 218 leading providers.
There are several strategies for determining the "main" image of a URL:
- many websites now declare what the main image is (for Facebook OpenGraph or Twitter Cards)
- sometimes, the image can be guessed from the URL or by doing an API call (especially true for image hosting websites such as Instagram)
- the main image can also be determined by analyzing the webpage with content extraction techniques (Readability). You might want to filter out "noise" to get rid of tracking pixels or ads.
- if all these techniques fail, you can download all the images and assume that the largest images are the most interesting.
I've created a JavaScript library that uses most of these techniques to determine the "main" picture of a URL: ImageResolver.
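The "filter out noise" step could look something like the sketch below. It is not how ImageResolver works; the URL patterns and the 50-pixel threshold are assumptions chosen purely for illustration, and the candidates are assumed to be already collected as (src, width, height) tuples.

```python
import re

# Hypothetical patterns for ad and tracking URLs; a real list would be much longer.
NOISE_URL_PATTERN = re.compile(r"doubleclick|adserver|/ads/|pixel|tracking", re.IGNORECASE)

# Assumed threshold: anything smaller is treated as an icon or tracking pixel.
MIN_DIMENSION = 50


def is_noise(src, width, height):
    """Return True for images that are unlikely to be the main picture."""
    if NOISE_URL_PATTERN.search(src):
        return True
    # width or height may be None when the <img> tag does not declare them.
    if width is not None and width < MIN_DIMENSION:
        return True
    if height is not None and height < MIN_DIMENSION:
        return True
    return False


def filter_candidates(images):
    """Keep only the (src, width, height) tuples that pass the noise check."""
    return [image for image in images if not is_noise(*image)]


candidates = [
    ("http://example.com/pixel.gif", 1, 1),
    ("http://ads.example.com/ads/banner.jpg", 728, 90),
    ("http://example.com/story.jpg", 600, 400),
    ("http://example.com/icon.png", 16, 16),
]
print(filter_candidates(candidates))
```

Whatever survives the filter can then be handed to the "largest image wins" fallback.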
There really isn't anything that is considered the "main" image in a web page; nothing in HTML or otherwise distinguishes it. Not to mention that you'd probably have to look at the images referenced from CSS as well (background images and so on). But if I had to do this, here is what I would do:
1. First, I would decide on a suitable minimum image size, let's say 400x400. (I don't want to pick any old image; something really small would likely scale horribly.)
2. I would then iterate through each image on the page.
3. For each image I encountered, I would check its size. If it was 400x400 (my predefined size) or larger, I would use this image. If it wasn't, I would check whether it's the largest image I've found so far and, if so, keep its information stored off to the side.
4. Once I had checked a predefined number of images (for argument's sake let's say 10, though you'd probably go much higher), I'd use the largest image I've found (stored off to the side), because I wouldn't want to scan the page indefinitely looking for images!
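The procedure described in this answer can be sketched in a few lines of Python. The function name `pick_main_image` is made up for the example, and it assumes the image sizes are already known and passed in as (src, width, height) tuples; actually fetching the images to measure them is left out.

```python
def pick_main_image(images, min_width=400, min_height=400, max_checked=10):
    """Pick a "main" image from an iterable of (src, width, height) tuples.

    Returns the first image that meets the minimum size. If none does within
    max_checked images, returns the largest one seen so far (None if the
    iterable is empty).
    """
    largest_src = None
    largest_area = 0
    for checked, (src, width, height) in enumerate(images, start=1):
        # An image that is big enough is used immediately.
        if width >= min_width and height >= min_height:
            return src
        # Otherwise remember it if it is the largest so far.
        area = width * height
        if area > largest_area:
            largest_src, largest_area = src, area
        # Stop after a fixed number of images to avoid scanning forever.
        if checked >= max_checked:
            break
    return largest_src


images = [
    ("logo.png", 100, 100),
    ("thumb.jpg", 300, 200),
    ("lead.jpg", 640, 480),
    ("gallery.jpg", 2000, 2000),
]
print(pick_main_image(images))
```

Note that "largest" is measured by pixel area here, which is one possible reading of the answer; comparing width and height separately would be another.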
Facebook allows the user to pick one of several images that it has deemed to be a "main" image. As far as automatically determining a "main" image, I would judge it based on page position, size, relation to text, and (if you wanted to be more sophisticated) its visual content.
For example, you could use a simple face detection program, or look at color breakdowns to determine if the picture was "interesting" to you or not.
EDIT: In the case of www.nytimes.com, I would probably just look at the page structure, because a large carousel of images is located right underneath an H1 tag.
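To make the idea of judging by page position, size, and relation to text a bit more concrete, here is one possible scoring sketch. Everything in it is an assumption for illustration: the `Candidate` fields, the 800x600 reference size, and the weights are invented and would need tuning against real pages. Visual-content checks such as face detection are not covered.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    src: str
    width: int
    height: int
    position: int    # index of the image in document order, 0 = first
    near_text: bool  # True if the image sits inside or next to the article text


def score(candidate, total_images):
    """Combine size, position, and text proximity into a score between 0 and 1."""
    # Size: area relative to an assumed 800x600 reference, capped at 1.0.
    size_score = min(candidate.width * candidate.height / (800 * 600), 1.0)
    # Position: images earlier in the page score higher.
    position_score = 1.0 - candidate.position / max(total_images, 1)
    text_score = 1.0 if candidate.near_text else 0.0
    # Hypothetical weights.
    return 0.5 * size_score + 0.2 * position_score + 0.3 * text_score


def best_candidate(candidates):
    """Return the highest-scoring candidate, or None for an empty list."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: score(c, len(candidates)))


candidates = [
    Candidate("logo.png", 120, 60, 0, False),
    Candidate("hero.jpg", 800, 600, 1, True),
    Candidate("footer.jpg", 800, 600, 2, False),
]
print(best_candidate(candidates).src)
```

With these made-up weights, a large image next to the article text beats both a small logo at the top of the page and an equally large image further down.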