Suppose I have downloaded a page's HTML and can parse it. How do I get the "best" description of that website if it does not have a meta description tag?
You could take the first few sentences returned by something like Readability.
Safari 5 uses it, so it must be alright :)
To follow up on the "Readability" suggestion above (which itself is inspired by the website Instapaper), they have released the JavaScript: http://code.google.com/p/arc90labs-readability/. What's more, some guy took that and ported it to Python: http://github.com/gfxmonk/python-readability. Rejoice!
It's very hard to come up with a rule that works 100% of the time, obviously, but my suggestion as a starting point would be to look for the first <h1> tag (or <h2>, <h3>, etc.; the highest one you can find), then use the bit of text after it as the description. As long as the site is semantically marked up, that should give you a good description. (I guess you could also take the contents of the <h1> itself, but that's more like the "title".)
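Here is a rough sketch of that heading heuristic, using only Python's standard-library html.parser. The names HeadingTextCollector and describe_from_headings, the BLOCK tag list, and the 200-character cutoff are my own assumptions for illustration, not part of any library. It records the text that follows each heading (up to the next heading), then picks the text after the highest-ranking heading it found.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")
SKIP = ("script", "style")  # content we never want in a description
BLOCK = ("p", "div", "li", "br", "td", "section", "article")  # assumed list


class HeadingTextCollector(HTMLParser):
    """Collects the text that follows each heading, up to the next heading."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = []  # list of (heading_level, [text pieces])
        self._in_heading = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in HEADINGS:
            self.sections.append((int(tag[1]), []))
            self._in_heading = True
        elif tag in BLOCK and self.sections and not self._in_heading:
            # Keep words from adjacent block elements from running together.
            self.sections[-1][1].append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        # Ignore script/style content, the heading's own text,
        # and anything that appears before the first heading.
        if self._skip_depth or self._in_heading or not self.sections:
            return
        self.sections[-1][1].append(data)


def describe_from_headings(html, max_chars=200):
    """Return the text after the highest-ranking heading, or None."""
    parser = HeadingTextCollector()
    parser.feed(html)
    parser.close()
    candidates = []
    for level, parts in parser.sections:
        text = " ".join("".join(parts).split())  # collapse whitespace
        if text:
            candidates.append((level, text))
    if not candidates:
        return None
    # min() returns the first lowest-level entry, i.e. the earliest top heading.
    _, text = min(candidates, key=lambda c: c[0])
    return text[:max_chars]
```

If this returns None (no headings, or no text after them), you would still need a fallback such as the Readability approach mentioned above.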
It's interesting to note that Google (for example) uses a keyword-specific extract of the page contents to display as the description, rather than a static description. Not sure if that'll work for your situation, though.
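If you do know the search keywords, a very naive version of a keyword-specific extract could look like the sketch below. To be clear, this is not how Google does it; it is just an illustration that takes a window of text around the first keyword hit. The function name keyword_snippet and the width parameter are made up for this example, and it assumes you have already extracted plain text from the page.

```python
import re


def keyword_snippet(text, keywords, width=160):
    """Return roughly `width` characters of `text` around the first keyword hit.

    Falls back to the start of the text when no keyword matches.
    """
    text = " ".join(text.split())  # collapse whitespace
    pattern = "|".join(re.escape(k) for k in keywords if k)
    match = re.search(pattern, text, re.IGNORECASE) if pattern else None
    if match is None:
        return text[:width]
    start = max(0, match.start() - width // 2)
    end = min(len(text), start + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

For example, keyword_snippet(page_text, ["python", "readability"]) would return an excerpt around the first place either word appears, where page_text is whatever plain text you pulled out of the HTML.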