
What's the best way to get a description of the website, in Python?

Posted on https://www.devze.com, 2023-01-08 15:40. Source: the web.
Suppose I have downloaded the HTML and can parse it. How do I get the "best" description of that website if it does not have a meta description tag?


You could take the first few sentences returned by something like Readability.

Safari 5 uses it, so it must be alright :)
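
Once you have the extracted article text (from Readability or any other extractor), trimming it to the first few sentences could look roughly like this. It's only a sketch: `first_sentences` and its parameters are names I made up, and the sentence splitting is deliberately naive (it will trip over abbreviations like "e.g.").

```python
import re


def first_sentences(text, count=2, max_chars=300):
    """Return roughly the first `count` sentences of `text`, capped at `max_chars`."""
    # Collapse runs of whitespace so line breaks in the source do not matter.
    text = " ".join(text.split())
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    summary = " ".join(sentences[:count])
    if len(summary) > max_chars:
        # Cut at the last word boundary before the cap.
        summary = summary[:max_chars].rsplit(" ", 1)[0] + "..."
    return summary
```

Pass it the plain text of the extracted article, not the raw HTML.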


To follow up on the "Readability" suggestion above (which itself is inspired by the website InstaPaper), they have released the JavaScript: http://code.google.com/p/arc90labs-readability/. What's more, some guy took that and ported it to Python: http://github.com/gfxmonk/python-readability. Rejoice!


It's very hard to come up with a rule that works 100% of the time, obviously, but as a starting point I'd suggest looking for the first <h1> tag (or <h2>, <h3>, etc.; the highest-level heading you can find) and using the text that follows it as the description. As long as the site is semantically marked up, that should give you a good description. (I guess you could also take the contents of the <h1> itself, but that's more like the "title".)
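
Here's a rough sketch of that heading heuristic using only the standard library's `html.parser`. The names (`describe_from_heading`, `_EventCollector`, the `INLINE` tag set) and the 200-character cap are my own assumptions, not anything standard, and it will do poorly on pages that aren't marked up semantically.

```python
from html.parser import HTMLParser

HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}
SKIPPED = {"script", "style", "noscript"}
INLINE = {"a", "b", "i", "em", "strong", "span", "code", "small"}


class _EventCollector(HTMLParser):
    """Record heading positions and visible text in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []          # ("heading", level) or ("text", string)
        self._in_heading = False
        self._skip_depth = 0      # > 0 while inside <script>, <style>, ...

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag in HEADINGS:
            self._in_heading = True
        elif tag not in INLINE:
            self.events.append(("text", " "))  # block boundary acts as a space

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in HEADINGS:
            self._in_heading = False
            self.events.append(("heading", HEADINGS[tag]))
        elif tag not in INLINE:
            self.events.append(("text", " "))

    def handle_data(self, data):
        # Ignore script/style bodies and the heading's own text (the "title").
        if not self._skip_depth and not self._in_heading:
            self.events.append(("text", data))


def describe_from_heading(html, max_chars=200):
    """Return the text after the first highest-level heading, or None."""
    parser = _EventCollector()
    parser.feed(html)
    parser.close()
    levels = [value for kind, value in parser.events if kind == "heading"]
    if not levels:
        return None
    # First occurrence of the highest-ranked heading (h1 beats h2, and so on).
    start = parser.events.index(("heading", min(levels))) + 1
    chunks = [value for kind, value in parser.events[start:] if kind == "text"]
    description = " ".join("".join(chunks).split())
    return description[:max_chars].strip() or None
```

It returns `None` when the page has no headings at all, so you can fall back to something else (the first paragraph, say).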

It's interesting to note that Google (for example) uses a keyword-specific extract of the page contents to display as the description, rather than a static description. Not sure if that'll work for your situation, though.
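
If you do want something keyword-specific, a very simplified version of the idea is to cut a window of text around the first search term that appears on the page. To be clear, this is not how Google actually does it; `keyword_snippet` and the `width` parameter are just names I picked for the sketch.

```python
import re


def keyword_snippet(text, query, width=160):
    """Return about `width` characters of `text` around the first query term found."""
    # Normalize whitespace so offsets are not thrown off by line breaks.
    text = " ".join(text.split())
    terms = [re.escape(term) for term in query.split()]
    match = re.search("|".join(terms), text, re.IGNORECASE) if terms else None
    if match is None:
        # No keyword hit: fall back to the start of the page text.
        return text[:width]
    # Center the window on the match, clamped to the bounds of the text.
    start = max(0, match.start() - width // 2)
    end = min(len(text), start + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

You'd feed it the page's plain text plus whatever the user searched for, which only makes sense if you have a query to begin with.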
