I would like to scrape several different discussions forums, most of which have different HTML formats. Rather than dissecting the HTML for each page, it would be more efficient (and fun) to implement some sort of Learning Algorithm that could identify the different messages (i.e. structures) on each page, and individually parse them while simultaneously ignoring all the extraneous crap (i.e., ads and other nonsense). Could someone please point me to some references or sample code for work that's already been carried out in this area.
Moreover, does anyone know of p开发者_运维知识库seudocode for Arc90's readability code?
http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
build a solution that:
- takes some sample webpages with the same structure (eg forum threads)
- analyzes the DOM tree of each to find which parts are the same / different
- where they are different is the dynamic content you are after (posts, user names, etc)
This technique is known as wrapper induction.
There seems to be a Python port of arc90's Readability script that might point you in the right direction (or at least some direction).
Maybe not exactly correct but there's an O'Reilly book called 'Collective Intelligence' that may lead you in the right direction for what you are attempting to do. Additionally, many of the examples are in python :)
精彩评论