Is there an API or systematic way of stripping irrelevant parts of a web page while scraping it via Python? For instance, take this very page -- the only important part is the question and the answers, not the sidebar column, header, etc. One can guess things like that, but is there any smart way of doing it?
There's the approach from the Readability bookmarklet, with at least two Python implementations available:
- decruft
- python-readability
In general, no. In specific cases, if you know something about the structure of the site you are scraping, you can use a tool like Beautiful Soup to manipulate the DOM.
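To make the "known structure" case concrete, here is a minimal sketch that drops everything inside tags assumed to be page chrome and keeps the remaining text. It uses the standard library's html.parser rather than Beautiful Soup so it runs without extra installs; the tag list in SKIP_TAGS and the names MainTextExtractor and extract_main_text are hypothetical and would need adjusting for a real site.

```python
from html.parser import HTMLParser

# Hypothetical list of tags treated as page chrome; adjust per site.
SKIP_TAGS = {"header", "nav", "aside", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the skipped tags."""

    def __init__(self, skip_tags=SKIP_TAGS):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.depth = 0    # number of skipped elements we are currently inside
        self.chunks = []  # text fragments that survived the filter

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.chunks.append(text)


def extract_main_text(html, skip_tags=SKIP_TAGS):
    """Return the page text with the skipped elements removed."""
    parser = MainTextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

This only works when the site actually marks its chrome with such tags; many pages use generic div elements, in which case you would filter on ids or classes you found by inspecting the page.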
One approach is to compare the structure of multiple webpages that share the same template. In this case you would compare multiple SO questions. Then you can determine which content is static (useless) or dynamic (useful).
This field is known as wrapper induction. Unfortunately it is harder than it sounds!
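A very rough sketch of the comparison idea, using only the standard library: collect the text blocks of several pages from the same template and discard blocks that recur on most of them. This is a crude text-level stand-in, not real wrapper induction (which works on the page structure), and the names text_blocks, dynamic_blocks and the max_share threshold are assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TextBlocks(HTMLParser):
    """Collect each non-empty text node, with whitespace normalized."""

    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.blocks.append(text)


def text_blocks(html):
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks


def dynamic_blocks(pages, max_share=0.5):
    """For each page, keep the blocks that appear on at most
    max_share of all pages; the rest is treated as static template text."""
    per_page = [text_blocks(page) for page in pages]
    # Count on how many pages each block occurs (once per page).
    counts = Counter(block for blocks in per_page for block in set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if counts[b] <= limit] for blocks in per_page]
```

With three question pages that share a header and footer, only the per-question text survives. Short answers that happen to repeat across pages (for example "Thanks!") would be wrongly discarded, which is one reason the real problem is harder than it sounds.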
This GitHub project addresses your problem, but it's in Java. It may be worth a look: goose