I'm interested in finding out how to scrub an HTML page and present it nicely: remove all the clutter and reformat the main text into a very readable format, like http://lab.arc90.com/experiments/readability or Instapaper.
Is it simply a matter of parsing the page and removing elements that are not within
?
Was this discussed somewhere else?
Readability is not a simple parser; it uses a complex algorithm to retrieve only the required components. If you are not a programming guru, I would suggest you use their free service, described below.
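To give a feel for why plain tag filtering is only a starting point, here is a naive sketch using Python's standard html.parser. It is not Readability's algorithm: it just keeps the text of paragraph tags, skips a few containers that usually hold clutter, and drops very short paragraphs. The names ParagraphExtractor and extract_paragraphs, the list of skipped tags, and the min_words threshold are all made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect text found inside <p> tags, ignoring common clutter containers."""

    # Assumption: these tags usually wrap navigation, ads, or code, not article text.
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # how many skipped containers we are currently inside
        self.in_p = False     # True while inside a <p> that we want to keep
        self.current = []     # text chunks of the paragraph being read
        self.paragraphs = []  # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_p = True
            self.current = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_p:
            # Collapse runs of whitespace so the output reads cleanly.
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append(text)
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and self.skip_depth == 0:
            self.current.append(data)


def extract_paragraphs(html, min_words=5):
    """Return paragraphs with at least min_words words (a crude clutter filter)."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return [p for p in parser.paragraphs if len(p.split()) >= min_words]
```

Something like this works on clean, well-structured pages and breaks quickly on real ones (content in div tags, comment threads, multi-page articles), which is where the scoring heuristics of the real tools come in.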
You can request a developer API key from Readability (http://www.readability.com/publishers/api).
If you request access to the parser, it will do exactly what you want to achieve, which is to extract content from sites. Just remember to give them a good enough reason to allow you to use their API.
A query to their parsing service will look like the following:
https://www.readability.com/api/content/v1/parser?url={url to be parsed here}&token={your api key here}
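As a small sketch of how that query string could be put together safely in Python: the endpoint and the two parameter names (url, token) come from the line above, while the function name build_parser_url is my own. It only builds the URL and does not send any request.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the answer above.
PARSER_ENDPOINT = "https://www.readability.com/api/content/v1/parser"


def build_parser_url(page_url, token):
    """Build the parser query URL, percent-encoding both parameters."""
    query = urlencode({"url": page_url, "token": token})
    return f"{PARSER_ENDPOINT}?{query}"
```

Encoding matters here because the page URL being parsed usually contains its own ":" , "/" and "?" characters, which would otherwise be confused with the outer query string.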
The request will return a response like:
HTTP/1.0 200 OK
{
    "domain": "blog.readability.com",
    "author": "Richard Ziade",
    "url": "http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/",
    "short_url": "http://rdd.me/kbgr5a1k",
    "title": "Step Up & Be Heard: Readability Ideas",
    "total_pages": 1,
    "word_count": 175,
    "content": "<div>\n \n<div class=\"entry\">\n\t<p>When we launched Readability [snip] ...</div>\n</div>",
    "date_published": "2011-02-22 00:00:00",
    "next_page_id": null,
    "rendered_pages": 1
}
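Once you have the JSON body, pulling out the useful parts is straightforward. A minimal sketch, assuming the body has the fields shown in the sample response above; the function name summarize_parser_response and the key content_html are hypothetical, chosen for this example.

```python
import json


def summarize_parser_response(body):
    """Decode the JSON body and keep the fields needed to render the article."""
    data = json.loads(body)
    return {
        "title": data.get("title"),
        "author": data.get("author"),
        "word_count": data.get("word_count"),
        # "content" holds the cleaned-up article as an HTML fragment.
        "content_html": data.get("content", ""),
    }
```

Using .get means a missing field comes back as None (or an empty string for the content) instead of raising, which is handy if some pages have no detected author.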
For the hardcore folks out there, check out the Node.js, Ruby and Python ports of Readability from here: http://arrix.blogspot.com/2010/11/server-side-readability-with-nodejs.html
Happy coding
https://github.com/jiminoc/goose/wiki does something like what you're asking; the source code is openly available, along with unit tests.
If the web page or site in question makes good use of semantic elements and structure, you could just apply a different CSS stylesheet, which can drastically change the layout and display.