I'm looking for a package / module / function etc. that is approximately the Python equivalent of Arc90's readability.js
http://lab.arc90.com/experiments/readability
http://lab.arc90.com/experiments/readability/js/readability.js
so that I can give it some input.html and the result is a cleaned-up version of that HTML page's "main text". I want this so that I can use it on the server side (unlike the JS version, which runs only in the browser).
Any ideas?
PS: I have tried Rhino + env.js and that combination works, but the performance is unacceptable: it takes minutes to clean up most HTML content :( (I still couldn't find why there is such a big performance difference).
Please try my fork https://github.com/buriy/python-readability, which is fast and has all the features of the latest JavaScript version.
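A minimal usage sketch, assuming the Document interface that fork exposes (check its README for the exact API):

from readability import Document

html = open('input.html').read()
doc = Document(html)
print(doc.title())    # extracted page title
print(doc.summary())  # cleaned-up HTML containing just the main text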
We've just launched a new natural language processing API over at repustate.com. Using a REST API, you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it to your heart's content. And it's implemented in Python. Check it out and compare the results to readability.js - I think you'll find they're almost 100% the same.
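For illustration only, a sketch of calling such a REST API with the requests library; the endpoint URL and field names below are hypothetical placeholders, not Repustate's documented API:

import requests

# Hypothetical endpoint and parameter names -- consult repustate.com's
# API docs for the real URL, key placement, and response format.
API_KEY = 'your-api-key'
response = requests.post(
    'http://api.repustate.com/clean-html.json',  # placeholder URL
    data={'api_key': API_KEY, 'url': 'http://example.com/article.html'},
)
print(response.json().get('text'))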
There's hn.py, via Readability's blog; Readable Feeds, an App Engine app, makes use of it.
I have bundled it as a pip-installable module here: http://github.com/srid/readability
I have done some research on this in the past and ended up implementing this approach [pdf] in Python. The final version I implemented also did some cleanup prior to applying the algorithm, like removing head/script/iframe elements, hidden elements, etc., but this was the core of it.
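As an illustration of that pre-cleanup step (my own reconstruction, not the original code), a minimal sketch using lxml:

import lxml.html

def pre_clean(doc):
    """Remove elements that never contain main text (illustrative only)."""
    for el in doc.xpath('//script | //style | //iframe | //head'):
        el.drop_tree()
    # Crude heuristic for hidden elements: inline display:none styles.
    for el in doc.xpath('//*[contains(@style, "display:none")]'):
        el.drop_tree()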
Here is a function with a (very) naive implementation of the "link list" discriminator, which attempts to remove elements with a heavy link-to-text ratio (i.e. navigation bars, menus, ads, etc.):
def link_list_discriminator(html, min_links=2, ratio=0.5):
    """Remove blocks with a high link-to-text ratio.

    These are typically navigation elements. Based on an algorithm
    described in: http://www.psl.cs.columbia.edu/crunch/WWWJ.pdf

    :param html: lxml HTML element tree (drop_tree() requires lxml).
    :param min_links: Minimum number of links inside an element
        before the block is considered for deletion.
    :param ratio: Ratio of link text to all text above which an
        element is considered for deletion.
    """
    def collapse(strings):
        # Join all non-empty, stripped text fragments into one string.
        return u''.join(filter(None, (text.strip() for text in strings)))

    # FIXME: This doesn't account for top-level text...
    for el in html.xpath('//*'):
        anchor_text = el.xpath('.//a//text()')
        anchor_count = len(anchor_text)
        anchor_text = collapse(anchor_text)
        text = collapse(el.xpath('.//text()'))
        anchor_chars = float(len(anchor_text))
        total_chars = float(len(text))  # renamed to avoid shadowing the builtin `all`
        if anchor_count > min_links and total_chars and anchor_chars / total_chars > ratio:
            el.drop_tree()
On the test corpus I used it actually worked quite well, but achieving high reliability will require a lot of tweaking.
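A quick usage sketch (assuming lxml, which the drop_tree() call above implies):

import lxml.html

doc = lxml.html.fromstring(open('input.html').read())
link_list_discriminator(doc)
print(doc.text_content())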
Why not try using Google V8/Node.js instead of Rhino? It should be acceptably fast.
I think BeautifulSoup is the best HTML parser for Python. But you still need to figure out what the "main" part of the site is.
If you're only parsing a single domain it's fairly straightforward, but finding a pattern that works for any site is not so easy.
Maybe you can port the readability.js approach to Python?
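As a very rough starting point for such a port (my own illustration, not readability.js itself), one could score block elements by how much text they contain:

from bs4 import BeautifulSoup

def guess_main_text(html):
    """Naive heuristic: return the text of the block with the most content."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style', 'head', 'iframe']):
        tag.decompose()  # strip elements that never hold readable text
    candidates = soup.find_all(['article', 'div', 'td'])
    if not candidates:
        return soup.get_text(strip=True)
    best = max(candidates, key=lambda t: len(t.get_text(strip=True)))
    return best.get_text(separator='\n', strip=True)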