I know about utilities such as html2text and BeautifulSoup, but they also extract JavaScript and mix it into the text, which makes the two hard to separate.
htmlDom = BeautifulSoup(webPage)
htmlDom.findAll(text=True)
Alternately,
from stripogram import html2text
extract = html2text(webPage)
Both of these also extract all the JavaScript on the page, which is undesired.
I just want to extract the readable text, i.e. the text you could copy from your browser.
If you want to avoid extracting any of the contents of script tags with BeautifulSoup,
nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)
will do that for you, getting the root's immediate children that are non-script tags (and a separate htmlDom.findAll(recursive=False, text=True) will get the strings that are immediate children of the root). You need to do this recursively; e.g., as a generator:
def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
    for s in root.childGenerator():
        if hasattr(s, 'name'):      # then it's a tag
            if s.name == 'script':  # skip it!
                continue
            for x in getStrings(s): yield x
        else:                       # it's a string!
            yield s
I'm using childGenerator (in lieu of findAll) so that I can just get all the children in order and do my own filtering.
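The same skip-the-script walk can also be sketched without BeautifulSoup, using only the standard library's html.parser. This is a swapped-in technique rather than the generator above, and it assumes Python 3; the names VisibleTextParser and visible_text are made up for illustration, and skipping style tags as well is an extra assumption.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}  # assumption: style content is unwanted too

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the non-script, non-style text of an HTML string, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling visible_text(webPage) on a decoded HTML string returns the text chunks in document order. It does not try to detect elements hidden by CSS, so it only approximates what a browser shows.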
Using BeautifulSoup, something along these lines:
def _extract_text(t):
    if not t:
        return ""
    if isinstance(t, (unicode, str)):
        return " ".join(filter(None, t.replace("\n", " ").split(" ")))
    if t.name.lower() == "br": return "\n"
    if t.name.lower() == "script": return "\n"
    return "".join(extract_text(c) for c in t)

def extract_text(t):
    return '\n'.join(x.strip() for x in _extract_text(t).split('\n'))

print extract_text(htmlDom)
You can remove script tags in BeautifulSoup with something like:
for script in soup("script"):
    script.extract()
See "Removing Elements" in the BeautifulSoup documentation.
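With the newer bs4 package, the remove-then-extract idea might look like the sketch below. It assumes Python 3 with bs4 installed; the helper name readable_text is hypothetical, and dropping style tags along with script tags is an added assumption.

```python
from bs4 import BeautifulSoup


def readable_text(html):
    """Strip <script>/<style> elements, then return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup object is shorthand for find_all().
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove the element and its contents from the tree
    # One stripped string per line, in document order.
    return soup.get_text(separator="\n", strip=True)
```

decompose() discards the removed element, while extract() returns it in case you still need it; either works for this purpose.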
Try it out:
http://code.google.com/p/boilerpipe/
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/