Hpricot(html).inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
hpricot = Hpricot(html)
hpricot.search("script").remove
hpricot.search("link").remove
hpricot.search("meta").remove
hpricot.search("style").remove
Found it at http://www.savedmyday.com/2008/04/25/how-to-extract-text-from-html-using-rubyhpricot/
Nokogiri and Hpricot are pretty much interchangeable here, i.e. Nokogiri(html) is the equivalent of Hpricot(html). I'm not really sure what the linked article is trying to achieve, but to:
Extract text from HTML body which includes ignoring large white spaces between tags and words.
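This would be an easier approach in Hpricot, and it removes the need for the hpricot.search("script").remove bits for anything that lives in the head, i.e. just get the body in the first place. (Any script or style tags inside the body itself would likely still need removing.)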
Hpricot(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
And in Nokogiri:
Nokogiri(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
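As a side note, the whitespace cleanup at the end of both one-liners can probably be shortened. Below is a minimal sketch in plain Ruby (no Hpricot or Nokogiri needed), assuming the text has already been extracted with inner_text. The helper name squish_whitespace and the sample string are made up for illustration and are not part of either library.

```ruby
# Hypothetical helper, not part of Hpricot or Nokogiri:
# collapses every run of whitespace into a single space.
def squish_whitespace(text)
  # String#split with no argument splits on runs of whitespace
  # (spaces, tabs, \r, \n) and ignores leading/trailing whitespace,
  # so the separate gsub("\r"," ") and gsub("\n"," ") calls are not needed.
  text.split.join(" ")
end

# Made-up sample standing in for the result of inner_text.
sample = "  Hello\r\n   world,\n\tthis is   text. "

puts squish_whitespace(sample)
```

For whitespace-only cleanup this should give the same result as the gsub/split/join chain above, because split(" ") with a single-space argument already splits on any whitespace, not just literal spaces.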