I am working on a project to fetch Google search result pages and then strip the HTML tags to obtain the pure text content.
Any suggestions for available tools (esp. Python tools)?
Many thanks.
I'd check out Pattern, which is a Python web mining module providing a suite of text retrieval, analysis, and visualization tools. I haven't personally used it, but it looks powerful.
Module pattern.web is a web toolkit that bundles various APIs (Google, Gmail, Bing, Twitter, Wikipedia, Flickr) with a robust HTML parser and web spider. Its purpose is to retrieve online content in an easy-to-use, uniform way.
Python has a built-in HTML parser that's actually pretty quick. There's also a really powerful one called Beautiful Soup that offers additional functionality, especially for HTML scraping.
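To illustrate the built-in route, here is a minimal sketch that strips tags using `html.parser` from the Python 3 standard library. The names `TextExtractor` and `html_to_text` are made up for this example, and skipping `<script>` and `<style>` contents is my own assumption about what "pure text" should mean here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<p>Hello &amp; <b>world</b></p><script>var x = 1;</script>"
    print(html_to_text(sample))
```

Note that joining chunks with a space will also split a word that has markup in the middle of it (e.g. `wor<b>ld</b>`), so this is only a rough starting point. For messy real-world pages, Beautiful Soup is probably the more forgiving choice.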
However, I also have to ask why not use the search API?
Finally found a nice suite: BootCat.