Suggestions for obtaining Google search results and cleaning HTML tags

开发者 https://www.devze.com 2023-02-14 18:07 Source: the web
I am working on a project that fetches Google search result pages and then strips the HTML tags to obtain the plain-text content.

Any suggestions for available tools (especially Python tools)?

Many thanks.


I'd check out Pattern, which is a Python web mining module providing a suite of text retrieval, analysis, and visualization tools. I haven't personally used it, but it looks powerful.

Module pattern.web is a web toolkit that bundles various APIs (Google, Gmail, Bing, Twitter, Wikipedia, Flickr) with a robust HTML parser and web spider. Its purpose is to retrieve online content in an easy-to-use, uniform way.


Python has a built-in one that's actually pretty quick, found here. There's also a really powerful one called Beautiful Soup that offers additional functionality, especially for HTML scraping.
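To give a rough idea of the built-in route, here is a minimal sketch that strips tags using only the standard library's html.parser module. This assumes that module is the built-in parser meant above; the names TextExtractor and strip_tags are made up for illustration, and the list of block-level tags is just a guess at what usually needs a word break.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents.

    Hypothetical helper for illustration, not part of any library.
    """

    SKIPPED_TAGS = {"script", "style"}
    # Tags assumed to separate words when they start or end.
    BLOCK_TAGS = {"p", "br", "div", "li", "tr", "td",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces, then collapse every run of whitespace to one space.
        return " ".join("".join(self._chunks).split())


def strip_tags(html):
    """Return the plain text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<html><body><p>Hello <b>world</b>!</p></body></html>"
    print(strip_tags(page))
```

For messy real-world pages a more forgiving parser such as Beautiful Soup is likely to cope better; the sketch only covers the simple case.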

However, I also have to ask: why not use the search API?
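As a sketch of what the API route could look like: the answer doesn't say which search API is meant, so this assumes Google's Custom Search JSON API, which returns JSON rather than HTML. The api_key and engine_id values are placeholders you would have to obtain yourself, and the function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint of the Custom Search JSON API (assumed to be the API in question).
ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id, start=1):
    """Build the request URL; api_key and engine_id are placeholders."""
    params = {"key": api_key, "cx": engine_id, "q": query, "start": start}
    return ENDPOINT + "?" + urlencode(params)


def extract_results(payload):
    """Pull (title, link, snippet) tuples out of a decoded JSON response."""
    return [
        (item.get("title", ""), item.get("link", ""), item.get("snippet", ""))
        for item in payload.get("items", [])
    ]


def search(query, api_key, engine_id):
    """Run one query over the network and return the extracted results."""
    with urlopen(build_search_url(query, api_key, engine_id)) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return extract_results(payload)
```

The snippets come back as short text already, so there are no search-page tags to clean; the linked pages themselves would still need to be fetched and stripped.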


I finally found a nice suite: BootCat.
