Java API for web scraping or web mining [duplicate]

Developer (https://www.devze.com), 2023-02-16 00:31. Source: the web
This question already has answers here: What are the pros and cons of the leading Java HTML parsers? [closed] (6 answers) Closed 5 years ago.

I'm looking for a good Java API for web scraping. I tried the Web-Harvest API (http://web-harvest.sourceforge.net/usage.php), but I find it a bit clunky. Any other suggestions?


I've used HttpUnit for exactly this task in production.


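Apache HttpComponents Client: http://hc.apache.org/httpcomponents-client-ga/

Maven dependency (note that these coordinates are for the legacy Commons HttpClient 3.1, the predecessor of the HttpComponents client linked above):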

<dependency>
  <groupId>commons-httpclient</groupId> 
  <artifactId>commons-httpclient</artifactId> 
  <version>3.1</version> 
</dependency>
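
To make the role of an HTTP client in a scraper concrete, here is a minimal sketch of the fetch step. It deliberately uses the JDK's built-in java.net.http.HttpClient (Java 11+) rather than the Apache library above, so it runs without any extra dependency. The class name FetchSketch, the User-Agent string and the example.com URL are illustrative assumptions, not part of the original answer.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class FetchSketch {

    // Builds a GET request with a per-request timeout and an explicit User-Agent.
    // The User-Agent value is a made-up placeholder.
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "fetch-sketch/0.1")
                .GET()
                .build();
    }

    // Sends the request and returns the response body as a string.
    // Any status outside the 2xx range is treated as a failure.
    public static String fetch(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        int status = response.statusCode();
        if (status < 200 || status >= 300) {
            throw new IOException("Unexpected HTTP status " + status + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // One client instance can be reused for many requests.
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        String url = args.length > 0 ? args[0] : "https://example.com/";
        String html = fetch(client, url);
        System.out.println(html.length() + " characters fetched from " + url);
    }
}
```

The returned HTML string is what you would then hand to a parser. Keeping buildRequest separate from fetch makes it easy to adjust headers or timeouts per site without touching the sending logic.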


I use this: https://github.com/subes/invesdwin-webproxy

It supports HttpClient and HtmlUnit (a headless browser with JavaScript support) and, if required, parallelizes requests across a large pool of proxies. I can also recommend jsoup for static HTML processing.
