Best HTTP library for Java?

Asked on devze.com, 2023-01-15 15:17 (source: web)
I want to develop an HTTP client in Java for a college project that logs in to a site, extracts data from the HTML, and fills in and submits forms. I don't know which HTTP library to use. Apache HttpClient doesn't create a DOM model, but it handles HTTP redirects and multithreading. HTTPUnit creates a DOM model and makes it easy to work with forms, fields, tables, etc., but I don't know how it will work with multithreading and proxy settings.

Any advice?


It sounds like you are trying to create a web-scraping application. For this purpose, I recommend the HtmlUnit library.

It makes it easy to work with forms, proxies, and data embedded in web pages. Under the hood I think it uses Apache's HttpClient to handle HTTP requests, but that layer is probably too low-level for you to worry about.

With this library you can control a web page in Java the same way you would control it in a web browser: clicking a button, typing text, selecting values.

Here are some examples from HtmlUnit's getting started page:

Submitting a form:

@Test
public void submittingForm() throws Exception {
    final WebClient webClient = new WebClient();

    // Get the first page
    final HtmlPage page1 = webClient.getPage("http://some_url");

    // Get the form that we are dealing with and within that form, 
    // find the submit button and the field that we want to change.
    final HtmlForm form = page1.getFormByName("myform");

    final HtmlSubmitInput button = form.getInputByName("submitbutton");
    final HtmlTextInput textField = form.getInputByName("userid");

    // Change the value of the text field
    textField.setValueAttribute("root");

    // Now submit the form by clicking the button and get back the second page.
    final HtmlPage page2 = button.click();

    webClient.closeAllWindows();
}

Using a proxy server:

@Test
public void homePage_proxy() throws Exception {
    final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_2, "http://myproxyserver", myProxyPort);

    //set proxy username and password 
    final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient.getCredentialsProvider();
    credentialsProvider.addProxyCredentials("username", "password");

    final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
    assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());

    webClient.closeAllWindows();
}

The WebClient class is single threaded, so every thread that deals with a web page will need its own WebClient instance.
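
One common way to follow that one-instance-per-thread rule is a ThreadLocal. The sketch below is illustrative only: so that it runs without HtmlUnit on the classpath, it uses java.text.SimpleDateFormat (another class that is not thread-safe) as a stand-in; with HtmlUnit you would use ThreadLocal.withInitial(WebClient::new) instead.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerThreadInstance {
    // One instance per thread. With HtmlUnit this would be
    // ThreadLocal.withInitial(WebClient::new) instead of SimpleDateFormat.
    static final ThreadLocal<SimpleDateFormat> FORMAT = ThreadLocal.withInitial(() -> {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f;
    });

    // Formats the Unix epoch using this thread's private formatter.
    public static String formatEpoch() {
        return FORMAT.get().format(new Date(0));
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 8; i++) {
            // Each worker touches only its own thread's instance,
            // so no synchronization is needed.
            pool.submit(() -> System.out.println(
                    Thread.currentThread().getName() + ": " + formatEpoch()));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

The same pattern (create the non-thread-safe object lazily, once per worker thread) applies directly to WebClient.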

Unless you need to process JavaScript or CSS, you can disable both when you create the client:

WebClient client = new WebClient();
client.setJavaScriptEnabled(false);
client.setCssEnabled(false);


HTTPUnit is meant for testing purposes; I don't think it is well suited to being embedded inside your application.

When you want to consume HTTP resources (like web pages), I'd recommend Apache HttpClient. But you may find that framework too low-level for your use case, which is web-page scraping. So I'd recommend an integration framework like Apache Camel for this purpose. For example, the following route reads a web page (using Apache HttpClient), cleans the HTML into well-formed markup (using TagSoup), and transforms the result into an XML representation for further processing.

from("http://mycollege.edu/somepage.html").unmarshal().tidyMarkup().to("xslt:mystylesheet.xsl")

You can further process the resulting XML using XPath, or transform it into POJOs using JAXB, for example.
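
For the XPath route, the JDK already ships everything you need (javax.xml.parsers and javax.xml.xpath), so once the page has been tidied into well-formed XML you can query it with no extra dependencies. A self-contained sketch, with an inline XML snippet standing in for the tidied page (the "grades" table and course names are made up for illustration):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathScrape {
    // Stand-in for the well-formed XML that tidyMarkup() would produce.
    static final String XML =
            "<html><body>"
          + "<table id='grades'>"
          + "<tr><td>Math</td><td>A</td></tr>"
          + "<tr><td>Physics</td><td>B</td></tr>"
          + "</table>"
          + "</body></html>";

    // Looks up the second cell of the row whose first cell matches the course.
    public static String grade(String course) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//table[@id='grades']/tr[td[1]='" + course + "']/td[2]";
        return (String) xpath.evaluate(expr, doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Math: " + grade("Math"));
        System.out.println("Physics: " + grade("Physics"));
    }
}
```

In a real scraper you would parse the tidied page output instead of the inline string; the XPath query itself stays the same.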


HTTPUnit is for unit testing. Unless you mean "testing client", I don't think it's appropriate for creating an application.

I want to develop an HTTP client in Java

You realize, of course, that Apache HttpClient is not your answer either. It sounds like you want to create your first web app.

You'll need servlets and JSPs. Get Apache Tomcat and learn enough JSP and JSTL to do what you need. Don't bother with frameworks, since it's your first one.

When you have it running, then try a framework like Spring.


There also appears to be cURL support for Java:
http://curl.haxx.se/libcurl/java/


Depends on how complex your websites are. Options are Apache HttpClient (plus something like JTidy) or testing-oriented packages like HtmlUnit or Canoo WebTest. HtmlUnit is quite powerful - you'd be able to process JavaScript, for instance.


Jetty has a nice client-side library. I like to use it because I often need to create a server along with the client. Apache HttpClient is also really good and seems to have a few more features, such as being able to connect through a proxy using SSL.


If you really want to simulate a browser, then try Selenium RC.
