How to load ajax with HtmlUnit?

Developer (https://www.devze.com), 2023-03-22 03:50. Source: the web
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class YoutubeBot {
private static final String YOUTUBE = "http://www.youtube.com";

public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    WebClient webClient = new WebClient();
    webClient.setThrowExceptionOnScriptError(false);

    // Load a search results page directly (equivalent to typing this URL into the browser's address bar)
    HtmlPage currentPage = webClient.getPage("http://www.youtube.com/results?search_type=videos&search_query=official+music+video&search_sort=video_date_uploaded&suggested_categories=10%2C24&uni=3");

    // Get form where submit button is located
    HtmlForm searchForm = (HtmlForm) currentPage.getElementById("masthead-search");

    // Get the input field.
    HtmlTextInput searchInput = (HtmlTextInput) currentPage.getElementById("masthead-search-term");
    // Insert the search term.
    searchInput.setText("java");

    // Workaround: create a 'fake' button and add it to the form.
    HtmlButton submitButton = (HtmlButton) currentPage.createElement("button");
    submitButton.setAttribute("type", "submit");
    searchForm.appendChild(submitButton);

    //Workaround: use the reference to the button to submit the form. 
    HtmlPage newPage = submitButton.click();

    //Find all links on page with given class
    final List<HtmlAnchor> listLinks = (List<HtmlAnchor>) newPage.getByXPath("//a[@class='ux-thumb-wrap result-item-thumb']");

    //Print all links to console
    for (int i=0; i<listLinks.size(); i++)
        System.out.println(YOUTUBE + listLinks.get(i).getAttribute("href"));

    }
}

This code works, but I want to sort the YouTube clips, for example by upload date. How can I do this with HtmlUnit? I have to click the filter, which should load content via an AJAX request, and then click the "Upload date" link. I just don't know how to do the first step, loading the AJAX content. Is this possible with HtmlUnit?


This worked for me. Set the AJAX controller:

webClient.setAjaxController(new NicelyResynchronizingAjaxController());

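This makes HtmlUnit perform the AJAX calls triggered by your own actions (such as click()) synchronously, so the page has been updated by the time the call returns.

This is how I set up my WebClient object: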

WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.getOptions().setThrowExceptionOnScriptError(false);


Here's one way to do it:

  1. Search the page as you did in your previous question.
  2. Select search-lego-refinements block by id.
  3. Use XPath to find the sort link (.//ul/li/a, evaluated relative to the element from the previous step).
  4. Click the selected link.

The following code sample shows how it could be done:

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class YoutubeBot {
   private static final String YOUTUBE = "http://www.youtube.com";

   @SuppressWarnings("unchecked")
   public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
      WebClient webClient = new WebClient();
      webClient.setThrowExceptionOnScriptError(false);

      // This is equivalent to typing youtube.com into the browser's address bar
      HtmlPage currentPage = webClient.getPage(YOUTUBE);

      // Get form where submit button is located
      HtmlForm searchForm = (HtmlForm) currentPage.getElementById("masthead-search");

      // Get the input field
      HtmlTextInput searchInput = (HtmlTextInput) currentPage.getElementById("masthead-search-term");

      // Insert the search term
      searchInput.setText("java");

      // Workaround: create a 'fake' button and add it to the form
      HtmlButton submitButton = (HtmlButton) currentPage.createElement("button");
      submitButton.setAttribute("type", "submit");
      searchForm.appendChild(submitButton);

      // Workaround: use the reference to the button to submit the form.
      currentPage = submitButton.click();

      // Get the div containing the filters
      HtmlElement filterDiv = currentPage.getElementById("search-lego-refinements");

      // Select the first link from the filter block (Upload date).
      // The leading '.' keeps the search inside filterDiv; a bare '//' would scan the whole page.
      HtmlAnchor sortByDateLink = ((List<HtmlAnchor>) filterDiv.getByXPath(".//ul/li/a")).get(0);

      // Click the 'Upload date' link
      currentPage = sortByDateLink.click();

      System.out.println(currentPage.asText());
   }
}
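
A note on the XPath step: in XPath 1.0, an expression that starts with // is evaluated from the document root even when it is run on a single element, while .// only looks at that element's descendants. The following is a minimal sketch of the difference. It uses the JDK's javax.xml.xpath on a made-up fragment rather than HtmlUnit or the real YouTube markup, so the class name, the sample markup and the 'filters' id are all assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RelativeXPathDemo {
    // Hypothetical markup: an unrelated list first, then the filter block.
    static final String SAMPLE =
        "<body>"
        + "<ul><li><a href='/other'>Other</a></li></ul>"
        + "<div id='filters'><ul><li><a href='/by-date'>Upload date</a></li></ul></div>"
        + "</body>";

    /**
     * Evaluates the expression with the filter div as the context node
     * and returns the href of every matched link, in document order.
     */
    static List<String> hrefsFromFilterBlock(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node filterDiv = (Node) xpath.evaluate("//div[@id='filters']", doc, XPathConstants.NODE);
            NodeList links = (NodeList) xpath.evaluate(expression, filterDiv, XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                hrefs.add(((Element) links.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Absolute path: also matches the link outside the filter block.
        System.out.println(hrefsFromFilterBlock("//ul/li/a"));
        // Relative path: only links inside the filter block.
        System.out.println(hrefsFromFilterBlock(".//ul/li/a"));
    }
}
```

With the absolute form, taking the first match can silently pick a link from elsewhere on the page, so the relative form is the safer choice when starting from a specific element.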

You could also just request the correct query URL directly (http://www.youtube.com/results?search_type=videos&search_query=nyan+cat&search_sort=video_date_uploaded).

In that case you have to URL-encode your search parameter(s) yourself (for example, spaces become +).
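
The encoding step can be left to java.net.URLEncoder. Below is a minimal sketch that builds the query URL shown above for an arbitrary search term; the class and method names are made up for this example, and the parameter names are simply copied from that URL, so they may not match what the site expects today.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class YoutubeSearchUrl {
    private static final String RESULTS_URL = "http://www.youtube.com/results";

    /** Builds a results URL sorted by upload date, with the search term form-encoded. */
    static String sortedByUploadDate(String searchTerm) {
        try {
            // URLEncoder applies form encoding: spaces become '+', reserved characters become %XX.
            String encodedTerm = URLEncoder.encode(searchTerm, "UTF-8");
            return RESULTS_URL
                    + "?search_type=videos"
                    + "&search_query=" + encodedTerm
                    + "&search_sort=video_date_uploaded";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sortedByUploadDate("nyan cat"));
    }
}
```

The resulting string can then be passed to webClient.getPage(...) in place of the form submission.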


I've played with HtmlUnit before for similar purposes.

Actually you can find all information you need here. HtmlUnit has AJAX support enabled by default, so once you have the newPage object in your code you can issue click events on the page (find the specific element and call its click() method).

The trickiest part is that AJAX is asynchronous, so after performing the virtual click you have to call wait() or sleep() so that the JavaScript code on the site can process the action. A fixed sleep() is not the best approach, since network latency makes it unreliable. Instead, look for something on the page that changes when the event triggers the AJAX call (e.g. a header title changes) and check at regular intervals whether that change has happened yet. (I should mention that there's an event resynchronizer built into HtmlUnit, but I couldn't get it to work the way I expected.)

I use Firebug or Chrome's developer tools for examining the site. You can inspect the DOM tree before and after the AJAX calls; that way you'll know how to reference specific controls (like links and dropdown menus) on the page.
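
The "check regularly" idea can be sketched as a small polling helper. This is a generic sketch in plain Java, not part of HtmlUnit: the class name, method name and parameters are assumptions, and the condition you pass in would be whatever check fits your page (for instance, whether an XPath query on the current page now returns a result).

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class Polling {
    /**
     * Re-checks the condition every pollIntervalMillis until it holds or
     * timeoutMillis has elapsed. Returns true if the condition became true.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollIntervalMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without seeing the change
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 200 ms after start.
        boolean changed = waitUntil(() -> System.currentTimeMillis() - start >= 200, 2000, 50);
        System.out.println("Change observed: " + changed);
    }
}
```

Compared with a single long sleep(), this returns as soon as the change is visible and still gives up after a bounded time. HtmlUnit's WebClient also has waitForBackgroundJavaScript(timeoutMillis), which may be enough on its own when you only need pending background scripts to finish.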

I would then use XPath to get specific elements, e.g. you can do this (from HtmlUnit's examples):

//get div which has a 'name' attribute of 'John'
final HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@name='John']").get(0);

YouTube doesn't actually use AJAX for re-sorting its results. When you click the sort dropdown on the results page (a decorated <button>), an absolutely positioned <ul> shows up (emulating the drop-down part of a combo box) with one <li> element per menu item. Each <li> contains a special <span> element with an href attribute attached. When you click the <span>, JavaScript navigates the browser to that href value.

For example, in my case the sort-by-relevance <span> element looks like this:

<span href="/results?search_type=videos&amp;search_query=test&amp;suggested_categories=2%2C24%2C10%2C1%2C28" class=" yt-uix-button-menu-item" onclick=";window.location.href=this.getAttribute('href');return false;">Relevancia</span>
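
Since the span only carries a site-relative href, one option is to read that attribute and load the target yourself instead of clicking. Below is a minimal sketch of turning such an href into an absolute URL with java.net.URI; the class and method names are made up, and the href is a shortened version of the one above. Note that &amp; is only the HTML source form: an attribute value read from the parsed DOM contains a plain &.

```java
import java.net.URI;

final class SpanHref {
    /** Resolves a (possibly relative) href against the URL of the page it came from. */
    static String toAbsoluteUrl(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String pageUrl = "http://www.youtube.com/";
        // Attribute value as it would come out of the DOM (entities already decoded).
        String href = "/results?search_type=videos&search_query=test";
        System.out.println(toAbsoluteUrl(pageUrl, href));
    }
}
```

The absolute URL could then be passed to webClient.getPage(...), which is what the onclick handler effectively does in a real browser.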

You can get the list of these spans relatively easily, since the hosting <ul> is the only such child of <body>. However, you have to click the dropdown button first, because that is what creates the <ul> element and all the children described above using JavaScript. You can get the sort-by button with this XPath:

//div[@class='sort-by floatR']/button

You can test your XPath queries right in Chrome, for example: open the developer tools and then the JavaScript console from its toolbar. Then you can test like this:

>  $x("//div[@class='sort-by floatR']/button")

[
<button type="button" class=" yt-uix-button yt-uix-button-text yt-uix-button-active" onclick=";return false;" role="button" aria-pressed="true" aria-expanded="true" aria-haspopup="true" aria-activedescendant data-button-listener="26">…</button>
]

Hope this points you in the right direction.


http://htmlunit.sourceforge.net/faq.html#AJAXDoesNotWork
