Writing a simple web crawler that interacts with the browser (Java)

Posted on https://www.devze.com, 2023-01-07 14:47. Source: the web
I need to create an automated process (preferably using Java) that will:

  1. Open browser with specific url.
  2. Login, using the username and password specified.
  3. Follow one of the links on the page.
  4. Refresh the browser.
  5. Log out.

This is basically done to gather some statistics for analysis. Every time a user follows the link, a bunch of data is generated for that particular user and saved in the database. What I need to do is ping the page every 5-15 minutes, using around 10 fake users.

Can you think of a simple way of doing that? There has to be an alternative to an endless manual login-refresh-logout process...


Try Selenium.


It's not Java, but JavaScript can do this. You could do something like:

window.location = "<url>";
document.getElementById("username").value = "<email>";    
document.getElementById("password").value = "<password>";

document.getElementById("login_box_button").click();

...

etc

With this kind of structure you can easily cover steps 1-3. Throw in some for loops for the page refreshes and you're done.


Use HtmlUnit if you want

  1. FAST
  2. SIMPLE

Java-based web interaction/crawling.

For example, here is some simple code that prints the page title and then accesses every IMG element of the loaded page:

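// Package names below are those of HtmlUnit 2.x; HtmlUnit 3.x uses org.htmlunit instead.
import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitTest {
  public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    // Headless browser: loads the page without opening a window.
    final WebClient webClient = new WebClient();
    final HtmlPage page = webClient.getPage("http://www.google.com");
    System.out.println(page.getTitleText());

    // Walk every element on the page and report the IMG elements.
    for (HtmlElement node : page.getHtmlElementDescendants()) {
      if ("img".equalsIgnoreCase(node.getTagName())) {
        System.out.println("NAME: " + node.getTagName());
        System.out.println("WIDTH: " + node.getAttribute("width"));
        System.out.println("HEIGHT: " + node.getAttribute("height"));
        System.out.println("TEXT: " + node.asText());
        System.out.println("XML: " + node.asXml());
      }
    }
  }
}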

Example #2: accessing named input fields, entering data, and clicking (this reuses webClient from the first example):

final HtmlPage page = webClient.getPage("http://www.google.com");

HtmlElement inputField = page.getElementByName("q");
inputField.type("Example input");

HtmlElement btnG = page.getElementByName("btnG");
Page secondPage = btnG.click();

if (secondPage instanceof HtmlPage) {
  System.out.println(page.getTitleText());
  System.out.println(((HtmlPage)secondPage).getTitleText());
}

NB: To reload the page (step 4 in the question), you can call page.refresh() on the HtmlPage object.
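To repeat a session like this for each fake user every 5-15 minutes, as the question asks, one option is a ScheduledExecutorService from the standard library. The following is only a minimal sketch: the class name PingScheduler, the user names, and the placeholder session are all hypothetical, and the session body is where the login/follow-link/refresh/logout steps would go.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class PingScheduler {

    /** Returns a random delay in milliseconds, between minMillis and maxMillis inclusive. */
    public static long nextDelayMillis(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    /**
     * Schedules one session for the given user after a random delay, and
     * re-schedules the same user once that session has finished.
     */
    public static void scheduleUser(ScheduledExecutorService pool, String user,
                                    Consumer<String> session, long minMillis, long maxMillis) {
        pool.schedule(() -> {
            try {
                session.accept(user); // login, follow link, refresh, logout
            } catch (RuntimeException e) {
                // One failed session should not stop future runs for this user.
                System.err.println("Session failed for " + user + ": " + e);
            }
            if (!pool.isShutdown()) {
                scheduleUser(pool, user, session, minMillis, maxMillis);
            }
        }, nextDelayMillis(minMillis, maxMillis), TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical account names; the real ones come from your own test setup.
        List<String> users = List.of("fake01", "fake02", "fake03");
        // Placeholder session: replace the body with the HtmlUnit (or Selenium) steps.
        Consumer<String> session = user -> System.out.println("Running session for " + user);

        ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
        // Demo delays of 100-300 ms so this program finishes quickly;
        // for the real job pass 5 * 60_000L and 15 * 60_000L instead.
        for (String user : users) {
            scheduleUser(pool, user, session, 100, 300);
        }
        Thread.sleep(1000);  // let a few demo sessions run
        pool.shutdownNow();  // a real job would keep the pool running
    }
}
```

Each user draws a fresh random delay after every run, so the ten users drift apart instead of hitting the page at the same moment.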


You could use Apache JMeter (formerly Jakarta JMeter).
