How to get the right page?

Developer (devze.com) https://www.devze.com 2023-04-08 05:55 Source: Internet
I use the HtmlUnit library to scrape the Yellowpages.com site. I want to type a search term into the search box and click the Find button. But after the click I get two pages: http://www.yellowpages.com/ny/sport?g=NY&q=Sport and https://dealoftheday.yellowpages.com/join?ic=deal_pop-under_signup-v- The first one is what I want; the second one is a pop-up. I have this code:

public void getPage() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    WebClient webClient = new WebClient();
    // Load the home page and fill in the search box.
    HtmlPage page = webClient.getPage("http://www.yellowpages.com");
    HtmlTextInput searchInput = (HtmlTextInput) page.getElementById("search-terms");
    searchInput.setText("Law");

    // Submit the search; click() returns whatever page ends up in the current window.
    HtmlSubmitInput button = (HtmlSubmitInput) page.getElementById("search-submit");
    page = button.click();
    System.out.println(page.getTitleText());
}

This code prints:

Deal of the Day on YP.com - Join

But I want to print the title of the first page, which is:

NY Sport | Sport in NY - YP.com

How do I get the first page?

EDIT: After adding the line webClient.setPopupBlockerEnabled(true), I got a lot of warnings, followed by exceptions. Here is part of the console output:

Exception in thread "main" ======= EXCEPTION START ========
EcmaError: lineNumber=[56] column=[0] lineSource=[null] name=[TypeError] sourceName=[http://i2.ypcdn.com/webyp/javascripts/home_packaged.js?13455] message=[TypeError: Cannot call method "blur" of null (http://i2.ypcdn.com/webyp/javascripts/home_packaged.js?13455#56)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "blur" of null (http://i2.ypcdn.com/webyp/javascripts/home_packaged.js?13455#56)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:601)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:531)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptFunctionIfPossible(HtmlPage.java:906)
    at com.gargoylesoftware.htmlunit.javascript.host.EventListenersContainer.executeEventListeners(EventListenersContainer.java:164)
    at com.gargoylesoftware.htmlunit.javascript.host.EventListenersContainer.executeBubblingListeners(EventListenersContainer.java:223)
    at com.gargoylesoftware.htmlunit.javascript.host.Node.fireEvent(Node.java:686)
    at com.gargoylesoftware.htmlunit.html.HtmlElement$2.run(HtmlElement.java:885)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.fireEvent(HtmlElement.java:890)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.fireEvent(HtmlElement.java:865)
    at com.gargoylesoftware.htmlunit.html.HtmlForm.submit(HtmlForm.java:108)
    at com.gargoylesoftware.htmlunit.html.HtmlSubmitInput.doClickAction(HtmlSubmitInput.java:77)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1263)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1214)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1177)
    at YellowPages.getPage(YellowPages.java:39)
    at YellowPages.main(YellowPages.java:22)
Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot call method "blur" of null (http://i2.ypcdn.com/webyp/javascripts/home_packaged.js?13455#56)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3772)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3750)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3778)


That looks like a JavaScript error. Try disabling JavaScript:

webClient.setJavaScriptEnabled(false);

Or what about this?

webClient.setThrowExceptionOnScriptError(false);

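If you are using HtmlUnit 2.11 or later, call these setters on webClient.getOptions() instead, for example webClient.getOptions().setThrowExceptionOnScriptError(false).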


Have you tried this?

webClient.setPopupBlockerEnabled(true);

Then you should get only one page.


Not tested, but I think you might iterate through the WebClient's top level windows (using WebClient.getTopLevelWindows()), call getEnclosedPage() and test if the title text of the page is the one you're looking for.
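
The selection step in that suggestion can be sketched without HtmlUnit on the classpath. This is only an illustration of the idea: WindowPicker, OpenWindow and findByTitle are hypothetical names, not part of HtmlUnit, and the window list is hard-coded from the two URLs and titles quoted in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class WindowPicker {

    // Hypothetical stand-in for an open window: its URL and the title of the page it holds.
    static final class OpenWindow {
        final String url;
        final String title;

        OpenWindow(String url, String title) {
            this.url = url;
            this.title = title;
        }
    }

    // Returns the first window whose title contains the wanted fragment.
    // titleOf may return null (for example, for a window holding a non-HTML page).
    static <W> Optional<W> findByTitle(List<W> windows, Function<W, String> titleOf, String wanted) {
        for (W window : windows) {
            String title = titleOf.apply(window);
            if (title != null && title.contains(wanted)) {
                return Optional.of(window);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The two pages described in the question: the pop-up and the search results.
        List<OpenWindow> windows = Arrays.asList(
            new OpenWindow("https://dealoftheday.yellowpages.com/join?ic=deal_pop-under_signup-v-",
                           "Deal of the Day on YP.com - Join"),
            new OpenWindow("http://www.yellowpages.com/ny/sport?g=NY&q=Sport",
                           "NY Sport | Sport in NY - YP.com"));

        Optional<OpenWindow> results = findByTitle(windows, w -> w.title, "Sport in NY");
        System.out.println(results.map(w -> w.url).orElse("no matching window"));
    }
}
```

With HtmlUnit, the list would presumably be webClient.getTopLevelWindows(), and the title function would call getEnclosedPage() on each window and read getTitleText() when the enclosed page is an HtmlPage. Like the suggestion itself, that wiring is untested here.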
