I am using the HtmlUnit library to scrape the Yellowpages.com site. I want to type a search term into the search box and click the Find button. But after the click I get 2 pages: http://www.yellowpages.com/ny/sport?g=NY&q=Sport and https://dealoftheday.yellowpages.com/join?ic=deal_pop-under_signup-v- The first one is what I want; the second one is a popup. I have this code:
public void getPage() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    WebClient webClient = new WebClient();
    page = webClient.getPage("http://www.yellowpages.com");
    HtmlTextInput searchInput = (HtmlTextInput) page.getElementById("search-terms");
    searchInput.setText("Law");
    HtmlSubmitInput button = (HtmlSubmitInput) page.getElementById("search-submit");
    page = button.click();
    System.out.println(page.getTitleText());
}
This code prints:
Deal of the Day on YP.com - Join
But I want to print the title of the first page, which is:
NY Sport | Sport in NY - YP.com
How can I get the first page?
EDIT: After adding the line webClient.setPopupBlockerEnabled(true), I got a lot of warnings followed by exceptions. Here is part of the console output:
Exception in thread "main" ======= EXCEPTION START ========
EcmaError: lineNumber=[56] column=[0] lineSource=[null] name=[TypeError] sourceName=[http://i2.ypcdn.com/webyp/javascripts/home_packaged.js?13455] message=[TypeError: Cannot call method "blur" of null (http://i2.ypcdn.com/webyp/javascripts/home_packaged.js?13455#56)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "blur" of null (http://i2.ypcdn.com/webyp/javascripts/home_packaged.js?13455#56)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:601)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:531)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptFunctionIfPossible(HtmlPage.java:906)
    at com.gargoylesoftware.htmlunit.javascript.host.EventListenersContainer.executeEventListeners(EventListenersContainer.java:164)
    at com.gargoylesoftware.htmlunit.javascript.host.EventListenersContainer.executeBubblingListeners(EventListenersContainer.java:223)
    at com.gargoylesoftware.htmlunit.javascript.host.Node.fireEvent(Node.java:686)
    at com.gargoylesoftware.htmlunit.html.HtmlElement$2.run(HtmlElement.java:885)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.fireEvent(HtmlElement.java:890)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.fireEvent(HtmlElement.java:865)
    at com.gargoylesoftware.htmlunit.html.HtmlForm.submit(HtmlForm.java:108)
    at com.gargoylesoftware.htmlunit.html.HtmlSubmitInput.doClickAction(HtmlSubmitInput.java:77)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1263)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1214)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1177)
    at YellowPages.getPage(YellowPages.java:39)
    at YellowPages.main(YellowPages.java:22)
Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot call method "blur" of null (http://i2.ypcdn.com/webyp/javascripts/home_packaged.js?13455#56)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3772)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3750)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3778)
Sounds like a JS error. Disable JS:
webClient.setJavaScriptEnabled(false);
And what about this?
webClient.setThrowExceptionOnScriptError(false);
If you are using HtmlUnit 2.11+, call these setters on webClient.getOptions() instead of on webClient directly.
Have you tried this?
webClient.setPopupBlockerEnabled(true)
Then you should get only one page.
Not tested, but I think you could iterate through the WebClient's top-level windows (using WebClient.getTopLevelWindows()), call getEnclosedPage() on each one, and test whether the title text of the page is the one you're looking for.
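The selection step in that suggestion could look roughly like the sketch below. To keep it runnable without HtmlUnit on the classpath, the window type is generic and the title lookup is passed in as a function; WindowPicker and firstMatchingTitle are hypothetical names, not part of HtmlUnit. With HtmlUnit, the list would presumably come from webClient.getTopLevelWindows(), and the title function would call getEnclosedPage() and, if the result is an HtmlPage, getTitleText().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical helper (not an HtmlUnit class): picks the first window
// whose title satisfies the given test.
final class WindowPicker {

    static <W> Optional<W> firstMatchingTitle(List<W> windows,
                                              Function<W, String> titleOf,
                                              Predicate<String> wanted) {
        for (W window : windows) {
            String title = titleOf.apply(window);
            // A window may hold a non-HTML page; treat a null title as "no match".
            if (title != null && wanted.test(title)) {
                return Optional.of(window);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in data: the two titles quoted in the question, used in place of real windows.
        List<String> titles = Arrays.asList(
                "Deal of the Day on YP.com - Join",
                "NY Sport | Sport in NY - YP.com");

        // Skip the popup by rejecting its title; everything else is accepted.
        Optional<String> result = firstMatchingTitle(
                titles, t -> t, t -> !t.startsWith("Deal of the Day"));

        System.out.println(result.orElse("(no matching window)"));
    }
}
```

Matching on "not the popup" rather than on the exact result title is a design choice here: the search result title changes with the search term, while the popup title is the same each time.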