I am trying to crawl 300,000 URLs. However, somewhere in the middle the code hangs when trying to retrieve the response code from a URL. I am not sure what is going wrong since a connection is being established but the problem is occurring after that. Any suggestions/pointers will be greatly appreciated. Also, is there any way to ping a website for a certain time period and if it's not responding just proceed to the next one?
I have modified the code as per the suggestions, setting the read timeout and the request property as suggested. However, even now the code is unable to obtain the response code!
Here is my modified code snippet:
URL url = null;

try
{
    Thread.sleep(8000);
}
catch (InterruptedException e1)
{
    e1.printStackTrace();
}

try
{
    // urlToBeCrawled comes from the database
    url = new URL(urlToBeCrawled);
}
catch (MalformedURLException e)
{
    e.printStackTrace();
    // The code is in a loop, hence the use of continue. I apologize for putting code in the catch block.
    continue;
}

HttpURLConnection huc = null;
try
{
    huc = (HttpURLConnection) url.openConnection();
}
catch (IOException e)
{
    e.printStackTrace();
}

try
{
    // Added the request property
    huc.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
    huc.setRequestMethod("HEAD");
}
catch (ProtocolException e)
{
    e.printStackTrace();
}

huc.setConnectTimeout(1000);
try
{
    huc.connect();
}
catch (IOException e)
{
    e.printStackTrace();
    continue;
}

int responseCode = 0;
try
{
    // Sets the read timeout
    huc.setReadTimeout(15000);
    // Code hangs here for some URL, which is random in each run
    responseCode = huc.getResponseCode();
}
catch (IOException e)
{
    huc.disconnect();
    e.printStackTrace();
    continue;
}

if (responseCode != 200)
{
    huc.disconnect();
    continue;
}
A server is holding the connection open but is not responding. It may even be detecting that you're spidering their site, and the firewall or anti-DDoS tools may be intentionally trying to confuse you. Be sure you set a user-agent (some servers will get angry if you don't). Also, set a read timeout so that if it fails to read after a while, it'll give up:
huc.setReadTimeout(15000);
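Both timeouts and the user-agent need to be in place before the request actually goes out. Here is a minimal sketch of that ordering, folded into a small helper; the class name, method name, and timeout values are placeholders of mine, not part of your code:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

public class UrlChecker {
    // Returns the HTTP status, or -1 if the server does not answer within
    // the timeouts, so the calling loop can simply move on to the next URL.
    public static int checkUrl(String urlToBeCrawled) {
        HttpURLConnection huc = null;
        try {
            URL url = new URL(urlToBeCrawled);
            huc = (HttpURLConnection) url.openConnection();
            // Set timeouts and request properties *before* any network I/O.
            huc.setConnectTimeout(5000);   // fail if no connection within 5 s
            huc.setReadTimeout(15000);     // fail if no data arrives for 15 s
            huc.setRequestMethod("HEAD");
            huc.addRequestProperty("User-Agent",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
            return huc.getResponseCode(); // connects and reads the status line
        } catch (SocketTimeoutException e) {
            return -1;                     // slow or silent server: skip it
        } catch (IOException e) {
            return -1;                     // malformed URL, refused connection, etc.
        } finally {
            if (huc != null) {
                huc.disconnect();
            }
        }
    }
}

With a helper like this, a URL that never sends its status line fails with a SocketTimeoutException after 15 seconds instead of hanging the crawl, which also answers your question about skipping sites that don't respond within a certain time.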
This really should be done with multi-threading, especially if you are attempting 300,000 URLs. I prefer the thread-pool approach for this.
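Roughly what I have in mind, assuming a fixed pool of 20 workers, a urls list coming from your database, and the checkUrl helper sketched above (the class names and pool size are illustrative, not prescriptive):

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Crawler {
    public static void main(String[] args) throws InterruptedException {
        // In the real crawl this list would come from the database.
        List<String> urls = Arrays.asList("http://example.com/", "http://example.org/");

        // A fixed-size pool: one slow or silent server only ties up a single
        // worker thread instead of stalling the whole crawl.
        ExecutorService pool = Executors.newFixedThreadPool(20);
        for (final String urlToBeCrawled : urls) {
            pool.submit(new Runnable() {
                public void run() {
                    int code = UrlChecker.checkUrl(urlToBeCrawled); // helper sketched above
                    System.out.println(code + " " + urlToBeCrawled);
                }
            });
        }
        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the workers to finish
    }
}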
Second, you will really benefit from a more robust HTTP client such as the Apache Commons HttpClient, as it lets you set the user-agent properly. Most JREs will not allow you to modify the user-agent through the HttpURLConnection class (they force it to your JDK version, e.g. Java/1.6.0_13 will be your user-agent). There are tricks to change this by adjusting a system property, but I have never seen that actually work. Just go with the Apache Commons HTTP library; you won't regret it.
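A rough sketch of the same HEAD check with Commons HttpClient 3.x (the timeout values and user-agent string are just examples I picked):

import java.io.IOException;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.HeadMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class CommonsHeadCheck {
    public static int headStatus(String urlToBeCrawled) {
        HttpClient client = new HttpClient();
        // The user-agent is an ordinary parameter here, no system-property tricks needed.
        client.getParams().setParameter(HttpMethodParams.USER_AGENT,
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
        client.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
        client.getHttpConnectionManager().getParams().setSoTimeout(15000); // read timeout
        HeadMethod head = new HeadMethod(urlToBeCrawled);
        try {
            return client.executeMethod(head); // returns the HTTP status code
        } catch (IOException e) {
            return -1; // treat unreachable or slow hosts as "skip"
        } finally {
            head.releaseConnection();
        }
    }
}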
Finally, you need a good HTTP debugger to deal with this. You can use Fiddler2 and just set up a Java proxy to point to Fiddler (scroll to the part about Java).
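For example, assuming Fiddler is listening on its default port 8888, you can route the JVM's HTTP traffic through it with the standard proxy system properties, set at the very start of main() before any connection is opened:

// Route the JVM's HTTP(S) traffic through Fiddler's default local proxy (port 8888)
// so every request from the crawler shows up in the Fiddler session list.
System.setProperty("http.proxyHost", "127.0.0.1");
System.setProperty("http.proxyPort", "8888");
System.setProperty("https.proxyHost", "127.0.0.1");
System.setProperty("https.proxyPort", "8888");

The same thing can be done on the command line with -Dhttp.proxyHost=127.0.0.1 -Dhttp.proxyPort=8888.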