Crawler leaves lots of ESTABLISHED TCP sockets to some servers

开发者 https://www.devze.com 2023-02-06 01:18 Source: web
I've got a Java web crawler. I've noticed that for a small number of servers I crawl I am left with a large number of ESTABLISHED sockets:

joel@bohr:~/tmp/test$ lsof -p 6760 | grep TCP 
java    6760 joel  105u  IPv6      96546      0t0      TCP bohr:55602->174.143.223.193:www (ESTABLISHED)
java    6760 joel  109u  IPv6      96574      0t0      TCP bohr:55623->174.143.223.193:www (ESTABLISHED)
java    6760 joel  110u  IPv6      96622      0t0      TCP bohr:55644->174.143.223.193:www (ESTABLISHED)
java    6760 joel  111u  IPv6      96674      0t0      TCP bohr:55665->174.143.223.193:www (ESTABLISHED)

There can be many tens of these to any one server, and I can't figure out why they are being left open.

I'm using HttpURLConnection to establish a connection and read data. HTTP 1.1 is in use and keep-alive is on (by default). It's my understanding that Java's HttpURLConnection will re-use the underlying TCP socket to a remote server, so long as all data is read from the input/error stream and the stream is then closed. It's also my understanding that if an exception is thrown, then so long as the input/error stream is closed (if not null), the socket, although not re-used, will be closed. (Java handling of HTTP keep-alive)
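For reference, the "read everything, then close" step described above can be sketched as a small helper. This is only an illustration of that understanding: the class and method names (StreamDrain, drainAndClose) are hypothetical and are not part of the crawler or of the JDK.

```java
import java.io.IOException;
import java.io.InputStream;

class StreamDrain {

    /**
     * Reads and discards whatever is left in the stream, then closes it.
     * Returns the number of bytes discarded. A null stream is treated as empty.
     */
    static long drainAndClose(InputStream in) throws IOException {
        if (in == null) {
            return 0;
        }
        long discarded = 0;
        byte[] buffer = new byte[8192];
        // try-with-resources closes the stream even if read() throws
        try (InputStream stream = in) {
            int n;
            while ((n = stream.read(buffer)) != -1) {
                discarded += n;
            }
        }
        return discarded;
    }
}
```

It would be called on both conn.getInputStream() and conn.getErrorStream() once the useful part of the response has been consumed.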

My abbreviated code looks like this:

  InputStream is = null;
  HttpURLConnection conn = null; // declared outside the try so the catch block can see it
  try {
    conn = (HttpURLConnection) uri.toURL().openConnection();
    conn.setReadTimeout(10000);
    conn.setConnectTimeout(10000);
    conn.setRequestProperty("User-Agent", userAgent);
    conn.setRequestProperty("Accept", "text/html,text/xml,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    conn.setRequestProperty("Accept-Encoding", "gzip deflate");
    conn.setRequestProperty("Accept-Language", "en-gb,en;q=0.5");
    conn.connect();

    try {
      int responseCode = conn.getResponseCode();
      is = conn.getInputStream();
    } catch (IOException e) {
      is = conn.getErrorStream();
      if (is != null) {
        // consume the error stream, http://download.oracle.com/javase/6/docs/technotes/guides/net/http-keepalive.html
        StreamUtils.readStreamToBytes(is, -1, MAX_LN);
      }
      throw e;
    }

    String type = conn.getContentType();

    byte[] response = StreamUtils.readStream(is);
    // do something with content

  } catch (Exception e) {
    if (conn != null) {
      conn.disconnect(); // don't try to re-use socket - just be done with it.
    }
    throw e;
  } finally {
    if (is != null) {
      is.close();
    }
  }

I've noticed that for a site where this is happening I get a lot of IOExceptions thrown when making GET requests, due to:

java.net.ProtocolException: Server redirected too many  times (20)

I'm pretty sure I'm handling this and closing the socket properly. Could it really be this, or is it something else I'm doing wrong? Could it be a result of misusing keep-alive, and if so, how do I fix it? I'd rather not have to turn keep-alive off to fix the problem.

EDIT: I've tested setting the following property:

        conn.setRequestProperty("Connection", "close"); // supposed to disable keep-alive

Sending the Connection: close header disabled persistent TCP connections, and all sockets are eventually cleaned up. So it would seem that the problem I am seeing does indeed come from keep-alive and sockets not being closed correctly, even after closing the input stream.

EDIT2 - could it be that one socket is created every time the request is redirected? Where this problem is noticeable, the request is being redirected 20 times before the exception above is thrown. If this is the case, is there a way of limiting the number of redirects on a URLConnection?
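On the redirect limit: as far as I know, the limit of 20 comes from the JDK's http.maxRedirects system property, so lowering it globally (for example -Dhttp.maxRedirects=5) is one option. Another is to switch off automatic redirects with setInstanceFollowRedirects(false) and follow them by hand with your own cap. The following is a minimal sketch of that second approach, not a drop-in fix; the class name BoundedRedirects and its methods are made up for illustration, and it deliberately disconnects after every redirect hop rather than trying to re-use the socket.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.ProtocolException;
import java.net.URL;

class BoundedRedirects {

    /** True for the status codes this sketch treats as redirects. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /**
     * Opens the URL, following at most maxRedirects redirects by hand.
     * Returns the first connection whose response is not a redirect; the
     * caller is responsible for reading it and then closing/disconnecting.
     */
    static HttpURLConnection open(URL start, int maxRedirects) throws IOException {
        URL current = start;
        for (int hop = 0; hop <= maxRedirects; hop++) {
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            conn.setInstanceFollowRedirects(false); // we follow redirects ourselves
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);

            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn;
            }

            String location = conn.getHeaderField("Location");
            conn.disconnect(); // release this hop's socket before moving on
            if (location == null) {
                throw new ProtocolException("Redirect without Location header from " + current);
            }
            current = new URL(current, location); // resolves relative Location values
        }
        throw new ProtocolException("Gave up after " + maxRedirects + " redirects from " + start);
    }
}
```

With a small cap such as BoundedRedirects.open(url, 5), a server stuck in a redirect loop costs at most six requests, and every intermediate connection is explicitly released.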


You need to move conn.disconnect() into your finally block. As it stands, you only disconnect if an exception is thrown.
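A minimal sketch of that shape, assuming a simple GET that reads the whole body into memory. The class and method names (FetchAndDisconnect, fetch) are hypothetical, and the header setup from the question is omitted for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

class FetchAndDisconnect {

    /** Fetches the body of a URL and always disconnects afterwards. */
    static byte[] fetch(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        // The stream is closed by try-with-resources first, then finally runs.
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            // Runs on success and on failure, so no socket is left behind.
            conn.disconnect();
        }
    }
}
```

Be aware of the trade-off: in the JDK implementations I'm familiar with, disconnect() generally closes the underlying socket instead of returning it to the keep-alive cache, so this buys reliable cleanup at the cost of connection re-use.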
