
Delay in multiple TCP connections from Java to the same machine

开发者 (Developer) https://www.devze.com 2022-12-13 12:01 Source: web

(See this question in ServerFault)

I have a Java client that uses Socket to open concurrent connections to the same machine. One request completes extremely fast, but the others see a delay of 100-3000 milliseconds. Packet inspection with Wireshark shows that every SYN packet after the first waits a long time before leaving the client. I see this on both Windows 2008 and Linux clients. What could be causing this?

Code attached:

import java.util.*;
import java.util.concurrent.*;
import java.net.*;

public class Tester {
    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            usage();
            return;
        }
        final int n = Integer.parseInt(args[0]);
        final String ip = args[1];
        final int port = Integer.parseInt(args[2]);

        ExecutorService executor = Executors.newFixedThreadPool(n);

        ArrayList<Callable<Long>> tasks = new ArrayList<Callable<Long>>();
        for (int i = 0; i < n; ++i)
            tasks.add(new Callable<Long>() {
                public Long call() {
                    Date before = new Date();
                    try {
                        Socket socket = new Socket();
                        socket.connect(new InetSocketAddress(ip, port));
                    }

                    catch (Throwable e) {
                        e.printStackTrace();
                    }
                    Date after = new Date();
                    return after.getTime() - before.getTime();
                }
            });
        System.out.println("Invoking");
        List<Future<Long>> results = executor.invokeAll(tasks);
        System.out.println("Invoked");
        for (Future<Long> future : results) {
            System.out.println(future.get());
        }
        executor.shutdown();
    }

    private static void usage() {
        System.out.println("Usage: prog <threads> <IP> <port>");
        System.out.println("Examples:");
        System.out.println("  prog 10 127.0.0.1 2000");
    }
}

Update: the problem reproduces consistently if I clear the relevant ARP entry before running the test program. I've tried tuning the TCP retransmission timeout, but that didn't help. We also ported this program to .NET, and the problem still happens.

Update 2: 3 seconds is the initial retransmission timeout for new connections specified in RFC 1122. I still don't fully understand why there is a retransmission here; it should be handled by the MAC layer. Also, we reproduced the problem using netcat, so it has nothing to do with Java.


It looks like you are using a single underlying HTTP connection. In that case, another request can't start until you call close() on the InputStream of the HttpURLConnection, i.e. until you have processed the response.

Alternatively, use a pool of HTTP connections.


You are doing the right thing in reducing the size of the problem space. On the surface this is an impossible problem: something that moves between IP stacks, languages and machines, and yet is not arbitrarily reproducible (e.g. I cannot reproduce it with your code on either Windows or Linux).

Some suggestions, going from the top of the stack to the bottom:

  • Code -- you say this happens on .Net and Java. Are there any language/compiler combinations for which it does not happen? I used your client talking to the SocketTest program from sourceforge and also "nc" with identical results - no delays. Similarly JDK 1.5 vs 1.6 made no difference for me.

    -- Suppose you pace the speed at which the client sends requests, say one every 500ms. Does the problem repro?

  • IP stack -- maybe something is getting stuck in the stack on the way out. I see you've ruled out Nagle but don't forget silly stuff like firewalls/ip tables. I'd find it hard to believe that the TCP stack on Win and Linux was that hosed, but you never know.

    -- loopback interface handling can be freaky. Does it repro when you use the machine's real IP? What about across the network (or better, back-to-back with a x-over cable to another machine)?

  • NIC -- if the packets are making it to the cards, consider features of the cards (TCP offload or other 'special' handling) or quirks in the NICs themselves. Do you get the same results with other brands of NIC?
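
The pacing experiment suggested above could be sketched roughly as follows. This is only a minimal sketch, not part of the original test program: the class name PacedTester, the helper timeConnect, and the gapMillis parameter are all made up for illustration. It starts one connect every gapMillis milliseconds instead of firing them all at once, and reports how long each connect took.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PacedTester {
    // Hypothetical helper: times one TCP connect in milliseconds, -1 on failure.
    static long timeConnect(String ip, int port) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(ip, port));
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (IOException e) {
            return -1;
        }
    }

    // Starts n connects, one every gapMillis, and returns each connect time.
    static List<Long> connectPaced(String ip, int port, int n, long gapMillis)
            throws Exception {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(n);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                Callable<Long> task = () -> timeConnect(ip, port);
                // The i-th connect is delayed by i * gapMillis.
                futures.add(scheduler.schedule(task, i * gapMillis, TimeUnit.MILLISECONDS));
            }
            List<Long> times = new ArrayList<>();
            for (Future<Long> future : futures) {
                times.add(future.get());
            }
            return times;
        } finally {
            scheduler.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int n = Integer.parseInt(args[0]);
        String ip = args[1];
        int port = Integer.parseInt(args[2]);
        for (long millis : connectPaced(ip, port, n, 500)) {
            System.out.println(millis);
        }
    }
}
```

If the delays disappear with a 500 ms gap but come back when the gap is 0, that points at something that only happens when several connects race each other.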


I haven't found a real answer from this discussion. The best theory I've come up with is:

  1. TCP layer sends a SYN to the MAC layer. This happens from several threads.
  2. First thread sees that IP has no match in the ARP table, sends an ARP request.
  3. Subsequent threads see there is a pending ARP request so they drop the packet altogether. This behavior is probably implemented in the kernel of several operating systems!
  4. ARP reply returns, the original SYN request from the first thread leaves the machine and a TCP connection is established.
  5. TCP layer waits 3 seconds as stated in RFC 1122, then retries and succeeds.

I've tried tweaking the timeout in Windows 7 but wasn't successful. If anyone can reproduce the problem and provide a workaround, I'll be most grateful. Also, if anyone has more details on why exactly this phenomenon happens only with multiple threads, it would be interesting to hear.
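
One way to probe this theory from the client side is to make a single connection first, so that the ARP entry already exists, and only then open the concurrent connections. The following is a rough sketch under that assumption; the class WarmupTester and its methods are hypothetical names, and it does not change how the kernel treats packets while an ARP request is pending.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class WarmupTester {
    // Hypothetical helper: times one TCP connect in milliseconds, -1 on failure.
    static long timeConnect(String ip, int port) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(ip, port));
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (IOException e) {
            return -1;
        }
    }

    // Returns n + 1 timings: the warm-up connect first, then the n concurrent ones.
    static List<Long> connectAfterWarmup(String ip, int port, int n) throws Exception {
        // A single sequential connect, so address resolution for the target
        // has finished before the concurrent burst starts.
        long warmup = timeConnect(ip, port);

        ExecutorService executor = Executors.newFixedThreadPool(n);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                tasks.add(() -> timeConnect(ip, port));
            }
            List<Long> times = new ArrayList<>();
            times.add(warmup);
            for (Future<Long> future : executor.invokeAll(tasks)) {
                times.add(future.get());
            }
            return times;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Long> times = connectAfterWarmup(args[1], Integer.parseInt(args[2]),
                Integer.parseInt(args[0]));
        System.out.println("Warm-up: " + times.get(0) + " ms");
        System.out.println("Concurrent: " + times.subList(1, times.size()));
    }
}
```

If the theory holds, clearing the ARP entry and running this should show the cost in the warm-up connect only, with the concurrent connects all completing quickly.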

I'll try to accept this answer as I don't think any of the answers provided a true explanation (see this discussion on meta).


If either of the machines is a Windows box, I'd take a look at the Max Concurrent Connections setting on both. See: http://www.speedguide.net/read_articles.php?id=1497

I think this is an app-level limit in some cases, so you'll have to follow the guide to raise it.

In addition, if this is what happens, you should see something in the System Event Log on the offending machine.


Java client that uses HttpURLConnection to open concurrent connections to the same machine.

The same machine? Which application accepts the clients? If you wrote that program yourself, you may want to time how fast your server accepts clients. Maybe the server application is simply badly written or slow. I imagine the server code looks something like this:

ServerSocket ss = ...;
while (acceptingMoreClients)
{
   Socket s = ss.accept();
   // At this point the client is connected to the server, so start timing.
   long start = System.currentTimeMillis();
   ClientHandler handler = new ClientHandler(s);
   handler.start();

   // Once handler.start() returns, the handler thread is running,
   // so the remaining statements complete very quickly
   // and the server is ready to accept the next client.
   // Stop timing.
   long stop = System.currentTimeMillis();
   System.out.println("Client accepted in " + (stop - start) + " millis");
}

If the results are bad, then you know where the problem lies.
I hope this brings you closer to a solution.


Question:

For the test, do you use the IP you received from the DHCP server or 127.0.0.1? If it is the one from the DHCP server, everything goes through your company's router, switch and so on, which can slow down the whole process.

Otherwise:

  • On Windows, all localhost-to-localhost TCP traffic is handled in the software layer of the system (not the hardware layer), which is why you cannot see it with Wireshark. Wireshark only sees traffic that passes through the hardware layer.
  • Linux: Wireshark can only see the traffic at the hardware layer. Linux doesn't redirect in the software layer. That is also why InetAddress.getLocalHost().getAddress() returns 127.0.0.1.

  • So on Windows it is quite normal that you cannot see the SYN packet with Wireshark.

Martijn.


The fact that you see this on multiple clients, with different OS's, and with different application environments on (I assume) the same OS is a strong indication that it's a problem with either the network or the server, not the client. This is reinforced by your comment that clearing the ARP table reproduces the problem.

Do you perhaps have two machines on the switch with the same MAC address? (one of which will probably be a router that's spoofing the MAC address).

Or more likely, if I recall ARP correctly, two machines that have the same hardcoded IP address. When the client sends out "who is IP 123.456.123.456", both will answer, but only one will actually be listening.

Another possibility (I've seen this happen in a corporate environment) is a rogue DHCP server, again giving out the same IP addresses to two machines.


Since the problem isn't reproducible unless you clear the associated ARP cache, what does the entire packet trace look like from a timing perspective, from the time the ARP request is issued until after the 3 second delay?

What happens if you open connections to two different IPs? Will the first connections to both succeed? If so, that should rule out any JVM or library issues.

The first SYN can't be sent until the ARP response arrives. Maybe the OS or TCP stack uses a timeout instead of an event for threads beyond the first one that try to open a connection when the associated MAC address isn't known.

Imagine the following scenario:

  1. Thread #1 tries to connect, but the SYN can't be sent because the ARP cache is empty, so it queues the ARP request.
  2. Next, Thread #2 (through #N) tries to connect. It also can't send the SYN packet because the ARP cache is empty. This time, though, instead of sending another ARP request, the thread goes to sleep for 3 seconds, as it says in the RFC.
  3. Next, the ARP response arrives. Thread #1 wakes up immediately and sends the SYN.
  4. Thread #2 isn't waiting on the ARP request; it has a hard-coded 3-second sleep. So after 3 seconds, it wakes up, finds the ARP entry it needs, and sends the SYN.


I have seen similar behavior when I was getting DNS timeouts. To test this, you can either use the IP address directly or enter the IP address in your hosts file.
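
A quick way to check for this from Java is to time the name lookup separately from the TCP connect. The sketch below is illustrative only; the class DnsVsConnect and the method timeResolveAndConnect are made-up names. It resolves the host first with InetAddress.getByName and then connects to the already-resolved address, so the two numbers can be compared.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

public class DnsVsConnect {
    // Returns {resolveMillis, connectMillis} for one connection attempt.
    static long[] timeResolveAndConnect(String host, int port) throws IOException {
        long t0 = System.nanoTime();
        // Name lookup; for a literal IP address only the format is checked.
        InetAddress address = InetAddress.getByName(host);
        long t1 = System.nanoTime();
        try (Socket socket = new Socket()) {
            // The address is already resolved, so this step is the TCP connect only.
            socket.connect(new InetSocketAddress(address, port));
            long t2 = System.nanoTime();
            return new long[] {
                TimeUnit.NANOSECONDS.toMillis(t1 - t0),
                TimeUnit.NANOSECONDS.toMillis(t2 - t1)
            };
        }
    }

    public static void main(String[] args) throws IOException {
        long[] times = timeResolveAndConnect(args[0], Integer.parseInt(args[1]));
        System.out.println("resolve: " + times[0] + " ms, connect: " + times[1] + " ms");
    }
}
```

If the resolve time is small and the connect time carries the delay, name resolution can be ruled out.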


Does setting socket.setTcpNoDelay( true ) help?


Have you tried running your client under strace to see what system calls are made?

It's been very helpful to me in the past, while debugging some mysterious networking issues.


What is the listen backlog on the server? How quickly is it accepting connections? If the backlog fills up, the OS ignores connection attempts. 3 seconds later, the client tries again and gets in now that the backlog has cleared.
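
If the server is written in Java, the backlog can be requested explicitly through the two-argument ServerSocket constructor. The following is a minimal sketch, assuming a hypothetical BacklogServer class with a trivial handler; the backlog value is only a hint, and the operating system may apply its own limit.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class BacklogServer {
    // Opens a listening socket with an explicit backlog hint.
    static ServerSocket open(int port, int backlog) throws IOException {
        return new ServerSocket(port, backlog);
    }

    // Accepts connections in a tight loop and hands each one to its own thread,
    // so pending connections spend as little time as possible in the backlog.
    static void acceptLoop(ServerSocket server) {
        while (!server.isClosed()) {
            try {
                final Socket client = server.accept();
                new Thread(() -> handle(client)).start();
            } catch (IOException e) {
                // accept() failed, for example because the server socket
                // was closed; stop accepting.
                break;
            }
        }
    }

    // Placeholder handler: a real server would read and write here.
    static void handle(Socket client) {
        try {
            client.close();
        } catch (IOException ignored) {
            // Nothing useful to do if closing fails.
        }
    }

    public static void main(String[] args) throws IOException {
        int port = Integer.parseInt(args[0]);
        int backlog = Integer.parseInt(args[1]);
        acceptLoop(open(port, backlog));
    }
}
```

Running the client against this server with a small backlog and then a large one would show whether the 3-second retries depend on the queue filling up.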
