We are moving large amounts of data on a LAN and it has to happen very rapidly and reliably. Currently we use Windows TCP sockets from C++. Using large (synchronous) sends moves the data much faster than a bunch of smaller (synchronous) sends, but it will frequently deadlock for long gaps of time (0.15 seconds), causing the overall transfer rate to plummet. This deadlock happens in very particular circumstances, which makes me believe it should be preventable altogether. More importantly, if we don't really know the cause, we don't really know that it won't also happen some time with smaller sends. Can anyone explain this deadlock?
Deadlock description (OK, zombie-locked: it isn't dead, but for 0.15 or so seconds it stops, then starts again):
1. The receiving side sends an ACK.
2. The sending side sends a packet containing the end of a message (push flag is set).
3. The call to socket.recv takes about 0.15 seconds(!) to return.
4. About the time the call returns, an ACK is sent by the receiving side.
5. The next packet from the sender is finally sent (why is it waiting? the TCP window is plenty big).
The odd thing about (3) is that typically that call doesn't take much time at all and receives exactly the same amount of data. On a 2 GHz machine, 0.15 seconds is roughly 300 million clock cycles' worth of time. I am assuming the call doesn't (heaven forbid) wait for the received data to be ACKed before it returns, so the ACK must be waiting for the call to return, or both must be delayed by something else.
The problem NEVER happens when there is a second packet of data (part of the same message) arriving between steps 1 and 2. That part very clearly makes it sound like it has to do with the fact that Windows TCP will not send back a no-data ACK until either a second packet arrives or a 200 ms timer expires. However, the delay is less than 200 ms (it's more like 150 ms).
The third unseemly character (and to my mind the real culprit) is (5). Send is definitely being called well before that 0.15 seconds is up, but the data NEVER hits the wire before that ACK returns. That is the most bizarre part of this deadlock to me. It's not a TCP blockage, because the TCP window is plenty big: we set SO_RCVBUF to something like 500*1460 (which is still under a meg). The data is coming in very fast (basically there is a loop spinning out data via send), so the buffer should fill almost immediately. MSDN mentions that there are various "heuristics" used in deciding when a send hits the wire, and that an already pending send plus a full buffer will cause send to block until the data hits the wire (otherwise send apparently just copies data into the TCP send buffer and returns).
Anyway, why the sender doesn't actually send more data during that 0.15 second pause is the most bizarre part to me. The information above was captured on the receiving side via Wireshark (except of course the socket.recv return times, which were logged in a text file). We tried changing the send buffer to zero and turning off Nagle on the sender (yes, I know Nagle is about not sending small packets - but we tried turning Nagle off in case that was part of the unstated "heuristics" affecting whether the message would be posted to the wire. Technically Microsoft's Nagle is that a small packet isn't sent if the buffer is full and there is an outstanding ACK, so it seemed like a possibility).
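For reference, here is a rough sketch of how those two sender-side options are typically set. It is not our production code: `configure_sender` and `socket_t` are hypothetical names, and the `#ifdef` is only there so the same snippet builds against either Winsock or BSD sockets. On Windows the socket is assumed to have been created after a successful WSAStartup.

```cpp
#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>
using socket_t = SOCKET;
#else
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>
using socket_t = int;
#endif

// Hypothetical helper: disables Nagle and requests the given send buffer
// size on an already created TCP socket. Returns false if either
// setsockopt call fails.
bool configure_sender(socket_t s, int send_buffer_bytes) {
    int nodelay = 1;  // non-zero turns the Nagle algorithm off
    if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
                   reinterpret_cast<const char*>(&nodelay),
                   sizeof(nodelay)) != 0) {
        return false;
    }
    // The value is a request; the stack decides what it actually does with it.
    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                   reinterpret_cast<const char*>(&send_buffer_bytes),
                   sizeof(send_buffer_bytes)) != 0) {
        return false;
    }
    return true;
}
```

Calling `configure_sender(sock, 0)` corresponds to the "send buffer of zero, Nagle off" combination described above.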
The send blocking until the previous ACK is received almost certainly indicates that the TCP receive window is full (you can check this by using Wireshark to analyse the network traffic).
No matter how big your TCP window is, if the receiving application isn't processing data as fast as it's arriving then the TCP window will eventually fill up. How fast are we talking here? What is the receiving side doing with the data? (If you're writing the received data to disk then it's quite possible that your disk just can't keep up with a gigabit network at full bore).
OK, so you have a 730,000 byte receive window and you're streaming data at 480Mbps. That means it takes only 12ms to entirely fill your window - so when the 150ms delay on the receive side occurs, the receive window fills up almost instantly and causes the sender to stall.
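The arithmetic behind that estimate can be written out as a small sketch. The function names are made up for illustration; the inputs are the figures quoted in this thread (a 500*1460 byte window and 480 Mbps), and the model assumes the receiver drains nothing while it is stalled.

```cpp
#include <cassert>
#include <cstdint>

// Milliseconds needed to fill a receive window of `window_bytes` when data
// arrives at `rate_bits_per_sec` and the application reads nothing.
double window_fill_ms(std::uint64_t window_bytes, double rate_bits_per_sec) {
    assert(rate_bits_per_sec > 0.0);
    return static_cast<double>(window_bytes) * 8.0 / rate_bits_per_sec * 1000.0;
}

// Bytes the sender would push during a receiver stall of `stall_ms`
// if nothing throttled it.
double bytes_during_stall(double stall_ms, double rate_bits_per_sec) {
    return rate_bits_per_sec / 8.0 * (stall_ms / 1000.0);
}
```

With these numbers, `window_fill_ms(500 * 1460, 480e6)` comes out at a little over 12 ms, while `bytes_during_stall(150.0, 480e6)` is about 9,000,000 bytes, roughly twelve times the window. The window therefore cannot absorb a 150 ms pause on the receive side.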
So your root cause is this 150 ms delay in scheduling your receive process. There are any number of things that could cause that (it could be as simple as the kernel needing to flush dirty pages to disk to create some more free pages for your application); you could try increasing your process's scheduling priority, but there's no guarantee that that will help.
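One way to check whether the stall really is on the receive side is to time every blocking call and count the ones that exceed a threshold. The following is only a sketch: `timed_ms` and `StallDetector` are hypothetical helpers, not part of any socket API, and the threshold you pick is up to you.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <utility>

// Runs `call` once and returns its wall-clock duration in milliseconds,
// measured with a monotonic clock.
template <typename F>
double timed_ms(F&& call) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(call)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Keeps track of how many measured calls took at least `threshold_ms`
// and of the slowest call seen so far.
class StallDetector {
public:
    explicit StallDetector(double threshold_ms) : threshold_ms_(threshold_ms) {}

    void record(double elapsed_ms) {
        worst_ms_ = std::max(worst_ms_, elapsed_ms);
        if (elapsed_ms >= threshold_ms_) {
            ++stall_count_;
        }
    }

    std::size_t stall_count() const { return stall_count_; }
    double worst_ms() const { return worst_ms_; }

private:
    double threshold_ms_;
    double worst_ms_ = 0.0;
    std::size_t stall_count_ = 0;
};
```

In the receive loop this would be used along the lines of `detector.record(timed_ms([&] { n = recv(sock, buf, len, 0); }));`, where `sock`, `buf`, `len` and `n` are whatever your code already has. If the long recv calls line up with the gaps you see in Wireshark, that supports the scheduling explanation; if they don't, the cause is elsewhere.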