We have a Linux application (we don't have the source) that seems to be hanging. The socket between the two processes is reported as ESTABLISHED, and there is some data in the kernel socket buffers (although nowhere near the 16M configured via wmem/rmem). Both ends of the socket seem to be stuck on a sendto().
Below is some investigation using netstat/lsof and strace:
HOST A (10.152.20.28)
[root@hosta ~]# lsof -n -u df01 | grep 12959 | grep 12u
q 12959 df01 12u IPv4 4398449 TCP 10.152.20.28:38521->10.152.20.29:gsigatekeeper (ESTABLISHED)
[root@hosta ~]# netstat -anp | grep 38521
tcp 268754 90712 10.152.20.28:38521 10.152.20.29:2119 ESTABLISHED 12959/q
[root@hosta ~]# strace -p 12959
Process 12959 attached - interrupt to quit
sendto(12, "sometext\0somecode\0More\0exJKsss"..., 542, 0, NULL, 0 <unfinished ...>
Process 12959 detached
[root@hosta~]#
HOST B (10.152.20.29)
[root@hostb ~]# netstat -anp | grep 38521
tcp 72858 110472 10.152.20.29:2119 10.152.20.28:38521 ESTABLISHED 25512/q
[root@hostb ~]# lsof -n -u df01 | grep 38521
q 25512 df01 14u IPv4 6456715 TCP 10.152.20.29:gsigatekeeper->10.152.20.28:38521 (ESTABLISHED)
[root@hostb ~]# strace -p 25512
Process 25512 attached - interrupt to quit
sendto(14, "\0\10\0\0\0Owner\0sym\0Type\0Ctpy\0Time\0Lo"..., 207, 0, NULL, 0 <unfinished ...>
Process 25512 detached
[root@hostb~]#
We have upgraded the NIC driver to the latest and greatest. The systems are running RHEL 5.6 x64 (2.6.18-238.el5). I have checked the errata for RHEL 5.7 and 5.8, but I can see no mention of bugs with the bnx2 driver or the kernel.
Does anyone have any ideas of how to debug this further?
Is either side actually reading? If not, it could be that both sides' receive buffers are full, so each side advertises a zero receive window and its peer stops transmitting. Both send buffers then fill up as well, which will cause sendto to block. The non-zero Recv-Q and Send-Q values on both hosts in your netstat output are consistent with this. (It's possible that this could happen despite your setting of wmem/rmem if the application is setting the SO_RCVBUF and SO_SNDBUF socket options.)
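To make the mechanism concrete, here is a minimal Python sketch (not taken from your application) that reproduces the same lock-up over a loopback TCP connection: neither end ever reads, and both ends send until the kernel refuses more data. The helper names are made up for illustration, and the sockets are switched to non-blocking mode so the script reports the condition instead of hanging the way a blocking sendto would.

```python
import socket


def connected_tcp_pair():
    """Return two connected TCP sockets over the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    return client, server


def send_until_full(sock, chunk=b"x" * 4096):
    """Send without ever reading; return the byte count accepted before
    the kernel reports that a blocking send would have stalled."""
    sock.setblocking(False)
    total = 0
    try:
        while True:
            total += sock.send(chunk)
    except BlockingIOError:
        return total


def demo():
    a, b = connected_tcp_pair()
    try:
        # Effective buffer sizes; an application can override the sysctl
        # defaults per socket with setsockopt(SO_SNDBUF / SO_RCVBUF).
        sndbuf = a.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        rcvbuf = a.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        stuck_a = send_until_full(a)  # b never reads, so a fills up
        stuck_b = send_until_full(b)  # a never reads either
        return sndbuf, rcvbuf, stuck_a, stuck_b
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    sndbuf, rcvbuf, stuck_a, stuck_b = demo()
    print("SO_SNDBUF=%d SO_RCVBUF=%d" % (sndbuf, rcvbuf))
    print("a stalled after %d bytes, b stalled after %d bytes" % (stuck_a, stuck_b))
```

At the point where both calls have stalled, netstat would show non-zero Recv-Q and Send-Q on both ends of the connection, and only a read on one side can get things moving again.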
To debug this, I'd synchronize both machines' clocks, then run both applications under strace with the -e trace=network and -tt options, so you can compare the logs and see whether the application is reading at all.
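Once you have a log from each host, something like the following Python sketch can summarize it. It assumes the usual strace -tt line layout (a wall-clock timestamp followed by the syscall name) and simply counts receive-type and send-type calls and remembers when each was last seen. The function name and the sample lines in the usage note are hypothetical, not output from your systems. Keep in mind that -e trace=network will not show plain read()/write() calls on the socket, so widen the trace if the application might use those.

```python
import re

# Optional pid prefix (present when following several processes),
# then the -tt timestamp, then the syscall name.
LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\d{2}:\d{2}:\d{2}\.\d+)\s+(\w+)\(")

READ_CALLS = {"recv", "recvfrom", "recvmsg", "read"}
WRITE_CALLS = {"send", "sendto", "sendmsg", "write"}


def summarize(lines):
    """Count read-type and write-type syscalls and record the timestamp
    at which each kind was last started."""
    counts = {"read": 0, "write": 0}
    last_seen = {"read": None, "write": None}
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # "<... resumed>" lines, signals, exit messages, etc.
        timestamp, name = match.groups()
        if name in READ_CALLS:
            kind = "read"
        elif name in WRITE_CALLS:
            kind = "write"
        else:
            continue
        counts[kind] += 1
        last_seen[kind] = timestamp
    return counts, last_seen
```

Calling summarize(open("hosta.log")) and summarize(open("hostb.log")) on the two logs (file names assumed) and comparing the last read timestamps against the last write timestamps should show whether either side stopped reading well before it blocked in sendto.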
You could also use a network analyzer (such as Wireshark) to determine if the TCP receive window gets stuck on 0.
If this is the case, you could probably work around this by creating a small caching proxy, which would recv/send from both sides, buffering whatever can't be sent at the time.
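A rough Python sketch of the relay loop at the heart of such a proxy is below. It is only an illustration under several assumptions: relay is a hypothetical name, the two sockets are already connected (one to each application), the queues are unbounded so memory use grows for as long as a side refuses to read, and error handling is kept to the bare minimum.

```python
import select
import socket


def relay(left, right, poll_interval=0.5):
    """Shuttle bytes between two connected sockets, queueing in memory
    whatever the destination cannot accept yet."""
    for s in (left, right):
        s.setblocking(False)
    peer = {left: right, right: left}
    outbox = {left: bytearray(), right: bytearray()}  # bytes waiting to go out on each socket
    reading = {left, right}  # sockets that have not reported EOF yet
    eof_sent = set()         # sockets on which EOF was already passed along

    while reading or outbox[left] or outbox[right]:
        want_write = [s for s in (left, right) if outbox[s]]
        readable, writable, _ = select.select(
            list(reading), want_write, [], poll_interval)

        for s in readable:
            try:
                data = s.recv(65536)
            except BlockingIOError:
                continue
            except ConnectionResetError:
                data = b""  # treat a reset like EOF
            if data:
                outbox[peer[s]] += data  # always accept; never push back on the sender
            else:
                reading.discard(s)

        for s in writable:
            try:
                sent = s.send(outbox[s])
            except BlockingIOError:
                continue
            except (BrokenPipeError, ConnectionResetError):
                outbox[s].clear()  # destination is gone; drop what was queued
                continue
            del outbox[s][:sent]

        # Once one side has hit EOF and its data is flushed, pass the EOF on.
        for s in (left, right):
            if peer[s] not in reading and not outbox[s] and s not in eof_sent:
                try:
                    s.shutdown(socket.SHUT_WR)
                except OSError:
                    pass  # the peer may already have closed completely
                eof_sent.add(s)
```

In practice you would accept the connection from one application, open a second connection to the other, and hand both sockets to relay. Because the relay always drains both sides, neither application's sendto should block on the other failing to read, at the cost of the proxy's memory.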