Flush kernel's TCP buffer for `MSG_MORE`-flagged packets

The man page for send() mentions the MSG_MORE flag, which is said to act like TCP_CORK. I have a wrapper function around send():

int SocketConnection_Write(SocketConnection *this, void *buf, int len) {
    errno = 0;

    int sent = send(this->fd, buf, len, MSG_NOSIGNAL);

    if (errno == EPIPE || errno == ENOTCONN) {
        throw(exc, &SocketConnection_NotConnectedException);
    } else if (errno == ECONNRESET) {
        throw(exc, &SocketConnection_ConnectionResetException);
    } else if (sent != len) {
        throw(exc, &SocketConnection_LengthMismatchException);
    }

    return sent;
}

Assuming I want to use the kernel buffer, I could go with TCP_CORK: enable it whenever necessary and then disable it to flush the buffer. However, that costs an additional system call each time. MSG_MORE therefore seems more appropriate to me. I'd simply change the above send() line to:

int sent = send(this->fd, buf, len, MSG_NOSIGNAL | MSG_MORE);
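A minimal sketch of how such a wrapper could decide per call whether more data will follow, so that the last chunk of a message goes out without MSG_MORE. The name `send_chunk` and its `more` parameter are hypothetical and not part of the wrapper above; error handling is left to the caller.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: send one chunk and tell the kernel whether more
 * data will follow. Returns whatever send() returns (-1 on error, with
 * errno set). */
ssize_t send_chunk(int fd, const void *buf, size_t len, int more) {
    int flags = MSG_NOSIGNAL;

    assert(fd >= 0);
    if (more)
        flags |= MSG_MORE; /* hint: hold back partial segments for now */

    return send(fd, buf, len, flags);
}
```

A caller would pass `more = 1` for every chunk except the last one of a logical message.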

According to lwn.net, packets will be flushed automatically if they are large enough:

If an application sets that option on a socket, the kernel will not send out short packets. Instead, it will wait until enough data has shown up to fill a maximum-size packet, then send it. When TCP_CORK is turned off, any remaining data will go out on the wire.

But this section only refers to TCP_CORK. Now, what is the proper way to flush MSG_MORE packets?

I can only think of two possibilities:

Call send() with an empty buffer and without MSG_MORE being set
Re-apply the TCP_CORK option as described on this page

Unfortunately the whole topic is very poorly documented and I couldn't find much on the Internet.

I am also wondering how to check that everything works as expected? Obviously running the server through strace is not an option. So the simplest way would be to use netcat and then look at its strace output? Or will the kernel handle traffic transmitted over a loopback interface differently?
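One way to experiment without strace might be a small self-contained program that connects a TCP socket pair over loopback, sends with MSG_MORE, and then tries the second possibility listed above (turning TCP_CORK off). The sketch below is only that: `tcp_loopback_pair` and `more_then_uncork` are hypothetical names, and the check only confirms that the data arrives after uncorking. It does not show that the data was held back beforehand, nor whether loopback behaves like a real interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: sv[0] becomes the client end and sv[1] the accepted
 * end of a TCP connection over 127.0.0.1. Returns 0 on success, -1 on error. */
int tcp_loopback_pair(int sv[2]) {
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    int lfd;

    assert(sv != NULL);
    sv[0] = sv[1] = -1;
    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0 ||
        (sv[0] = socket(AF_INET, SOCK_STREAM, 0)) < 0 ||
        connect(sv[0], (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        (sv[1] = accept(lfd, NULL, NULL)) < 0) {
        if (sv[0] >= 0)
            close(sv[0]);
        close(lfd);
        return -1;
    }
    close(lfd); /* the listening socket is no longer needed */
    return 0;
}

/* Sends 3 bytes with MSG_MORE, clears TCP_CORK, then waits for the bytes on
 * the other end. Returns the number of bytes received, or -1 on error. */
int more_then_uncork(void) {
    int sv[2], off = 0, got = -1;
    char buf[3];

    if (tcp_loopback_pair(sv) < 0)
        return -1;
    if (send(sv[0], "abc", 3, MSG_NOSIGNAL | MSG_MORE) == 3 &&
        setsockopt(sv[0], IPPROTO_TCP, TCP_CORK, &off, sizeof(off)) == 0)
        got = (int)recv(sv[1], buf, sizeof(buf), MSG_WAITALL);
    close(sv[0]);
    close(sv[1]);
    return got;
}
```

Running a packet capture on the loopback interface next to this program would show when the segment actually leaves.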

I have taken a look at the kernel source and both assumptions seem to be true. The following extracts are from net/ipv4/tcp.c (2.6.33.1).

static inline void tcp_push(struct sock *sk, int flags, int mss_now,
                int nonagle)
{
    struct tcp_sock *tp = tcp_sk(sk);

    if (tcp_send_head(sk)) {
        struct sk_buff *skb = tcp_write_queue_tail(sk);
        if (!(flags & MSG_MORE) || forced_push(tp))
            tcp_mark_push(tp, skb);
        tcp_mark_urg(tp, flags, skb);
        __tcp_push_pending_frames(sk, mss_now,
                      (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
    }
}

Hence, if the flag is not set, the pending frames will definitely be flushed. But this is only the case when the buffer is not empty:

static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
             size_t psize, int flags)
{
(...)
    ssize_t copied;
(...)
    copied = 0;

    while (psize > 0) {
(...)
        if (forced_push(tp)) {
            tcp_mark_push(tp, skb);
            __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
        } else if (skb == tcp_send_head(sk))
            tcp_push_one(sk, mss_now);
        continue;

wait_for_sndbuf:
        set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
wait_for_memory:
        if (copied)
            tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);

        if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
            goto do_error;

        mss_now = tcp_send_mss(sk, &size_goal, flags);
    }

out:
    if (copied)
        tcp_push(sk, flags, mss_now, tp->nonagle);
    return copied;

do_error:
    if (copied)
        goto out;
out_err:
    return sk_stream_error(sk, flags, err);
}

With a zero-length buffer, the while loop's body is never executed because psize is not greater than 0. The out section offers another chance, since tcp_push() is called there, but only if copied is non-zero. As copied still holds its initial value of 0, that call is skipped as well.

So sending a packet with the length 0 will never result in a flush.

The next theory was to re-apply TCP_CORK. Let's take a look at the code first:

static int do_tcp_setsockopt(struct sock *sk, int level,
        int optname, char __user *optval, unsigned int optlen)
{

(...)

    switch (optname) {
(...)

    case TCP_NODELAY:
        if (val) {
            /* TCP_NODELAY is weaker than TCP_CORK, so that
             * this option on corked socket is remembered, but
             * it is not activated until cork is cleared.
             *
             * However, when TCP_NODELAY is set we make
             * an explicit push, which overrides even TCP_CORK
             * for currently queued segments.
             */
            tp->nonagle |= TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
            tcp_push_pending_frames(sk);
        } else {
            tp->nonagle &= ~TCP_NAGLE_OFF;
        }
        break;

    case TCP_CORK:
        /* When set indicates to always queue non-full frames.
         * Later the user clears this option and we transmit
         * any pending partial frames in the queue.  This is
         * meant to be used alongside sendfile() to get properly
         * filled frames when the user (for example) must write
         * out headers with a write() call first and then use
         * sendfile to send out the data parts.
         *
         * TCP_CORK can be set together with TCP_NODELAY and it is
         * stronger than TCP_NODELAY.
         */
        if (val) {
            tp->nonagle |= TCP_NAGLE_CORK;
        } else {
            tp->nonagle &= ~TCP_NAGLE_CORK;
            if (tp->nonagle&TCP_NAGLE_OFF)
                tp->nonagle |= TCP_NAGLE_PUSH;
            tcp_push_pending_frames(sk);
        }
        break;
(...)

As you can see, there are two ways to flush. You can either set TCP_NODELAY to 1 or TCP_CORK to 0. Luckily, neither path checks whether the flag is already set. Thus, my initial plan to re-apply the TCP_CORK flag can be simplified to just disabling it, even if it is not currently set.
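Going by the TCP_CORK branch quoted above, a flush could be sketched as a single setsockopt() call that clears the option. The function name `socket_flush_cork` is hypothetical; this assumes a Linux TCP socket and has only been reasoned about against the 2.6.33.1 extract, not other kernel versions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to push out data queued via MSG_MORE
 * by setting TCP_CORK to 0, whether or not the option was ever enabled.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int socket_flush_cork(int fd) {
    int off = 0;

    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}
```

This still costs one extra system call per flush, but only once per message rather than once to cork and once to uncork.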

I hope this helps someone with similar issues.

That's a lot of research... all I can offer is this empirical post note:

When I send a bunch of packets with MSG_MORE set, followed by a packet without MSG_MORE, the whole lot goes out. It works a treat for something like this:

  for (i=0; i<mg_live.length; i++) {
        // [...]
        if ((n = pth_send(sock, query, len, MSG_MORE | MSG_NOSIGNAL)) < len) {
           printf("error writing to socket (sent %i bytes of %i)\n", n, len);
           exit(1);
        }
     }
  }

  pth_send(sock, "END\n", 4, MSG_NOSIGNAL);

That is, when you're sending out all the packets at once, and have a clearly defined end... AND you are only using one socket.

If you tried writing to another socket in the middle of the above loop, you may find that Linux releases the previously held packets. At least that appears to be the trouble I'm having right now. But it might be an easy solution for you.