
Low Latency Networking Techniques and Silver Bullets


After some basic googling of low-latency networking I've come up with the following list of things programmers and system designers should consider when embarking on low latency networking:

  1. The design of the hardware, systems, and protocols has to be considered together

  2. Develop protocols on top of UDP instead of TCP and implement simple ACK/NAK and resend logic at the application level

  3. Reduce the number of context switches (preferably to zero) for the process or thread that consumes and packetizes data off the wire

  4. Use the best selector for the OS (select, kqueue, epoll, etc.; see the epoll sketch after this list)

  5. Use good-quality NICs and switches with large amounts of on-board buffer (FIFO)

  6. Use multiple NICs, with separate NICs dedicated to down-stream and up-stream data flows

  7. Reduce the number of IRQs being generated by other devices or software (in short remove them if they are not required)

  8. Reduce the use of mutexes and condition variables. Where possible, use lock-free programming techniques that exploit the architecture's CAS capabilities, e.g. lock-free containers (see the ring-buffer sketch after this list)

  9. Consider single-threaded over multi-threaded designs - context switches are very expensive.

  10. Understand and properly utilize your architecture's memory hierarchy (L1/L2 caches, RAM, etc.)

  11. Prefer complete control over memory management, rather than delegating to Garbage Collectors

  12. Use good-quality cables, keep them as short as possible, and reduce the number of twists and curls
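
To make points 2, 3 and 4 a bit more concrete, here is a rough sketch of the kind of receive loop I have in mind (Linux-specific C; the port number, core number and buffer sizes are arbitrary placeholders, and error handling is omitted):

    /* Sketch only: non-blocking UDP + epoll, with the consuming thread pinned to
     * one core so it is never migrated. Port 9000 and core 2 are placeholders. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <fcntl.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/types.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>

    int main(void)
    {
        /* Point 3: pin this thread to a single core. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);
        sched_setaffinity(0, sizeof(set), &set);

        /* Point 2: UDP, so retransmission and ordering are the application's problem. */
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);

        struct sockaddr_in addr = {0};
        addr.sin_family      = AF_INET;
        addr.sin_port        = htons(9000);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* Point 4: epoll as the selector on Linux. */
        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

        char buf[2048];
        for (;;) {
            struct epoll_event out;
            if (epoll_wait(ep, &out, 1, -1) <= 0)
                continue;
            /* Drain the socket; an ACK/NAK scheme would check a sequence number here. */
            ssize_t len;
            while ((len = recv(fd, buf, sizeof buf, 0)) > 0) {
                /* process buf[0..len) in place: no copies, no allocation */
            }
        }
    }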
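
And for point 8, a minimal sketch of a lock-free single-producer/single-consumer ring using C11 atomics, e.g. for handing packets from the wire-reading thread to a worker without a mutex. A multi-producer variant would use CAS (atomic_compare_exchange) on the indices; the sizes and types here are illustrative only:

    /* Sketch only: SPSC lock-free ring. Real code would hand out pool pointers
     * instead of copying a 2 KB struct per slot. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024                      /* must be a power of two */

    struct packet { size_t len; char data[2048]; };

    struct spsc_ring {
        _Atomic size_t head;                    /* written only by the consumer */
        _Atomic size_t tail;                    /* written only by the producer */
        struct packet slots[RING_SIZE];
    };

    /* Producer side: returns false if the ring is full (caller drops or retries). */
    static bool ring_push(struct spsc_ring *r, const struct packet *p)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SIZE)
            return false;                       /* full */
        r->slots[tail & (RING_SIZE - 1)] = *p;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Consumer side: returns false if the ring is empty. */
    static bool ring_pop(struct spsc_ring *r, struct packet *out)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return false;                       /* empty */
        *out = r->slots[head & (RING_SIZE - 1)];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }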

My question: I was wondering what other things fellow SOers believe are important when embarking on low latency networking.

Feel free to critique any of the above points


Cable quality is usually kind of a red herring. I'd think more about connecting up a network analyzer to see whether you're getting enough re-transmissions to care about. If you're getting very many, try to isolate where they're happening, and replace the cable(s) that is/are causing the problem. If you're not getting errors that result in re-transmissions, then the cable has (virtually) no effect on latency.

Large buffers on NICs and (especially) switches won't, themselves, reduce latency. In fact, to truly minimize latency, you normally want to use the smallest buffers you can, not larger ones. Data sitting in a buffer instead of being processed immediately increases latency. Truthfully, it's rarely worth worrying about, but still. If you really want to minimize latency (and care a lot less about bandwidth) you'd be better off using a hub than a switch (kind of hard to find anymore, but definitely low latency as long as network congestion is low enough).

Multiple NICs can help bandwidth a lot, but their effect on latency is generally pretty minimal.

Edit: My primary advice, however, would be to get a sense of scale. Reducing a network cable by a foot saves you about a nanosecond -- on the same general order as speeding up packet processing by a couple of assembly language instructions.
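
As a rough sanity check on that figure: signals propagate through copper or fibre at roughly two thirds of the speed of light, i.e. around 0.2 m/ns, so removing a foot (about 0.3 m) of cable saves roughly 0.3 / 0.2 ≈ 1.5 ns -- a nanosecond or so, as stated.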

Bottom line: Like any other optimization, to get very far you need to measure where you're getting latency before you can do much to reduce it. In most cases, reducing wire lengths (to use one example) won't make enough difference to notice, simply because it's fast to start with. If something starts out taking 10 microseconds, nothing you can do is going to speed it up any more than 10 microseconds, so unless you have things so fast that 10 us is a significant percentage of your time, it's not worth attacking.


Others:

1: use userland networking stacks

2: service interrupts on the same CPU socket as the handling code (shared cache)

3: prefer fixed-length protocols, even if they are a little larger in bytes (quicker parsing; see the fixed-length sketch after this list)

4: ignore the network byte order convention and just use native ordering

5: never allocate in hot routines; use object pools instead (esp. in garbage-collected languages; see the pool sketch after this list)

6: try to prevent byte copying as much as possible (hard in TCP send)

7: use cut-through switching mode

8: hack the networking stack to remove TCP slow start

9: advertise a huge TCP window (but don't use it) so the other side can have a lot of inflight packets at a time

10: turn off NIC interrupt coalescing, especially for send (packetize in the app stack if you need to)

11: prefer copper over optic

I can keep going, but that should get people thinking
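
To illustrate point 3, here is a sketch of what I mean by a fixed-length wire format in C; the field names and sizes are invented, and a real format would also pin down byte order and versioning:

    /* Sketch only: a fixed-length message parsed by copying into a packed struct --
     * no length-prefix walking, no dynamic allocation. */
    #include <stdint.h>
    #include <string.h>

    #pragma pack(push, 1)
    struct quote_msg {                 /* always exactly 26 bytes on the wire */
        uint16_t msg_type;
        uint64_t sequence;
        uint64_t price;                /* fixed-point, e.g. price * 10^6 */
        uint32_t size;
        uint32_t symbol_id;
    };
    #pragma pack(pop)

    /* Returns 0 on success; copies into an aligned struct to stay portable
     * (compilers turn the memcpy into a plain load). */
    static int parse_quote(const char *buf, size_t len, struct quote_msg *out)
    {
        if (len < sizeof(*out))
            return -1;
        memcpy(out, buf, sizeof(*out));
        return 0;
    }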
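
And for point 5, a trivial object-pool sketch in C so the hot path never calls malloc/free; the pool size and message type are placeholders, and a pool shared between threads would need a lock-free free list:

    /* Sketch only: fixed-size pool, preallocated up front, single-threaded use. */
    #include <stddef.h>

    #define POOL_SIZE 4096

    struct message { char payload[1500]; };

    static struct message  pool_storage[POOL_SIZE];
    static struct message *free_list[POOL_SIZE];
    static size_t          free_top;

    static void pool_init(void)
    {
        for (size_t i = 0; i < POOL_SIZE; i++)
            free_list[i] = &pool_storage[i];
        free_top = POOL_SIZE;
    }

    static struct message *pool_get(void)
    {
        return free_top ? free_list[--free_top] : NULL;   /* NULL: pool exhausted */
    }

    static void pool_put(struct message *m)
    {
        free_list[free_top++] = m;
    }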

One I don't agree with:

1: network cables are rarely an issue except when they have gone bad (there is an exception to this in terms of cable type)


This may be a bit obvious, but it's a technique that I'm happy with and it works with both UDP and TCP, so I'll write about it:

1) Never queue up significant amounts of outgoing data: specifically, try to avoid marshalling your in-memory data structures into serialized-byte-buffers until the last possible moment. Instead, when your sending socket select()s as ready-for-write, flatten the current state of the relevant/dirty data structures at that time, and send() them out immediately. That way data will never "build up" on the sending side. (also, be sure to set the SO_SNDBUF of your socket to as small as you can get away with, to minimize data queueing inside the kernel)

2) You can do something similar on the receiving side, assuming your data is keyed in some way: instead of doing a (read data message, process data message, repeat) loop, you can read all available data messages and just place them into a keyed data structure (e.g. a hash table) until the socket has no more data available to read, and then (and only then) iterate over the data structure and process the data. The advantage of this is that if your receiving client has to do any non-trivial processing on the received data, then obsolete incoming messages will be automatically/implicitly dropped (as their replacement overwrites them in the keyed data structure) and so incoming packets won't back up in the kernel's incoming message queue. (You could just let the kernel's queue fill up and drop packets, of course, but then your program ends up reading the 'old' packets and dropping the 'newer' ones, which isn't usually what you want). As a further optimization, you could have the I/O thread hand the keyed data structure over to a separate processing thread, so that the I/O won't get held off by the processing.
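
Here is a minimal sketch of that keyed-receive pattern in C, assuming a non-blocking UDP socket and a small integer key at the start of each datagram (both assumptions of mine, purely for illustration):

    /* Sketch only: drain the socket completely, keeping only the newest message per
     * key, then process. Later datagrams for a key overwrite earlier ones, so stale
     * updates are dropped implicitly. Table size and key layout are placeholders. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    #define TABLE_SIZE 1024                  /* one slot per key hash */

    struct latest {
        bool    valid;
        size_t  len;
        char    data[2048];
    };

    static struct latest table[TABLE_SIZE];

    static void drain_and_process(int fd)
    {
        char buf[2048];
        ssize_t len;
        while ((len = recv(fd, buf, sizeof buf, MSG_DONTWAIT)) > 0) {
            if ((size_t)len < sizeof(uint32_t))
                continue;                    /* runt datagram, ignore */
            uint32_t key;
            memcpy(&key, buf, sizeof key);
            struct latest *slot = &table[key % TABLE_SIZE];
            slot->valid = true;
            slot->len   = (size_t)len;
            memcpy(slot->data, buf, (size_t)len);
        }
        /* recv() returned -1/EAGAIN: the kernel queue is empty, now process. */
        for (size_t i = 0; i < TABLE_SIZE; i++) {
            if (table[i].valid) {
                /* expensive work on table[i].data[0..table[i].len) happens here */
                table[i].valid = false;
            }
        }
    }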
