Disclaimer: please do not bother reading this long post unless you are an embedded programmer, or a kernel developer, or a high performance system software programmer. It is about memory allocators and might not be of any interest to you.
I am building a library for high-volume, high-performance same-machine IPC, using OS-specific pipes (named pipes on Windows and UNIX domain sockets on UNIX). The library receives messages and injects them into a processing engine; this engine could be in the same process in C++, or something like Erlang or Lua, or perhaps moved off to a central machine.
The messages currently originate inside a web server and are part of the resolution pipeline of an HTTP request, so the library must scale with the web server -- i.e., I cannot write an inefficient system, because the web server's worker threads depend on the speed with which they can offload these messages. If a web server is capable of handling 10k requests for static pages, it should not be massively hampered by this IPC mechanism.
The data comes in as typed messages, and I believe they will have their own size statistics which could be leveraged. Doug Lea's allocator creates linked lists of buckets by size, if I remember correctly; perhaps that is the way I should go -- it is the standard general-purpose allocator.
I want zero-copy semantics, so until a message is done processing it should not be copied around. On a side note: this perhaps takes Google's protocol buffers out of the picture (can't be sure at the moment), as the wire format is different from the in-memory object format. I will investigate further -- my current implementation uses my own IPC format; perhaps someone has some opinions on how protocol buffers does things: does it use zero-copy internally?
I want to use this allocator in all areas of the application. My study of network-stack software is urging me towards a processor-page-aligned, pool-based allocator; my application certainly has the same access semantics. I would copy all messages into a page and only recycle a page once all of the elements allocated from it are inactive.
To slightly formalize and help you understand what I am thinking about consider the following data structure:
typedef struct _page
{
    int    active_elements;       // page is returned to the empty pool at zero
    int    free_bytes;            // used to sort the pages
    void  *next_free_contiguous;  // pointer to where the next allocation may occur
    struct _page *next_page;      // linked-list next-page pointer
} page;

struct pageAllocator
{
    page *ll_empty;   // completely empty pages
    page *ll_active;  // sorted by highest size available, could be a binary tree
                      // continue reading for the rationale
};
Some points are:
- The messages won't linger around, so pages will become free fairly soon.
- I need to write bullet-proof code to free the pages.
- I do not know if page->next_free_contiguous needs to be word aligned.
- I assume that under pressure the number of pages allocated will be stable.
- Should I be returning pages to the OS, or should I just keep them? Under maximum pressure the system would only allocate as much as it needs once that maximum has been established.
- Should I allocate messages from the first element in the size-sorted list (first being the one with the most free space), or should I find the best-matching page? In that case I would have to use a binary tree of some sort. Am I re-implementing Doug Lea's algorithm here? (A rough allocation sketch follows below.)
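To make that last point concrete, here is a rough, untested sketch of what I mean by allocating from the size-sorted active list (first-fit against the page with the most free space). It builds on the structs above, assumes a message always fits in one page, and deliberately omits alignment and re-sorting:

#include <stddef.h>

/* Rough sketch only: allocate 'size' bytes from the pool described above.
   Assumes ll_active is kept sorted by free_bytes, largest first. */
void *page_alloc(struct pageAllocator *a, size_t size)
{
    page *p = a->ll_active;
    void *mem;

    if (p == NULL || p->free_bytes < (int)size) {
        /* no active page is big enough: promote a completely empty page */
        if (a->ll_empty == NULL)
            return NULL;                       /* or grow the pool here */
        p = a->ll_empty;
        a->ll_empty = p->next_page;
        p->next_page = a->ll_active;
        a->ll_active = p;
    }

    mem = p->next_free_contiguous;
    p->next_free_contiguous = (char *)p->next_free_contiguous + size;
    p->free_bytes -= (int)size;
    p->active_elements++;
    /* a real version would re-insert p into the sorted list here,
       and word-align next_free_contiguous if needed */
    return mem;
}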
I want suggestions from people who have personally dealt with such things; alternate allocator suggestions would be welcome. Off-the-shelf libraries would be great (they have to be ANSI C or a C-preprocessor meta-code implementation). The APR is a popular library with a hierarchical pool-based strategy, but I am not dealing with sessions at this low level, so I don't need a hierarchy.
I don't think I am pre-optimizing, because I am convinced I need it. My current implementation of this system jumps to 30% CPU during web-server CPU saturation, and I know this is because of my naive way of programming the application using new and a lot of copy construction in C++. I will of course benchmark my application before I proceed (I am too busy at the moment to rebuild a test environment), but any discussion here will be useful regardless of the outcome.
--- edit 1 ---
This implementation was inspired by some comments from Markr; check the answer below for some discussion. A bit more thinking led to the following.
OK, I am thinking about a circular buffer now (say 1mb^x), composed of equal-sized elements (perhaps some multiple of 128 bytes). Allocations happen incrementally towards the end, with an allocate-head (A) marking where allocations may occur from, and a free-marker (F) marking contiguous free regions (0) being returned to the pool. Consider the following ASCII illustration:
|000000F101110001111A00000000|
0=free, 1= occupied.
Some specifics are (a rough sketch follows after these points):
- The free-marker delimits free contiguous regions. In the illustration, F will only move forward (by two chunks here) once the chunk immediately following it is returned.
- Once the allocation reaches the end, or there isn't enough space at the end to perform a contiguous allocation, the allocate-head moves to the beginning; most of the buffer should be free by this point.
- Once a message is moved to its destination, its chunks are immediately returned to the buffer.
- On same-machine delivery the messages will be delivered in short order (some blocking), or immediately if it just requires a thread-free function call.
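Here is the sketch referred to above: a minimal, single-threaded version of the allocate-head/free-marker bookkeeping, assuming 128-byte chunks and a made-up 1 MiB ring. It refuses an allocation rather than overrun live messages; a real version would need locking or a lock-free scheme:

#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 128
#define NUM_CHUNKS 8192                 /* 1 MiB ring; made-up figure */

static unsigned char ring[NUM_CHUNKS * CHUNK_SIZE];
static unsigned char used[NUM_CHUNKS];  /* 1 = occupied, 0 = free     */
static size_t alloc_head = 0;           /* 'A' in the illustration    */
static size_t free_marker = 0;          /* 'F' in the illustration    */

/* Allocate a contiguous run of chunks; wrap to the start when the tail
   is too short.  Returns NULL if the run in front of A is not free. */
void *ring_alloc(size_t nbytes)
{
    size_t chunks = (nbytes + CHUNK_SIZE - 1) / CHUNK_SIZE;
    size_t i;
    void *mem;

    if (chunks > NUM_CHUNKS)
        return NULL;
    if (alloc_head + chunks > NUM_CHUNKS)
        alloc_head = 0;                 /* not enough room at the end: wrap */

    for (i = 0; i < chunks; i++)        /* never overwrite live messages */
        if (used[alloc_head + i])
            return NULL;

    memset(&used[alloc_head], 1, chunks);
    mem = &ring[alloc_head * CHUNK_SIZE];
    alloc_head += chunks;
    if (alloc_head == NUM_CHUNKS)
        alloc_head = 0;                 /* keep A in [0, NUM_CHUNKS) */
    return mem;
}

/* Return a message's chunks; F only advances over chunks that are free. */
void ring_free(void *mem, size_t nbytes)
{
    size_t first  = (size_t)((unsigned char *)mem - ring) / CHUNK_SIZE;
    size_t chunks = (nbytes + CHUNK_SIZE - 1) / CHUNK_SIZE;

    memset(&used[first], 0, chunks);
    while (free_marker != alloc_head && !used[free_marker])
        free_marker = (free_marker + 1) % NUM_CHUNKS;
}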
Comments would be nice, and I would still prefer some self-contained, portable library if anyone knows of one that does something similar. The GLib slice allocator looked nice, but I believe the above implementation is simpler and probably faster, and active regions on the buffer will always have good locality.
Answer: After reading others' comments, it sounds like using a fixed pool of objects and allocating ownership of them may be a better idea than trying to mess with allocators.
Presumably you have a large number of threads, so any allocator would need to do quite a bit of locking; a large number of small allocs/frees sounds undesirable.
If you think you have enough memory to keep reasonably sized pools per thread, you can just allocate a fixed pool per thread containing enough IPC objects to handle all the requests that thread will have outstanding at one time, allocate them per-thread from this pool with no locking whatsoever, and mark them for reuse once those requests are complete.
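A minimal sketch of that idea, assuming each worker thread owns one pool instance and both takes and returns its own messages (the type ipc_msg, its payload size, and POOL_SIZE are made up):

#include <stddef.h>

#define POOL_SIZE 256                    /* max requests outstanding per thread (made up) */

typedef struct ipc_msg {
    struct ipc_msg *next_free;           /* intrusive free-list link            */
    char payload[512];                   /* whatever your message really holds  */
} ipc_msg;

typedef struct {
    ipc_msg  storage[POOL_SIZE];
    ipc_msg *free_list;
} msg_pool;                              /* one of these per worker thread      */

void pool_init(msg_pool *p)
{
    size_t i;
    p->free_list = NULL;
    for (i = 0; i < POOL_SIZE; i++) {
        p->storage[i].next_free = p->free_list;
        p->free_list = &p->storage[i];
    }
}

ipc_msg *pool_get(msg_pool *p)           /* no locking: the pool is thread-local */
{
    ipc_msg *m = p->free_list;
    if (m != NULL)
        p->free_list = m->next_free;
    return m;                            /* NULL means the thread is over its budget */
}

void pool_put(msg_pool *p, ipc_msg *m)   /* mark for reuse once the request completes */
{
    m->next_free = p->free_list;
    p->free_list = m;
}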
Once you move the other processes off-box, the problem becomes completely different, as zero-copy is no longer an option (the data clearly needs to be copied at least once to reach another machine), so I think you'll have other issues.
Answer: Perhaps you are looking for something like the GLib Slice Allocator?
It is designed for high-speed allocation and deallocation of similarly-sized objects. It also claims to have excellent performance in a multi-threaded environment.
The one aspect of the API that might cause trouble for you is that you must give the size of your object during deallocation as well as allocation.
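As a rough illustration (MyMsg is a made-up type), the typed helpers derive the size from the type, while the raw g_slice_alloc/g_slice_free1 pair needs the size passed again at free time:

#include <glib.h>

typedef struct {
    guint32 type;
    gsize   length;
    gchar   payload[256];
} MyMsg;                          /* made-up message type */

static void example(void)
{
    MyMsg   *msg;
    gpointer buf;

    msg = g_slice_new0(MyMsg);    /* size derived from the type */
    msg->type = 1;
    /* ... fill in and hand the message off ... */
    g_slice_free(MyMsg, msg);     /* the type (hence size) must be given again */

    buf = g_slice_alloc(512);     /* raw variant */
    g_slice_free1(512, buf);      /* size must match the allocation */
}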
Answer: Think about WHAT you want to achieve.
- Do you want zero-copy semantics when processing the messages?
- Or do you want a fancy new uberfast memory allocator?
A "zero copy memory allocator" is non-sense!
Writing a general purpose memory allocator is difficult (most probably the default allocator is perfect for your purposes). If you need an idea for a good allocator for your purposes, then please provide some more details.
The zero-copy thing: it depends a lot on other things in your application, but for example assume one piece of code that reads a stream from a named pipe, and another piece of code that parses this stream and interprets it as messages.
First, reading from the IPC pipe: keep a list of buffers. read() from the FIFO into buffers[current_write_buffer_idx].data + write_pos, at most (buffer_size - write_pos) bytes. Add the number of bytes read to write_pos, and so on. If a buffer is full, check your list of buffers for an unused one (keep a usage flag per buffer). If no unused one is found, allocate a new buffer. Set next_buffer_index on the old buffer (used for the "parsing" later).
You end up with:
#include <stdbool.h>
#include <stddef.h>

typedef struct Buffer {
    void  *data;                  /* buffer_size bytes of raw stream data        */
    bool   is_used;               /* true while the parser still needs it        */
    size_t number_bytes_written;  /* how much of data is valid                   */
    size_t next_buffer_index;     /* continuation buffer for the stream          */
} Buffer;

Buffer *buffers;                  /* array of number_buffers buffers             */
size_t  number_buffers;
size_t  current_write_buffer_idx;
size_t  current_read_buffer_idx;
size_t  write_pos, read_pos;      /* offsets within the current write/read buffer */
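A rough sketch of that read loop, building on the declarations above (BUFFER_SIZE and grow_buffer_array() are hypothetical, and error handling is minimal):

#include <unistd.h>
#include <string.h>

#define BUFFER_SIZE 65536                  /* fixed size of each data block (assumed) */

/* Read once from the pipe into the current write buffer; switch to an
   unused (or newly allocated) buffer when the current one fills up. */
ssize_t read_from_fifo(int fifo_fd)
{
    Buffer *b = &buffers[current_write_buffer_idx];
    ssize_t n = read(fifo_fd, (char *)b->data + write_pos,
                     BUFFER_SIZE - write_pos);
    if (n <= 0)
        return n;                          /* EOF or error: let the caller decide */

    write_pos += (size_t)n;
    b->number_bytes_written = write_pos;

    if (write_pos == BUFFER_SIZE) {        /* buffer full: pick the next one */
        size_t i, next = number_buffers;   /* sentinel: nothing found yet    */
        for (i = 0; i < number_buffers; i++)
            if (!buffers[i].is_used) { next = i; break; }
        if (next == number_buffers)
            next = grow_buffer_array();    /* hypothetical: realloc buffers[] */

        buffers[next].is_used = true;
        buffers[next].number_bytes_written = 0;
        b->next_buffer_index = next;       /* lets the parser follow the stream */
        current_write_buffer_idx = next;
        write_pos = 0;
    }
    return n;
}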
Second, parsing the stream: find the next byte at buffers[current_read_buffer_idx].data + read_pos. You can read bytes from that buffer up to buffer.number_bytes_written; if the number of written bytes equals buffer_size (which is fixed), the buffer is full and you have to continue reading from the buffer at next_buffer_index, and so on. Set is_used to true whenever you write to a new buffer, and set it to false whenever you are done parsing the buffer.
What "reading" exactly means, depends a lot of how your message are coded!
So, to make it (almost) zero-copy, you have to provide more details. When messages are fixed-size it's too easy. But assuming your messages are prefixed with type and/or size information, and the data of the message is some kind of struct, then you could do the following:
- read type and/or size information from the stream, and...
- if the message is not yet completely received (read_pos, number_bytes_written, ...), then skip...
- if the message (the size is known now) fits into a single buffer, just use a pointer into this buffer, (struct MsgXYZ*)(buffers[current_read_buffer_idx].data + start_of_message_index), and pass it to whatever code does the job
- if the message spans two buffers, create a temporary buffer, copy the parts of the message together, and pass a pointer to that message to the processing code; after that, free the temporary buffer (it's a rare case anyway). A sketch of both paths follows below.
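For example (a rough sketch of both paths; struct MsgXYZ, the length handling, and process_message() are made-up assumptions about the message format, and alignment/aliasing concerns are ignored):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct MsgXYZ {                   /* made-up fixed-size message layout */
    uint32_t type;
    uint32_t value;
};

void process_message(const struct MsgXYZ *msg);   /* hypothetical consumer */

/* Dispatch one message whose size is already known from its prefix.
   start_of_message_index is its offset within the current read buffer. */
void dispatch_message(size_t start_of_message_index, size_t msg_len)
{
    Buffer *b = &buffers[current_read_buffer_idx];

    if (start_of_message_index + msg_len <= b->number_bytes_written) {
        /* zero-copy path: the message lies entirely within one buffer */
        process_message((const struct MsgXYZ *)
                        ((char *)b->data + start_of_message_index));
    } else {
        /* rare path: the message spans two buffers, stitch it together once */
        size_t first_part = b->number_bytes_written - start_of_message_index;
        Buffer *next = &buffers[b->next_buffer_index];
        char   *tmp  = malloc(msg_len);

        if (tmp == NULL)
            return;                        /* real code would report the error */
        memcpy(tmp, (char *)b->data + start_of_message_index, first_part);
        memcpy(tmp + first_part, next->data, msg_len - first_part);
        process_message((const struct MsgXYZ *)tmp);
        free(tmp);                         /* the temporary copy is the rare case */
    }
}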
Well, I omitted a lot of details here ;) Think twice about whether zero-copy really makes sense for your app.
Zero copy when reading from a socket/pipe/... is IMHO not possible; maybe through shared memory or some kind of mmap, but not with pipes.