Can i allocate memory faster by using multiple threads?_问答_开发者

If i make a loop that reserves 1kb integer arrays, int[1024], and i want it to allocate 10000 arrays, can i make it faster by running the memory allocations from multiple threads?

I want them to be in the heap.

Let's assume that i have a multi-core processor for the job.

I already did try this, but it decreased the performance. I'm just wondering, did I just make bad code or is there something that i didn't know about memory allocation?

Does the answer depend on the OS? please tell me how it works on different platforms if so.

Edit:

The integer array allocation loo开发者_开发问答p was just a simplified example. Don't bother telling me how I can improve that.

It depends on many things, but primarily:

the OS
the implementation of malloc you are using

The OS is responsible for allocating the "virtual memory" that your process has access to and builds a translation table that maps the virtual memory back to actual memory addresses.

Now, the default implementation of malloc is generally conservative, and will simply have a giant lock around all this. This means that requests are processed serially, and the only thing that allocating from multiple threads instead of one does is slowing down the whole thing.

There are more clever allocation schemes, generally based upon pools, and they can be found in other malloc implementations: tcmalloc (from Google) and jemalloc (used by Facebook) are two such implementations designed for high-performance in multi-threaded applications.

There is no silver bullet though, and at one point the OS must perform the virtual <=> real translation which requires some form of locking.

Your best bet is to allocate by arenas:

Allocate big chunks (arenas) at once
Split them up in arrays of the appropriate size

There is no need to parallelize the arena allocation, and you'll be better off asking for the biggest arenas you can (do bear in mind that allocation requests for a too large amount may fail), then you can parallelize the split.

tcmalloc and jemalloc may help a bit, however they are not designed for big allocations (which is unusual) and I do not know if it is possible to configure the size of the arenas they request.

The answer depends on the memory allocations routine, which are a combination of a C++ library layer operator new, probably wrapped around libC malloc(), which in turn occasionally calls an OS function such as sbreak(). The implementation and performance characteristics of all of these is unspecified, and may vary from compiler version to version, with compiler flags, different OS versions, different OSes etc.. If profiling shows it's slower, then that's the bottom line. You can try varying the number of threads, but what's probably happening is that the threads are all trying to obtain the same lock in order to modify the heap... the overheads involved with saying "ok, thread X gets the go ahead next" and "thread X here, I'm done" are simply wasting time. Another C++ environment might end up using atomic operations to avoid locking, which might or might not prove faster... no general rule.

If you want to complete faster, consider allocating one array of 10000*1024 ints, then using different parts of it (e.g. [0]..[1023], [1024]..[2047]...).

I think that perhaps you need to adjust your expectation from multi-threading.

The main advantage of multi-threading is that you can do tasks asynchronously, i.e. in parallel. In your case, when your main thread needs more memory it does not matter whether it is allocated by another thread - you still need to stop and wait for allocation to be accomplished, so there is no parallelism here. In addition, there is an overhead of a thread signaling when it is done and the other waiting for completion, which just can degrade the performance. Also, if you start a thread each time you need allocation this is a huge overhead. If not, you need some mechanism to pass the allocation request and response between threads, a kind of task queue which again is an overhead without gain.

Another approach could be that the allocating thread runs ahead and pre-allocates the memory that you will need. This can give you a real gain, but if you are doing pre-allocation, you might as well do it in the main thread which will be simpler. E.g. allocate 10M in one shot (or 10 times 1M, or as much contiguous memory as you can have) and have an array of 10,000 pointers pointing to it at 1024 offsets, representing your arrays. If you don't need to deallocate them independently of one another this seems to be much simpler and could be even more efficient than using multi-threading.

As for glibc it has arena's (see here), which has lock per arena.

You may also consider tcmalloc by google (stands for Thread-Caching malloc), which shows 30% boost performance for threaded application. We use it in our project. In debug mode it even can discover some incorrect usage of memory (e.g. new/free mismatch)

As far as I know all os have implicit mutex lock inside the dynamic allocating system call (malloc...). If you think a moment about that, if you do not lock this action you could run into terrible problems.

You could use the multithreading api threading building blocks http://threadingbuildingblocks.org/ which has a multithreading friendly scalable allocator.

But I think a better idea should be to allocate the whole memory once(should work quite fast) and split it up on your own. I think the tbb allocator does something similar.

Do something like

new int[1024*10000] and than assign the parts of 1024ints to your pointer array or what ever you use.

Do you understand?

Because the heap is shared per-process the heap will be locked for each allocation, so it can only be accessed serially by each thread. This could explain the decrease of performance when you do alloc from multiple threads like you are doing.

If the arrays belong together and will only be freed as a whole, you can just allocate an array of 10000*1024 ints, and then make your individual arrays point into it. Just remember that you cannot delete the small arrays, only the whole.

int *all_arrays = new int[1024 * 10000];
int *small_array123 = all_arrays + 1024 * 123;

Like this, you have small arrays when you replace the 123 with a number between 0 and 9999.

The answer depends on the operating system and runtime used, but in most cases, you cannot.

Generally, you will have two versions of the runtime: a multi-threaded version and a single-threaded version.

The single-threaded version is not thread-safe. Allocations made by two threads at the same time can blow your application up.

The multi-threaded version is thread-safe. However, as far as allocations go on most common implementations, this just means that calls to malloc are wrapped in a mutex. Only one thread can ever be in the malloc function at any given time, so attempting to speed up allocations with multiple threads will just result in a lock convoy.

It may be possible that there are operating systems that can safely handle parallel allocations within the same process, using minimal locking, which would allow you to decrease time spent allocating. Unfortunately, I don't know of any.