Memory Bandwidth Performance for Modern Machines

I'm designing a real-time system that occasionally has to duplicate a large amount of memory. The memory consists of non-tiny regions, so I expect the copying performance will be fairly close to the maximum bandwidth the relevant components (CPU, RAM, MB) can deliver. This led me to wonder what kind of raw memory bandwidth a modern commodity machine can muster.

My aging Core2Duo gives me 1.5 GB/s if I use 1 thread to memcpy() (and understandably less if I memcpy() with both cores simultaneously). While 1.5 GB is a fair amount of data, the real-time application I'm working on will have something like 1/50th of a second, which means 30 MB. Basically, almost nothing. And perhaps worst of all, as I add multiple cores, I can process a lot more data without any increased performance for the needed duplication step.

But a low-end Core2Duo isn't exactly hot stuff these days. Are there any sites with information, such as actual benchmarks, on raw memory bandwidth on current and near-future hardware?

Furthermore, for duplicating large amounts of data in memory, are there any shortcuts, or is memcpy() as good as it will get?

Given a bunch of cores with nothing to do but duplicate as much memory as possible in a short amount of time, what's the best I can do?

EDIT: I'm still looking for good information on raw Memory Copy performance. I just ran my old memcpy() benchmark. Same machine and settings, now gives 2.5 GB/s...

On newer CPUs such as the Nehalem, and on AMD CPUs since the Opteron, the memory is "local" to one CPU, where a single CPU may have multiple cores. That is, it takes a certain amount of time for a core to access the local memory attached to its CPU, and more time for the core to access remote memory, where remote memory is memory that is local to other CPUs. This is called non-uniform memory access, or NUMA. For the best memcpy performance, you want to set your BIOS to NUMA mode, pin your threads to cores, and always access local memory. Find out more about NUMA on Wikipedia.
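As a rough illustration of the "pin your threads to cores" step, here is a minimal sketch that assumes Linux with glibc (sched_getaffinity and sched_setaffinity are not portable, and Windows uses a different API). The helper names first_allowed_cpu and pin_calling_thread are made up for this example. On Linux the default policy usually places a page on the node of the thread that first touches it, so a pinned thread that allocates and initializes its own buffers will tend to end up working on local memory, but that is worth verifying on your own system.

```c
#define _GNU_SOURCE            /* exposes the CPU_* macros and affinity calls in glibc */
#include <sched.h>

/* Hypothetical helper: lowest-numbered CPU the calling thread is
 * currently allowed to run on, or -1 on error. */
int first_allowed_cpu(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof set, &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
}

/* Hypothetical helper: restrict the calling thread to a single CPU.
 * A pid of 0 means "the calling thread". Returns 0 on success,
 * -1 on failure with errno set by sched_setaffinity(). */
int pin_calling_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}
```

Each copying thread would call pin_calling_thread() once at startup, before it allocates and first touches the buffers it is going to copy.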

Unfortunately I do not know of a site or recent papers on memcpy performance on recent CPUs and chipsets. Your best bet is probably to test it yourself.
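If you do test it yourself, something along these lines is a reasonable starting point. It is only a sketch: it assumes a POSIX system with clock_gettime(), the function names are invented for this example, and a single-threaded best-of-N timing like this says nothing about NUMA placement or multi-threaded copies. The second function just redoes the arithmetic from the question (1.5 GB/s at 50 frames per second leaves 30 MB per frame).

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: best observed memcpy() throughput in GB/s
 * (1 GB = 1e9 bytes) over `repeats` runs, or -1.0 if allocation fails. */
double memcpy_bandwidth_gbs(size_t bytes, int repeats) {
    assert(bytes > 0 && repeats > 0);
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    /* Touch every page first so page faults are not part of the timing. */
    memset(src, 1, bytes);
    memset(dst, 2, bytes);

    volatile unsigned char sink = 0;  /* discourages eliding the copies */
    double best = 0.0;
    for (int r = 0; r < repeats; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, bytes);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink ^= dst[bytes - 1];

        double secs = (double)(t1.tv_sec - t0.tv_sec)
                    + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
        if (secs > 0.0) {
            double gbs = (double)bytes / secs / 1e9;
            if (gbs > best) {
                best = gbs;
            }
        }
    }
    free(src);
    free(dst);
    return best;
}

/* Bytes that can be copied in one frame at a given copy rate. */
double bytes_per_frame(double gbs, double frames_per_second) {
    return gbs * 1e9 / frames_per_second;
}
```

Use a buffer much larger than the last-level cache (hundreds of MB, say); otherwise you are measuring cache bandwidth rather than RAM bandwidth.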

As for memcpy() performance, there are wide variations, depending on the implementation. The Intel C library (or possibly the compiler itself) has a memcpy() that is much faster than the one provided with Visual Studio 2005, for instance, at least on Intel machines.

The best memory copy you will be able to do will depend on the alignment of your data, whether you are able to use vector instructions, the page size, and so on. Implementing a good memcpy() is surprisingly involved, so I recommend finding and testing as many implementations as possible before writing your own. If you know more specifics about your copy, such as alignment and size, you might be able to implement something faster than Intel's memcpy(). If you want to get into the details, you might start with the Intel and AMD optimization guides, or Agner Fog's software optimization pages.
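To give a flavour of what "knowing the alignment and size" buys you, here is a sketch of a specialized copy for x86 with SSE2. It assumes both buffers are 16-byte aligned and the length is a multiple of 16, and it uses non-temporal (streaming) stores so the destination does not displace useful data from the cache. The function names are invented for this example, and whether it beats your library's memcpy() is something only a benchmark on your hardware can tell you; a good memcpy() may already do something similar for large copies.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics (x86 / x86-64 only) */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a buffer aligned to a 64-byte cache line.
 * C11 aligned_alloc() expects the size to be a multiple of the alignment. */
void *alloc_cacheline_aligned(size_t bytes) {
    size_t rounded = (bytes + 63) / 64 * 64;
    return aligned_alloc(64, rounded);
}

/* Hypothetical specialized copy. Preconditions (checked by assert):
 * dst and src are 16-byte aligned, bytes is a multiple of 16,
 * and the two regions do not overlap. */
void copy_aligned16_stream(void *dst, const void *src, size_t bytes) {
    assert((uintptr_t)dst % 16 == 0);
    assert((uintptr_t)src % 16 == 0);
    assert(bytes % 16 == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = bytes / 16;

    for (size_t i = 0; i < blocks; i++) {
        __m128i v = _mm_load_si128(s + i);  /* aligned 16-byte load */
        _mm_stream_si128(d + i, v);         /* store that bypasses the cache */
    }
    _mm_sfence();  /* make the streaming stores visible before returning */
}
```

Streaming stores tend to help only when the copy is much larger than the cache and the destination is not read again right away; for small or soon-reused buffers they can easily be slower.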

I think you're approaching the problem the wrong way. The goal, I assume, is to export a consistent snapshot of your data without destroying your real-time performance. Don't use hardware, use an algorithm.

What you want to do is define a journaling system on top of your data. When you start your in-memory transfer, you have two threads: the original that does work and thinks it is modifying the data (but is actually only writing to the journal), and a new thread that copies the old (unjournaled) data to a separate spot so it can slowly write it out.

When the new thread is done, you put it to work merging the data set with the journal until the journal is empty. When it's complete, the old thread can go back to interacting directly with the data instead of reading/writing through the journal-modified version.

Finally, the new thread can go over to the copied data and start slowly sending it off to a remote destination.

If you set up a system like this, you can get essentially instant snapshotting of arbitrarily large amounts of data in a running system, as long as you can finish the in-memory copy before the journal gets so full that the real-time system can't keep up with its processing demands.
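To make the moving parts concrete, here is a minimal single-threaded sketch of the bookkeeping. Everything in it is an assumption made for illustration: the store type, the function names, the fixed sizes, and the flat array of ints standing in for your data. A real version needs synchronization between the real-time thread and the copier (omitted here), and a faster journal lookup than a linear scan.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DATA_LEN    1024   /* hypothetical size of the live data set */
#define JOURNAL_CAP 256    /* hypothetical journal capacity */

typedef struct { size_t index; int value; } journal_entry;

typedef struct {
    int data[DATA_LEN];                  /* live data, frozen while journaling */
    int snapshot[DATA_LEN];              /* destination of the in-memory copy */
    journal_entry journal[JOURNAL_CAP];  /* writes made during the copy */
    size_t journal_len;
    int journaling;                      /* nonzero while a snapshot is in progress */
} store;

/* Start a snapshot: from now on writes are diverted to the journal. */
void store_begin_snapshot(store *s) {
    s->journal_len = 0;
    s->journaling = 1;
}

/* Write path of the real-time thread.
 * Returns 0 on success, -1 if the journal is full. */
int store_write(store *s, size_t i, int v) {
    assert(i < DATA_LEN);
    if (!s->journaling) {
        s->data[i] = v;
        return 0;
    }
    if (s->journal_len == JOURNAL_CAP) {
        return -1;  /* the copy did not finish in time */
    }
    s->journal[s->journal_len].index = i;
    s->journal[s->journal_len].value = v;
    s->journal_len++;
    return 0;
}

/* Read path: the newest journal entry for i wins, else the base data. */
int store_read(const store *s, size_t i) {
    assert(i < DATA_LEN);
    if (s->journaling) {
        for (size_t k = s->journal_len; k-- > 0; ) {
            if (s->journal[k].index == i) {
                return s->journal[k].value;
            }
        }
    }
    return s->data[i];
}

/* Copier: duplicate the frozen base data. */
void store_copy_snapshot(store *s) {
    memcpy(s->snapshot, s->data, sizeof s->data);
}

/* Merge: replay the journal in order, then return to direct access. */
void store_end_snapshot(store *s) {
    for (size_t k = 0; k < s->journal_len; k++) {
        s->data[s->journal[k].index] = s->journal[k].value;
    }
    s->journal_len = 0;
    s->journaling = 0;
}
```

The -1 return from store_write() is the failure case described above: the journal filled up before the in-memory copy and merge completed.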