My code is as simple as below. I found rmb and wmb for read and write barriers, but no general one. lwsync is available on PowerPC, but what is the replacement on x86? Thanks in advance.
#define barrier() __asm__ volatile ("lwsync")
...
lock();
if (!pInst)
{
    T* temp = new T;
    barrier();
    pInst = temp;
}
unlock();
rmb() and wmb() are Linux kernel functions. There is also mb().
The corresponding x86 instructions are lfence, sfence, and mfence, IIRC.
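As an illustration (my sketch, not part of the original answer), those kernel-style macros can be mapped onto the x86 instructions with GCC-style inline asm on an x86-64 target; the macro names mirror the kernel's, and publish_then_read is a made-up demo in the style of the question's pInst pattern:

```cpp
#include <cassert>

// Sketch: x86 counterparts of the kernel's mb()/rmb()/wmb(), assuming a
// GCC-compatible compiler targeting x86-64. The "memory" clobber also
// stops the compiler itself from reordering across the barrier.
#define mb()  __asm__ __volatile__ ("mfence" ::: "memory") // full barrier
#define rmb() __asm__ __volatile__ ("lfence" ::: "memory") // load barrier
#define wmb() __asm__ __volatile__ ("sfence" ::: "memory") // store barrier

// Hypothetical demo: publish data, then a flag, with a store barrier
// between them, the way the question publishes pInst.
int publish_then_read() {
    static int data = 0, flag = 0;
    data = 42;
    wmb();              // the store to data is ordered before the flag store
    flag = 1;
    rmb();              // later loads are not performed before this point
    return flag ? data : 0;
}
```

Note this is hardware ordering only in the single-threaded demo; the point is the macro-to-instruction mapping, not the demo itself.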
There's a particular file in the Cilk runtime you might find interesting, cilk-sysdep.h, which contains the system-specific mappings for memory barriers. I've extracted a small section relevant to your question on x86 (i386).
file: cilk-sysdep.h (the numbers on the LHS are line numbers in that file)

252  * We use an xchg instruction to serialize memory accesses, as can
253  * be done according to the Intel Architecture Software Developer's
254  * Manual, Volume 3: System Programming Guide
255  * (http://www.intel.com/design/pro/manuals/243192.htm), page 7-6,
256  * "For the P6 family processors, locked operations serialize all
257  * outstanding load and store operations (that is, wait for them to
258  * complete)."  The xchg instruction is a locked operation by
259  * default.  Note that the recommended memory barrier is the cpuid
260  * instruction, which is really slow (~70 cycles).  In contrast,
261  * xchg is only about 23 cycles (plus a few per write buffer
262  * entry?).  Still slow, but the best I can find.  -KHR
263  *
264  * Bradley also timed "mfence", and on a Pentium IV xchgl is still quite a bit faster:
265  * mfence appears to take about 125 ns on a 2.5GHZ P4
266  * xchgl appears to take about 90 ns on a 2.5GHZ P4
267  * However on an Opteron, the performance of mfence and xchgl are both *MUCH MUCH BETTER*.
268  * mfence takes 8ns on a 1.5GHZ AMD64 (maybe this is an 801)
269  * sfence takes 5ns
270  * lfence takes 3ns
271  * xchgl takes 14ns
272  * see mfence-benchmark.c
273  */
274  int x=0, y;
275  __asm__ volatile ("xchgl %0,%1" :"=r" (x) :"m" (y), "0" (x) :"memory");
276  }
What I liked about this is that xchgl appears to be faster :) though you should really implement both and measure on your own hardware.
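Wrapped up as a standalone function (my sketch, not Cilk's code verbatim), the excerpt's trick looks like this; xchg with a memory operand asserts the LOCK signal implicitly, which is what makes it serializing:

```cpp
// Sketch of the xchg-based full barrier from the cilk-sysdep.h excerpt.
// xchg with a memory operand is implicitly LOCKed, so no LOCK prefix is
// needed; the "memory" clobber is the compiler-side barrier.
static inline void xchg_barrier() {
    int x = 0, y = 0;
    __asm__ __volatile__ ("xchgl %0,%1"
                          : "=r" (x)
                          : "m" (y), "0" (x)
                          : "memory");
}
```

Whether it actually beats mfence depends on the microarchitecture, as the P4-versus-Opteron numbers in the comment show.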
You don't say exactly what lock and unlock are in this code. I'm presuming they are mutex operations. On powerpc a mutex acquire function will use an isync (without which the hardware may evaluate your if (!pInst) before the lock()), and will have an lwsync (or sync if your mutex implementation is ancient) in the unlock().
So, presuming all your accesses (both read and write) to pInst are guarded by your lock and unlock methods, your barrier use is redundant. The unlock will have a sufficient barrier to ensure that the pInst store is visible before the unlock operation completes (so that it will be visible after any subsequent lock acquire, presuming the same lock is used).
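For completeness, here is a hedged sketch of how the same double-checked pattern can be written with no hand-rolled barrier at all, assuming a C++11 compiler is available; the names T, pInst, and getInstance follow the question, everything else (the member, the mutex name) is illustrative:

```cpp
#include <atomic>
#include <mutex>

struct T { int value = 7; };   // stand-in for the question's T

static std::atomic<T*> pInst{nullptr};
static std::mutex instMutex;

T* getInstance() {
    T* p = pInst.load(std::memory_order_acquire);   // plays the role of isync/lwsync
    if (!p) {
        std::lock_guard<std::mutex> guard(instMutex);
        p = pInst.load(std::memory_order_relaxed);  // re-check under the lock
        if (!p) {
            p = new T;
            // the release store replaces "barrier(); pInst = temp;"
            pInst.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

The compiler then emits whatever fence the target needs (lwsync on PowerPC, nothing extra on x86), which is exactly the "general replacement" being asked for.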
On x86 and x64 your lock() will use some form of LOCK prefixed instruction, which automatically has bidirectional fencing behaviour.
Your unlock on x86 and x64 only has to be a store instruction (unless you use some of the special string instructions within your critical section, in which case you'll need an SFENCE).
The manual:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
has good information on all the fences as well as the effects of the LOCK prefix (and when that is implied).
ps. In your unlock code you'll also have to have something that enforces compiler ordering (so if it is just a store of zero, you'll also need something like the GCC-style __asm__ __volatile__ ( "" ::: "memory" )).
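To make that concrete, a minimal hypothetical x86 unlock along those lines might look like the following; lock_word and simple_unlock are invented names for illustration, assuming a hand-rolled spinlock whose lock word is a plain int:

```cpp
// Sketch: an x86 unlock that is just a store of zero. The empty asm
// emits no instructions; its "memory" clobber only forbids the compiler
// from sinking critical-section stores below the unlock store.
static volatile int lock_word = 1;   // 1 = held, 0 = free (illustrative)

static void simple_unlock() {
    __asm__ __volatile__ ("" ::: "memory"); // compiler-only barrier
    lock_word = 0;                          // plain store; x86 stores are not
                                            // reordered with earlier stores
}
```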