My code is as simple as below. I found rmb and wmb for read and write barriers, but no general one. lwsync is available on PowerPC, but what is the replacement on x86? Thanks in advance.
#define barrier() __asm__ volatile ("lwsync")
...
lock();
if (!pInst)
{
    T* temp = new T;
    barrier();
    pInst = temp;
}
unlock();
rmb() and wmb() are Linux kernel functions. There is also mb().
The corresponding x86 instructions are lfence, sfence, and mfence, IIRC.
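As an illustration (my sketch, not part of the original answer), those kernel-style macros can be mapped onto the x86 instructions with GCC-style inline asm on an x86-64 target; the macro names mirror the kernel's, and publish_then_read is a made-up demo in the style of the question's pInst pattern:

```cpp
#include <cassert>

// Sketch: x86 counterparts of the kernel's mb()/rmb()/wmb(), assuming a
// GCC-compatible compiler targeting x86-64. The "memory" clobber also
// stops the compiler itself from reordering across the barrier.
#define mb()  __asm__ __volatile__ ("mfence" ::: "memory") // full barrier
#define rmb() __asm__ __volatile__ ("lfence" ::: "memory") // load barrier
#define wmb() __asm__ __volatile__ ("sfence" ::: "memory") // store barrier

// Hypothetical demo: publish data, then a flag, with a store barrier
// between them, the way the question publishes pInst.
int publish_then_read() {
    static int data = 0, flag = 0;
    data = 42;
    wmb();              // the store to data is ordered before the flag store
    flag = 1;
    rmb();              // later loads are not performed before this point
    return flag ? data : 0;
}
```

Note this is hardware ordering only in the single-threaded demo; the point is the macro-to-instruction mapping, not the demo itself.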
There's a particular file in the Cilk runtime you might find interesting, cilk-sysdep.h, which contains the system-specific mappings for memory barriers. I've extracted a small section relevant to your question on x86 (i386).
file: cilk-sysdep.h (the numbers on the LHS are line numbers in that file)

252  * We use an xchg instruction to serialize memory accesses, as can
253  * be done according to the Intel Architecture Software Developer's
254  * Manual, Volume 3: System Programming Guide
255  * (http://www.intel.com/design/pro/manuals/243192.htm), page 7-6,
256  * "For the P6 family processors, locked operations serialize all
257  * outstanding load and store operations (that is, wait for them to
258  * complete)."  The xchg instruction is a locked operation by
259  * default.  Note that the recommended memory barrier is the cpuid
260  * instruction, which is really slow (~70 cycles).  In contrast,
261  * xchg is only about 23 cycles (plus a few per write buffer
262  * entry?).  Still slow, but the best I can find.  -KHR
263  *
264  * Bradley also timed "mfence", and on a Pentium IV xchgl is still quite a bit faster:
265  * mfence appears to take about 125 ns on a 2.5GHZ P4
266  * xchgl appears to take about 90 ns on a 2.5GHZ P4
267  * However on an Opteron, the performance of mfence and xchgl are both *MUCH MUCH BETTER*.
268  * mfence takes 8ns on a 1.5GHZ AMD64 (maybe this is an 801)
269  * sfence takes 5ns
270  * lfence takes 3ns
271  * xchgl takes 14ns
272  * see mfence-benchmark.c
273  */
274  int x=0, y;
275  __asm__ volatile ("xchgl %0,%1" :"=r" (x) :"m" (y), "0" (x) :"memory");
276  }
What I liked about this is that xchgl appears to be faster :) though you should really implement both and measure on your own hardware.
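Wrapped up as a standalone function (my sketch, not Cilk's code verbatim), the excerpt's trick looks like this; xchg with a memory operand asserts the LOCK signal implicitly, which is what makes it serializing:

```cpp
// Sketch of the xchg-based full barrier from the cilk-sysdep.h excerpt.
// xchg with a memory operand is implicitly LOCKed, so no LOCK prefix is
// needed; the "memory" clobber is the compiler-side barrier.
static inline void xchg_barrier() {
    int x = 0, y = 0;
    __asm__ __volatile__ ("xchgl %0,%1"
                          : "=r" (x)
                          : "m" (y), "0" (x)
                          : "memory");
}
```

Whether it actually beats mfence depends on the microarchitecture, as the P4-versus-Opteron numbers in the comment show.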
You don't say exactly what lock and unlock are in this code. I'm presuming they are mutex operations. On powerpc a mutex acquire function will use an isync (without which the hardware may evaluate your if (!pInst) before the lock()), and will have an lwsync (or sync if your mutex implementation is ancient) in the unlock().
So, presuming all your accesses (both read and write) to pInst are guarded by your lock and unlock methods, your barrier use is redundant. The unlock will have a sufficient barrier to ensure that the pInst store is visible before the unlock operation completes (so that it will be visible after any subsequent lock acquire, presuming the same lock is used).
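For completeness, here is a hedged sketch of how the same double-checked pattern can be written with no hand-rolled barrier at all, assuming a C++11 compiler is available; the names T, pInst, and getInstance follow the question, everything else (the member, the mutex name) is illustrative:

```cpp
#include <atomic>
#include <mutex>

struct T { int value = 7; };   // stand-in for the question's T

static std::atomic<T*> pInst{nullptr};
static std::mutex instMutex;

T* getInstance() {
    T* p = pInst.load(std::memory_order_acquire);   // plays the role of isync/lwsync
    if (!p) {
        std::lock_guard<std::mutex> guard(instMutex);
        p = pInst.load(std::memory_order_relaxed);  // re-check under the lock
        if (!p) {
            p = new T;
            // the release store replaces "barrier(); pInst = temp;"
            pInst.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

The compiler then emits whatever fence the target needs (lwsync on PowerPC, nothing extra on x86), which is exactly the "general replacement" being asked for.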
On x86 and x64 your lock() will use some form of LOCK prefixed instruction, which automatically has bidirectional fencing behaviour.
Your unlock on x86 and x64 only has to be a store instruction (unless you use some of the special string instructions within your critical section, in which case you'll need an SFENCE).
The manual:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
has good information on all the fences as well as the effects of the LOCK prefix (and when that is implied).
ps. In your unlock code you'll also have to have something that enforces compiler ordering (so if it is just a store of zero, you'll also need something like the GCC-style __asm__ __volatile__ ( "" ::: "memory" )).
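To make that concrete, a minimal hypothetical x86 unlock along those lines might look like the following; lock_word and simple_unlock are invented names for illustration, assuming a hand-rolled spinlock whose lock word is a plain int:

```cpp
// Sketch: an x86 unlock that is just a store of zero. The empty asm
// emits no instructions; its "memory" clobber only forbids the compiler
// from sinking critical-section stores below the unlock store.
static volatile int lock_word = 1;   // 1 = held, 0 = free (illustrative)

static void simple_unlock() {
    __asm__ __volatile__ ("" ::: "memory"); // compiler-only barrier
    lock_word = 0;                          // plain store; x86 stores are not
                                            // reordered with earlier stores
}
```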