Simulating LDREX/STREX (load/store exclusive) in Cortex-M0

In the Cortex-M3 instruction set, there exists a family of LDREX/STREX instructions such that if a location is read with an LDREX instruction, a following STREX instruction can write to that address only if the address is known to have been untouched. Typically, the effect is that the STREX will succeed if no interrupts ("exceptions" in ARM parlance) have occurred since the LDREX, but fail otherwise.

What's the most practical way to simulate such behavior in the Cortex M0? I would like to write C code for the M3 and have it portable to the M0. On the M3, one can say something like:

__inline void do_inc(unsigned int *dat)
{
  while(__strex(__ldrex(dat)+1,dat)) {}
}

to perform an atomic increment. The only ways I can think of to achieve similar functionality on the Cortex-M0 would be to either:

Have "ldrex" disable exceptions and have "strex" and "clrex" re-enable them, with the requirement that every "ldrex" must be followed soon thereafter by either a "strex" or "clrex".
Have "ldrex", "strex", and "clrex" be very small routines in RAM, with one instruction of "strex" being patched to either "str r1,[r2]" or "mov r0,#1". Have the "ldrex" routine plug a "str" instruction into the "strex" routine, and have the "clrex" routine plug "mov r0,#1" there. Have all exceptions that might invalidate a "ldrex" sequence call "clrex".

Depending upon how the ldrex/strex functions are used, disabling interrupts might work reasonably, but it seems icky to change the semantics of "load-exclusive" so as to cause bad side-effects if it's abandoned. The code-patching idea seems like it would achieve the desired semantics, but it seems clunky.
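For what it's worth, the second idea can also be sketched without self-modifying code by replacing the patched instruction with a flag. This is a variation on the idea, not the code-patching scheme itself: the store has to mask interrupts for a few instructions so that the flag check and the write can't be split, which the patched-instruction version avoids. All names below (emu_ldrex, emu_strex, emu_clrex, emu_reservation, EMU_IRQ_OFF/EMU_IRQ_ON) are made up for illustration, and the hooks default to no-ops so the sketch compiles off-target.

```c
#include <stdint.h>

/* Hypothetical hooks: on the target these would mask and unmask interrupts
 * (for example "cpsid i" / "cpsie i"). They default to no-ops here. */
#ifndef EMU_IRQ_OFF
#define EMU_IRQ_OFF() ((void)0)
#define EMU_IRQ_ON()  ((void)0)
#endif

/* Nonzero while a reservation opened by emu_ldrex() is still intact. */
static volatile uint32_t emu_reservation;

static inline uint32_t emu_ldrex(volatile uint32_t *addr) {
  emu_reservation = 1;   /* open the reservation BEFORE reading the data */
  return *addr;
}

/* Assumption: every handler that could touch the data calls this on entry. */
static inline void emu_clrex(void) {
  emu_reservation = 0;
}

/* Returns 0 if the store happened, 1 if the reservation was lost. */
static inline int emu_strex(uint32_t value, volatile uint32_t *addr) {
  int failed = 1;
  EMU_IRQ_OFF();         /* the check and the store must not be split */
  if (emu_reservation) {
    *addr = value;
    emu_reservation = 0;
    failed = 0;
  }
  EMU_IRQ_ON();          /* unconditional unmask: fine only if callers run unmasked */
  return failed;
}

/* Same shape as the M3 version in the question. */
static inline void do_inc(volatile uint32_t *dat) {
  while (emu_strex(emu_ldrex(dat) + 1, dat)) {}
}
```

Unlike the "disable in ldrex" idea, an abandoned emu_ldrex has no side effect beyond leaving the flag set until the next handler or store clears it.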

(BTW, side question: I wonder why STREX on the M3 stores the success/failure indication to a register rather than simply setting a flag? Its actual operation requires four extra bits in the opcode, requires that a register be available to hold the success/failure indication, and requires that a "cmp r0,#0" be used to determine if it succeeded. Was it expected that compilers wouldn't be able to handle a STREX intrinsic sensibly if they didn't get the result in a register? Getting Carry into a register takes two short instructions.)

~~Well... you still have SWP remaining, but it's a less powerful atomic instruction.~~

Interrupt disabling is sure to work though. :-)

Edit:

No SWP on -m0, sorry supercat.

OK, it seems you're left with only interrupt disabling. You can use this gcc-compilable inline asm as a guide to disabling and properly restoring interrupts: http://repo.or.cz/w/cbaos.git/blob/HEAD:/arch/arm-cortex-m0/include/lock.h
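A minimal sketch of the save/disable/restore idea, in case the link goes stale. This is not the code from the linked file; irq_save and irq_restore are names I made up, and only the mrs/cpsid/msr instructions and the PRIMASK register are real. The #else branch is a host stand-in with a fake PRIMASK so the sketch also compiles off-target.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_6M__)
/* Read PRIMASK, then mask interrupts; returns the previous PRIMASK value. */
static inline uint32_t irq_save(void) {
  uint32_t primask;
  __asm__ __volatile__("mrs %0, primask\n\tcpsid i"
                       : "=r"(primask) : : "memory");
  return primask;
}

/* Write back the saved PRIMASK; interrupts stay masked if they were before. */
static inline void irq_restore(uint32_t primask) {
  __asm__ __volatile__("msr primask, %0" : : "r"(primask) : "memory");
}
#else
/* Host stand-in (assumption for off-target builds): a simulated PRIMASK. */
static uint32_t fake_primask;

static inline uint32_t irq_save(void) {
  uint32_t old = fake_primask;
  fake_primask = 1;
  return old;
}

static inline void irq_restore(uint32_t primask) {
  fake_primask = primask;
}
#endif

/* Atomic increment for a single-core part without ldrex/strex. */
static inline void do_inc(volatile unsigned int *dat) {
  uint32_t state = irq_save();
  *dat += 1;
  irq_restore(state);
}
```

Restoring the saved state rather than unconditionally re-enabling means the helper is safe to call from code that already runs with interrupts masked, and the calls nest.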

The Cortex-M3 was designed for heavy low-latency, low-jitter multitasking, i.e. its interrupt controller cooperates with the core to guarantee the number of cycles between an interrupt triggering and its handler running. ldrex/strex was implemented as a way to cooperate with all that (by "all that" I mean interrupt masking and other details such as atomic bit setting via bit-band aliases); otherwise a single-core, non-MMU, non-cache µC would have little use for it. Without it, a low-priority task would have to hold a lock, and that could cause small priority inversions, with latency and jitter that a hard real-time system can't cope with, at least not within the order of magnitude allowed by the "retry" semantics of a failed ldrex/strex.

On a side note, and speaking strictly in terms of timing and jitter, the Cortex-M0 has a more traditional interrupt timing profile (i.e. it will not abort instructions on the core when an interrupt arrives), so it is subject to far more jitter and latency. In this respect (again, strictly timing), it's more comparable to older models (e.g. the arm7tdmi), which also lack atomic load/modify/store as well as atomic increments & decrements and other low-latency cooperative instructions, and so require interrupt disable/enable more often.

I use something like this in Cortex-M3:

#include <stdint.h>

#define unlikely(x) __builtin_expect((long)(x),0)

static inline int atomic_LL(volatile void *addr) {
  int dest;

  __asm__ __volatile__("ldrex %0, [%1]" : "=r" (dest) : "r" (addr));
  return dest;
}

static inline int atomic_SC(volatile void *addr, int32_t value) {
  int dest;

  __asm__ __volatile__("strex %0, %2, [%1]" :
          "=&r" (dest) : "r" (addr), "r" (value) : "memory");
  return dest;
}

/**
 * atomic Compare And Swap
 * @param addr Address
 * @param expected Expected value in *addr
 * @param store Value to be stored, if (*addr == expected).
 * @return 0  ok, 1 failure.
 */
static inline int atomic_CAS(volatile void *addr, int32_t expected,
        int32_t store) {
  int ret;

  do {
    if (unlikely(atomic_LL(addr) != expected))
      return 1;
  } while (unlikely((ret = atomic_SC(addr, store))));
  return ret;

}

In other words, it wraps ldrex/strex as the well-known Load-Linked and Store-Conditional primitives, and builds compare-and-swap semantics on top of them.

If your code does fine with only compare-and-swap, you can implement it for cortex-m0 like this:

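static inline int atomic_CAS(volatile void *addr, int32_t expected,
        int32_t store) {
  volatile int32_t *p = (volatile int32_t *)addr;
  int ret = 1;

  /* __interrupt_disable/__interrupt_enable are placeholders for whatever
   * your toolchain provides (e.g. "cpsid i" / "cpsie i"). */
  __interrupt_disable();
  if (*p == expected) {
    *p = store;
    ret = 0;
  }
  /* Unconditional re-enable: if callers may already run with interrupts
   * masked, save and restore PRIMASK instead. */
  __interrupt_enable();
  return ret;
}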

That's the most used pattern because some architectures originally only had it (x86 comes to mind).
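To show what code that "does fine with only compare-and-swap" looks like, here is a rough sketch of the usual snapshot-and-retry loop. cas32 is a stand-in with the same contract as atomic_CAS above (0 on success, 1 on failure); it, atomic_add_return and the CAS_IRQ_OFF/CAS_IRQ_ON hooks are names invented for this example, and the hooks default to no-ops so it compiles off-target.

```c
#include <stdint.h>

/* Hypothetical hooks: on the target these would mask/unmask interrupts. */
#ifndef CAS_IRQ_OFF
#define CAS_IRQ_OFF() ((void)0)
#define CAS_IRQ_ON()  ((void)0)
#endif

/* Compare-and-swap: returns 0 if *p was 'expected' and is now 'store'. */
static inline int cas32(volatile int32_t *p, int32_t expected, int32_t store) {
  int failed = 1;
  CAS_IRQ_OFF();
  if (*p == expected) {
    *p = store;
    failed = 0;
  }
  CAS_IRQ_ON();
  return failed;
}

/* Add delta to *p and return the new value. */
static inline int32_t atomic_add_return(volatile int32_t *p, int32_t delta) {
  int32_t old;
  do {
    old = *p;                            /* take a snapshot */
  } while (cas32(p, old, old + delta));  /* retry if *p changed meanwhile */
  return old + delta;
}
```

The loop body can be any pure computation on the snapshot, so the same shape works on the M3 (CAS built from ldrex/strex) and on the M0 (CAS built from interrupt masking).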

Emulating the LL/SC pattern on top of CAS seems ugly from where I stand, especially when the SC is more than a few instructions away from the LL. Although long LL/SC sequences are very common, ARM doesn't recommend them, especially on the Cortex-M3: any interrupt makes strex fail, so if you take too long between ldrex and strex your code will spend a lot of time in a loop retrying strex, which could be seen as abusing the pattern and defeating its own purpose.

As for your side question: on the Cortex-M3, strex returns its status in a register because the semantics were already defined by higher-level architectures (ldrex/strex exist in multi-core ARMs that were implemented before ARMv7-M was defined, and after it, where the cache controllers actually check the ldrex/strex addresses, i.e. strex fails only when the cache controller can't prove that the data line the load/store touches was unmodified).

If I were to speculate, I'd say the original design has these semantics because in the early days this kind of atomic was designed with libraries in mind: you'd return success/failure from functions implemented in assembler, which had to respect the ABI, and most ABIs (all that I know of) use a register or the stack, not the flags, to return values.

Also, compilers are better at register coloring than at working around clobbered flags when some other instruction uses them. Consider a complex operation that generates flags, with an ldrex/strex sequence in the middle of it and a following operation that needs the flags: the compiler would have to move the flags to a register, requiring extra instruction(s) anyway.

You can emulate missing instructions on Cortex-M0(+) cores in the HardFault handler and then return to just after the faulting instruction, even though the official ARMv6-M specification strongly recommends treating the HardFault exception as fatal and halting or resetting the chip without ever leaving the handler context.

The example code provided by m0FaultDispatch (ab)uses this capability to emulate another missing instruction (integer division). Unless you're very careful and know all possible causes of HardFaults on your chip, such an emulation could hide other valid HardFault causes, letting your code continue into uncharted waters.

And no emulation can come close to matching the performance expected of LDREX/STREX on ARMv7-M chips.

Edit: Emulating the mutual exclusion monitor requires wrapping all other exceptions with the MPU handler (aka HardFault again), some more normal form of trampoline code, or adding explicit support to all interrupt handlers.

STREX/LDREX are for multicore processors accessing items in memory that is shared across the cores. ARM did an unusually bad job of documenting that; you have to read between the lines in the AMBA/AXI, ARM and TRM docs to figure this out.

How it works: IF you have a core that supports STREX/LDREX, and IF you have a memory controller that supports exclusive access, then IF the memory controller sees the pair of exclusive operations with no other core accessing that memory in between, it returns EXOKAY rather than OKAY. The ARM docs tell chip designers that on a uniprocessor (one not implementing the multicore feature) you don't have to support EXOKAY and can just return OKAY. From a software perspective that breaks the LDREX/STREX pair for accesses that hit that logic (the software spins in an infinite loop because the store never reports success). The L1 cache does support it, though, so it feels like it works.

For uniprocessor and for cases where you are not accessing memory shared across the cores use SWP.

The -m0 supports neither ldrex/strex nor swp, but what are those basically getting you? They simply get you an access that is not affected by another access of your own. To prevent you from stomping on yourself, just disable interrupts for the duration, the way we have done atomic accesses since the dark ages. If you want protection from both yourself and a peripheral that can interfere, well, there is no way to get around that, and even a swap may not have helped.

So just disable interrupts around the critical section.