Parallel programming on x86 can be a hard job, especially on a multi-core CPU. Let's say we have a multi-core x86 CPU and several different multithread communication combinations:
- Single writer and single reader
- Single reader multiple writers
- Multiple readers and single writer
- Multiple readers and multiple writers
So which model is better (more efficient) for locking a shared memory region: Test&Set or Test&Test&Set, and when should each be used?
Here are two simple (no timeout) test procedures written in the Delphi IDE in x86 assembler:
procedure TestAndSet(const oldValue, newValue: cardinal; var destination);
asm
//eax = oldValue
//edx = NewLockValue
//ecx = destination = 32 bit pointer on lock variable 4 byte aligned
@RepeatSpinLoop:
push eax //Save lock oldValue (compared)
pause //CPU spin-loop hint
lock cmpxchg dword ptr [ecx], edx
pop eax //Restore eax as oldValue
jnz @RepeatSpinLoop //Repeat if cmpxchg wasn't successful
end;
procedure TestAndTestAndSet(const oldValue, newValue: cardinal; var destination);
asm
//eax = oldValue
//edx = NewLockValue
//ecx = destination = 32 bit pointer on lock variable 4 byte aligned
@RepeatSpinLoop:
push eax //Save lock oldValue (compared)
@SpinLoop:
pause //CPU spin-loop hint
cmp dword ptr [ecx], eax //Test before test&set
jnz @SpinLoop
lock cmpxchg dword ptr [ecx], edx
pop eax //Restore eax as oldValue
jnz @RepeatSpinLoop //Repeat if cmpxchg wasn't successful
end;
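For context, a minimal usage sketch (the lock variable and helper names are illustrative, not part of the question): 0 means unlocked, 1 means locked, and the lock is released with a plain store, which is atomic for an aligned 32-bit location on x86.
var
  LockVar: cardinal = 0;   //0 = unlocked, 1 = locked, 4-byte aligned

procedure AcquireLock;
begin
  //Wait until the lock goes from 0 (free) to 1 (taken by us)
  TestAndTestAndSet(0, 1, LockVar);
end;

procedure ReleaseLock;
begin
  LockVar := 0;            //aligned 32-bit store is atomic on x86
end;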
EDIT:
Intel's documentation mentions both approaches, Test&Set and Test&Test&Set. I want to establish in which cases each approach is better, i.e. when to use which. Check: Intel
Surely the first (TestAndSet) is better, because the second does not achieve much by repeating the test with cmp & jnz in between. While you are doing this, the destination value may change anyway, as it is not locked.
TTAS (#2) is good practice. "Lurking" and waiting for the "opportunity" before doing the CAS is common practice in both Java and .NET concurrent classes. That said, cmpxchg has received quite a lot of optimizations in the last few years, so you might get nearly identical results on the latest crop of processors.
What you should try in both cases, however, is to employ some exponential backoff when you spin.
Update
@GJ: You should find some more up-to-date documentation on Intel's site. Note the paragraph about not locking the bus since the 486, and the comparison chart of xchg and cmpxchg that shows they are practically identical.
Spinning on a read vs. on a locked instruction will still be a good idea to avoid some contention on getting the cache line in exclusive mode. (So TTAS.)
However, this will provide a useful gain only if you also implement e.g. an exponential back-off, even yielding the CPU after a while (sketched below).
The differences between TTAS and TAS, with or without backoff, would be smaller if you are using a single, modern multi-core CPU with an L3 cache shared between the cores, and would become more visible if you are using a multi-socket (e.g. server) machine or a multi-core CPU that has no shared cache between the cores. They would also differ based on the amount of contention (i.e. a light load would see a smaller difference between TTAS and TAS).
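To illustrate, here is a rough Pascal sketch of TTAS with exponential backoff and eventual yielding. This is my own illustration, not from the answer: TryLock, the backoff constants, and SwitchToThread (from the Windows unit) are assumed ingredients, and a real implementation would make sure the plain read of the lock is not optimized away.
function TryLock(var Lock: cardinal): boolean;
asm
  //eax = pointer to the lock variable (var parameter), result in al
  mov ecx, eax
  xor eax, eax                     //comparand: 0 = unlocked
  mov edx, 1                       //new value: 1 = locked
  lock cmpxchg dword ptr [ecx], edx
  setz al                          //True if we changed 0 -> 1
end;

procedure AcquireWithBackoff(var Lock: cardinal);
var
  Spins, i: cardinal;
begin
  Spins := 1;
  while not TryLock(Lock) do
  begin
    while Lock <> 0 do             //test before test&set (TTAS), plain read
    begin
      for i := 1 to Spins do
        asm pause end;             //spin-loop hint
      if Spins < 1024 then
        Spins := Spins * 2         //exponential backoff
      else
        SwitchToThread;            //yield the CPU after a while
    end;
  end;
end;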
I'd use the 2nd approach: a test with no lock, then a locked operation if the test succeeded, with some proposals (see the sketch after this list):
- use call SwitchToThread instead of pause
- put a call SwitchToThread in the not-locked repeat cmp loop
- put the call SwitchToThread only in case of the cmp/lock failure
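A rough sketch of what the second proposal might look like at the asm level (illustrative only; the name TestAndTestAndSetYield is mine, and since SwitchToThread is a stdcall Windows API that may clobber the volatile registers, eax/edx/ecx are saved around the call):
procedure TestAndTestAndSetYield(const oldValue, newValue: cardinal; var destination);
asm
  //eax = oldValue, edx = newValue, ecx = pointer to the lock variable
@RepeatSpinLoop:
  push eax                        //Save oldValue (compared)
@SpinLoop:
  cmp dword ptr [ecx], eax        //Test before test&set
  jz  @TryLock
  push eax                        //Preserve volatile registers across the API call
  push edx
  push ecx
  call SwitchToThread             //Yield the CPU while the lock is not free
  pop ecx
  pop edx
  pop eax
  jmp @SpinLoop
@TryLock:
  lock cmpxchg dword ptr [ecx], edx
  pop eax                         //Restore eax as oldValue
  jnz @RepeatSpinLoop             //Repeat if cmpxchg wasn't successful
end;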
In all cases, I think you'd better:
- use the Windows API for your synchronization; if you really want to handle low-level synchronization in your project, see Synchronization Functions on MSDN - Microsoft did the low-level and optimization work for you. Most of these calls are optimized asm code running in user mode, so they are very fast (a critical section with a spin count is sketched after this list)
- use a high-level multi-thread framework, which in practice will handle all these problems for you and will definitely scale well - see the Delphi OmniThreadLibrary
- use a dedicated memory manager, like NexusMM, TBBMM, or ScaleMM/SynScaleMM
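For instance, a minimal sketch of the first option using a critical section with a spin count (the spin count value and routine names are illustrative):
uses Windows;

var
  CS: TRTLCriticalSection;

procedure SetupLock;
begin
  //Spin up to 4000 times in user mode before falling back to a kernel wait
  InitializeCriticalSectionAndSpinCount(CS, 4000);
end;

procedure AccessSharedRegion;
begin
  EnterCriticalSection(CS);
  try
    //... read/write the shared memory region ...
  finally
    LeaveCriticalSection(CS);
  end;
end;

procedure TearDownLock;
begin
  DeleteCriticalSection(CS);
end;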