Compiler optimization breaks multi-threaded code_问答_开发者

After having learnt the hard way that shared variables are currently not guarded by memory barriers, I have now encountered another issue. Either I am doing something wrong, or the existing compiler optimization in dmd can break multi-threaded code by re-ordering reads of shared variables.

As an example, when I compile an executable with dmd -O (full optimization), the compiler happily optimizes away the local variable o in this code (where cas is the compare-and-swap function from core.atomic)

shared uint cnt;
void atomicInc  ( ) { uint o; do { o = cnt; } while ( !cas( &cnt, o, o + 1 ) );}

to something like this (see dis-assembly below):

shared uint cnt;
void atomicInc  ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }

In the "optimized" code cnt is read twice from memory, thereby running the risk that another thread has modified cnt in between. The optimization basically destroys the compare-and-swap algorithm.

Is this a bug, or is there a correct way to achieve the desired result? The only work-around I have found so far is to implement the code using assembler.

Full test code and additional details

For completeness, here is a full test code that shows both issues (no memory-barriers, and the optimization problem). It produces the following output on three different Windows machines for both dmd 2.049 and dmd 2.050 (assuming that Dekker's algorithm doesn't deadlock, which might happen):

dmd -O -run optbug.d
CAS   : failed
Dekker: failed

And the loop inside atomicInc gets compiled to this with full optimization:

; cnt is stored at 447C10h
; while ( !cas( &cnt, o, o + 1 ) ) o = cnt;
; 1) prepare call cas( &cnt, o, o + 1 ): &cnt and o go to stack, o+1 to eax
402027: mov    ecx,447C10h         ; ecx = &cnt
40202C: mov    eax,[447C10h]     ; eax = o1 = cnt
402031: inc    eax                 ; eax = o1 + 1 (third parameter)
402032: push   ecx                 ; push &cnt (first parameter)
    ; next instruction pushes current value of cnt onto stack
    ; as second parameter o instead of re-using o1
402033: push   [447C10h]    
402039: call   4020BC              ; 2) call cas    
40203E: xor    al,1                ; 3) test success
402040: jne    402027              ; no success try again
; end of main loop

Here is the test code:

import core.atomic;
import core.thread;
import std.stdio;

enum loops = 0xFFFF;
shared uint cnt;

/* *****************************************************************************
 Implement atomicOp!("+=")(cnt, 1U); with CAS. The code below doesn't work with
 the "-O" compiler flag because cnt is read twice while calling cas and another
 thread can modify cnt in between.
*/
enum threads = 8;

void atomicInc  ( ) { uint o; do { o = cnt; } while ( !cas( &cnt, o, o + 1 ) )开发者_StackOverflow;}
void threadFunc ( ) { foreach (i; 0..loops) atomicInc; }

void testCas ( ) {
    cnt = 0;
    auto tgCas = new ThreadGroup;
    foreach (i; 0..threads) tgCas.create(&threadFunc);
    tgCas.joinAll;
    writeln( "CAS   : ", cnt == loops * threads ? "passed" : "failed" );
}

/* *****************************************************************************
 Dekker's algorithm. Fails on ia32 (other than atom) because ia32 can re-order 
 read before write. Most likely fails on many other architectures.
*/
shared bool flag1 = false;
shared bool flag2 = false;
shared bool turn2 = false;   // avoids starvation by executing 1 and 2 in turns

void dekkerInc ( ) {
    flag1 = true;
    while ( flag2 ) if ( turn2 ) {
        flag1 = false; while ( turn2 )  {  /* wait until my turn */ }
        flag1 = true;
    }
    cnt++;                   // shouldn't work without a cast
    turn2 = true; flag1 = false;
}

void dekkerDec ( ) {
    flag2 = true;
    while ( flag1 ) if ( !turn2 ) {
        flag2 = false; while ( !turn2 ) { /* wait until my turn */ }
        flag2 = true;
    }
    cnt--;                   // shouldn't work without a cast
    turn2 = false; flag2 = false;
}

void threadDekkerInc ( ) { foreach (i; 0..loops) dekkerInc; }
void threadDekkerDec ( ) { foreach (i; 0..loops) dekkerDec; }

void testDekker ( ) {
    cnt = 0;
    auto tgDekker = new ThreadGroup;
    tgDekker.create( &threadDekkerInc );
    tgDekker.create( &threadDekkerDec );
    tgDekker.joinAll;
    writeln( "Dekker: ", cnt == 0 ? "passed" : "failed" );
}

/* ************************************************************************** */
void main() {
    testCas;
    testDekker;
}

While the issues still seem to exist, core.atomic now exposes atomicLoad which enables a relative straightforward workaround. To make the cas example work, it suffices to load cnt atomically:

void atomicInc  ( ) { 
    uint o; 
    do {
         o = atomicLoad(cnt); 
    } while ( !cas( &cnt, o, o + 1 ) );
}

Similarly, to make Dekker's algorithm work:

// ...
while ( atomicLoad(flag2) ) if ( turn2 ) {
// ...
while ( atomicLoad(flag1) ) if ( !turn2 ) {
// ...

For architectures other than ia32 (ignoring string operations and SSE) that can also re-order

reads relative to reads
or writes relative to writes
or writes and reads to the same memory location

additional memory barriers would be required.

Yes, code it in assembler. If you skip using the cas() function and just write your entire atomicInt function in assembly, it's only going to be a few lines of code. Until you do so, you're probably going to be fighting against the compiler's optimizations.

On top of all that, you can use the x86 LOCK INC instruction instead of CAS and you should be able to reduce the function to just a line or two of assembly.