First of all, I know that lock {} is syntactic sugar for the Monitor class.
I was playing with simple multithreading problems and discovered that I cannot totally understand how locking some arbitrary WORD of memory keeps the whole of other memory from being cached in registers/the CPU cache, etc. It's easier to use code samples to explain what I mean:
for (int i = 0; i < 100 * 1000 * 1000; ++i) {
ms_Sum += 1;
}
In the end ms_Sum will contain 100 000 000, which is, of course, expected.
Now we are going to execute the same cycle, but on 2 different threads and with the upper limit halved.
for (int i = 0; i < 50 * 1000 * 1000; ++i) {
ms_Sum += 1;
}
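For completeness, here is a minimal sketch of how the two-thread experiment might be wired up; the Thread setup and field declarations are assumptions, since the question only shows the loop body:

using System;
using System.Threading;

static class Program
{
    static int ms_Sum;                             // shared counter, no synchronization
    static readonly object ms_Lock = new object(); // lock object used in later variants

    static void Main()
    {
        Thread t1 = new Thread(Work);
        Thread t2 = new Thread(Work);
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        Console.WriteLine(ms_Sum);                 // typically well below 100 000 000
    }

    static void Work()
    {
        for (int i = 0; i < 50 * 1000 * 1000; ++i)
            ms_Sum += 1;                           // unsynchronized read-modify-write
    }
}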
Because of the lack of synchronization we get an incorrect result - on my 4-core machine it is a random number near 52 388 219, which is slightly larger than half of 100 000 000.
If we enclose ms_Sum += 1; in lock {}, we would, of course, get the absolutely correct result, 100 000 000. But what's interesting to me (truthfully, I was expecting behavior like this) is that adding an empty lock before or after the ms_Sum += 1; line makes the answer almost correct:
for (int i = 0; i < 50 * 1000 * 1000; ++i) {
lock (ms_Lock) {}; // Note curly brackets
ms_Sum += 1;
}
For this case I usually get ms_Sum = 99 999 920, which is very close.
Question: why exactly does lock(ms_Lock) { ms_Sum += 1; } make the program completely correct, but lock(ms_Lock) {}; ms_Sum += 1; only almost correct? How does locking an arbitrary ms_Lock variable make the whole of memory stable?
Thanks a lot!
P.S. Off to read books about multithreading.
SIMILAR QUESTION(S)
- How does the lock statement ensure intra processor synchronization?
- Thread synchronization. Why exactly this lock isn't enough to synchronize threads
why exactly does lock(ms_Lock) { ms_Sum += 1; } make the program completely correct, but lock(ms_Lock) {}; ms_Sum += 1; only almost correct?
Good question! The key to understanding this is that a lock does two things:
- It causes any thread that contests the lock to pause until the lock can be taken
- It causes a memory barrier, also sometimes called a "full fence"
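Since, as the question notes, lock is syntactic sugar for Monitor, here is roughly what the compiler expands it to. This is a sketch of the classic expansion; since C# 4 the compiler additionally tracks a lockTaken flag for exception safety:

Monitor.Enter(ms_Lock);    // blocks until the lock is free; acts as a full fence
try
{
    ms_Sum += 1;           // protected region: only one thread at a time runs this
}
finally
{
    Monitor.Exit(ms_Lock); // releases the lock; also a full fence
}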
I do not totally understand how locking some arbitrary object prevents other memory from being cached in registers/the CPU cache, etc.
As you note, caching memory in registers or the CPU cache can cause odd things to happen in multithreaded code. (See my article on volatility for a gentle explanation of a related topic.) Briefly: if one thread makes a copy of a page of memory in the CPU cache before another thread changes that memory, and then the first thread does a read from the cache, then effectively the first thread has moved the read backwards in time. Similarly, writes to memory can appear to be moved forwards in time.
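As a concrete illustration of a read moving backwards in time (a standard example, not taken from the answer itself): without volatile, the JIT is free to hoist the read of a flag into a register, so the loop below can spin forever even after another thread sets the flag.

static bool _stop; // try marking this volatile to forbid the hoisting

static void Spin()
{
    // The read of _stop may be cached in a register and never re-fetched,
    // so this loop may never observe another thread's write of true.
    while (!_stop) { }
}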
A memory barrier is like a fence in time that tells the CPU "do what you need to do to ensure that reads and writes that are moving around through time cannot move past the fence".
An interesting experiment would be to put a call to Thread.MemoryBarrier() in there instead of the empty lock and see what happens. Do you get the same results or different ones? If you get the same result, then it is the memory barrier that is helping. If you do not, then what is making the program almost correct is that the empty lock slows the threads down enough to prevent most races.
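Concretely, the suggested experiment might look like this (a sketch):

for (int i = 0; i < 50 * 1000 * 1000; ++i)
{
    Thread.MemoryBarrier(); // full fence only: no mutual exclusion, no blocking
    ms_Sum += 1;            // still an unsynchronized read-modify-write
}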
My guess is that it is the latter: the empty locks are slowing the threads down enough that they are not spending most of their time in the code that has a race condition. Memory barriers are not typically necessary on strong memory model processors. (Are you on an x86 machine, or an Itanium, or what? x86 machines have a very strong memory model, Itaniums have a weak model that needs memory barriers.)
You don't say how many threads you used, but I am guessing two - if you ran with four threads, I'd expect the unlocked version to wind up with a result that's reasonably close to 1/4 of the single-threaded version's 'correct' result.
When you don't use lock, your quad-proc machine allocates a thread to each CPU (this discounts, for simplicity, the presence of other apps that will also get scheduled in turn) and they run at full speed, free of interference with each other. Each thread gets the value from memory, increments it, and stores it back to memory. The stored result overwrites what's there, which means that, since you have 2 (or 3, or 4) threads running at full speed at the same time, some of the increments made by threads on the other cores effectively get thrown away. Your final result is thus lower than what you got from a single thread.
When you add the lock statement, this tells the CLR (this looks like C#?) to ensure that only one thread, on any available core, can execute that code. This is a critical change from the situation above, since the multiple threads now interfere with each other, even though, as you realise, this code is still not thread-safe (just close enough to it to be dangerous). This incorrect serialization has the side effect that the ensuing increment executes concurrently less often, since the implied unlock requires awakening any threads that were waiting for the lock, which is expensive, at least in the terms of this code and your multi-core CPU. This multi-threaded version will also run slower than the single-threaded version because of this overhead. Threads do not always make code faster.
While any waiting threads are waking up from their wait state, the lock-releasing thread can continue running in its time-slice, and it often will get, increment, and store the variable before the awakening threads get a chance to take their own copy of the variable from memory for their own increment op. Thus you wind up with a final value that's close to the single-threaded version, or to what you'd get if you lock-ed the increment inside the loop.
Check out the Interlocked class for a hardware-level way to treat variables of a certain type atomically.
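For this particular counter, that would look like the sketch below; the atomic increment makes the result exactly correct without any lock:

for (int i = 0; i < 50 * 1000 * 1000; ++i)
{
    // One atomic hardware read-modify-write; no increments can be lost.
    Interlocked.Increment(ref ms_Sum);
}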
If you have no locking around the shared variable ms_Sum, then both threads are able to access the ms_Sum variable and increment the value without restriction. 2 threads running in parallel on a dual-core machine will both operate on the variable at the same time.
Memory: ms_Sum = 5
Thread1: ms_Sum += 1: ms_Sum = 5+1 = 6
Thread2: ms_Sum += 1: ms_Sum = 5+1 = 6 (running in parallel).
Here is a rough breakdown of the order in which things happen, as best I can explain:
1: ms_Sum = 5
2: (Thread 1) ms_Sum += 1;
3: (Thread 2) ms_Sum += 1;
4: (Thread 1) "read value of ms_Sum" -> 5
5: (Thread 2) "read value of ms_Sum" -> 5
6: (Thread 1) ms_Sum = 5+1 = 6
7: (Thread 2) ms_Sum = 5+1 = 6
It makes sense that with no synchronization/locking you get a result of roughly half the expected total since 2 threads can do things "almost" twice as fast.
With proper synchronization, i.e. lock(ms_Lock) { ms_Sum += 1; }, the order changes to be more like this:
1: ms_Sum = 5
2: (Thread 1) OBTAIN LOCK. ms_Sum += 1;
3: (Thread 2) WAIT FOR LOCK.
4: (Thread 1) "read value of ms_Sum" -> 5
5: (Thread 1) ms_Sum = 5+1 = 6
6: (Thread 1) RELEASE LOCK.
7: (Thread 2) OBTAIN LOCK. ms_Sum += 1;
8: (Thread 2) "read value of ms_Sum" -> 6
9: (Thread 2) ms_Sum = 6+1 = 7
10: (Thread 2) RELEASE LOCK.
As for why lock(ms_Lock) {}; ms_Sum += 1; is "almost" correct, I think you're just getting lucky. The lock forces each thread to slow down and wait its turn to obtain and release the lock. The fact that the arithmetic operation ms_Sum += 1; is so trivial (it runs very fast) is probably why the result is "almost" OK. By the time thread 2 has performed the overhead of obtaining and releasing the lock, the simple arithmetic is likely already done by thread 1, so you get close to the desired result. If you were doing something more complex (taking more processing time), you'd find that it wouldn't get as close to your desired result.
We have been discussing this with deafsheep, and our current idea can be represented by the following schema. Time runs left to right, and the 2 threads are represented by two rows, where:
- a black box represents the process of acquiring, holding and releasing the lock
- a plus represents an addition operation (the schema reflects the scale on my PC: a lock takes approximately 20 times longer than an add; see the sketch below)
- a white box represents the period consisting of trying to acquire the lock and then awaiting it to become available
The order of the black boxes is always like this: they cannot overlap, and they always follow each other very closely. Consequently, it becomes very logical that the pluses never overlap, and we should arrive precisely at the expected sum.
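A crude sketch of how that roughly-20x ratio could be estimated; this is an assumed micro-benchmark, not the code actually used, and it times an uncontended lock round-trip against a bare add:

using System;
using System.Diagnostics;

const int N = 10 * 1000 * 1000;
object gate = new object();
int sum = 0;

var sw = Stopwatch.StartNew();
for (int i = 0; i < N; ++i) sum += 1;            // bare adds
long addTicks = sw.ElapsedTicks;

sw.Restart();
for (int i = 0; i < N; ++i) lock (gate) { }      // empty lock/unlock round-trips
long lockTicks = sw.ElapsedTicks;

Console.WriteLine(sum);                          // keep the add loop observable
Console.WriteLine((double)lockTicks / addTicks); // lock cost in units of one add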
The source of the remaining error is explored in this question: Thread synchronization. Why exactly this lock isn't enough to synchronize threads
Here is the answer.
I didn't read all the other answers all the way through because they were too long and I saw stuff that wasn't right, and the answer doesn't need to be that long. Maybe the answer by Sedat was the closest. It doesn't really have anything to do with the lock statement "slowing down" the program.
It has to do with the cache synchronization of ms_Sum between the 2 threads. Each thread has its own cached copy of ms_Sum.
In your 1st example, because you are not using lock, you are leaving it up to the OS as to when to do the sync (when to copy the updated cache value back to main memory, or when to read it from main memory into the cache). So each thread is basically updating its own copy of ms_Sum. Now, the sync does happen from time to time, but not on every thread context switch, which causes the result to be a little more than 50,000,000. If it happened on every thread context switch, you would get 100,000,000.
In the 2nd example, ms_Sum is synced on every iteration. This keeps ms_Sum #1 and ms_Sum #2 pretty well synced. So you are going to get almost 100,000,000, but it's not going to be all the way to 100,000,000, because every time the thread context switches, ms_Sum can be off by 1, since you have the += happening outside the lock.
Now, in general, exactly which parts of the various threads' caches are synced when lock is called is a little bit unknown to me. But because of your result of almost 100,000,000 in your 2nd example, I can see that your lock call is causing ms_Sum to be synced.