Okay, I have a question regarding threads.
There are two unsynchronized threads running simultaneously and using a global resource "int num". The 1st:
void Thread()
{
    int i;
    for ( i=0 ; i < 100000000; i++ )
    {
        num++;
        num--;
    }
}
2nd:
void Thread2()
{
    int j;
    for ( j=0 ; j < 100000000; j++ )
    {
        num++;
        num--;
    }
}
The question asks: what are the possible values of the variable "num" at the end of the program? I would say 0 should be the value of num at the end, but if you try and run this code you will find that the result is quite random, and I can't understand why.
The full code:
#include <windows.h>
#include <process.h>
#include <stdio.h>

int static num=0;

void Thread()
{
    int i;
    for ( i=0 ; i < 100000000; i++ )
    {
        num++;
        num--;
    }
}

void Thread2()
{
    int j;
    for ( j=0 ; j < 100000000; j++ )
    {
        num++;
        num--;
    }
}

int main()
{
    long handle,handle2,code,code2;
    handle=_beginthread( Thread, 0, NULL );
    handle2=_beginthread( Thread2, 0, NULL );
    while( (GetExitCodeThread(handle,&code)||GetExitCodeThread(handle2,&code2))!=0 );
    TerminateThread(handle, code );
    TerminateThread(handle2, code2 );
    printf("%d ",num);
    system("pause");
}
num++ and num-- don't have to be atomic operations. To take num++ as an example, it is probably implemented like:

int tmp = num;
tmp = tmp + 1;
num = tmp;

where tmp is held in a CPU register.
Now let's say that num == 0, both threads try to execute num++, and the operations are interleaved as follows:

Thread A              Thread B
int tmp = num;
tmp = tmp + 1;
                      int tmp = num;
                      tmp = tmp + 1;
                      num = tmp;
num = tmp;

The result at the end will be num == 1, even though it should have been incremented twice. Here, one increment is lost; in the same way, a decrement could be lost as well.
In pathological cases, all increments of one thread could be lost, resulting in num == -100000000, or all decrements of one thread could be lost, resulting in num == +100000000. There may even be more extreme scenarios lurking out there.
Then there's also other business going on, because num isn't declared as volatile. Both threads will therefore assume that the value of num doesn't change unless they are the one changing it. This allows the compiler to optimize away the entire for loop, if it feels so inclined!
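To illustrate what that means in practice, an optimizer would be within its rights to reduce Thread() to something like the following (an illustrative sketch, not the guaranteed output of any particular compiler):

/* Because num is neither volatile nor atomic and the paired ++/-- cancel out,
   an optimizer may treat the loop as having no observable effect and remove it. */
void Thread()
{
    /* entire loop elided; num is never read or written */
}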
The possible values for num include all possible int values, plus floating point values, strings, and jpegs of nasal demons. Once you invoke undefined behavior, all bets are off.
More specifically, modifying the same object from multiple threads without synchronization results in undefined behavior. On most real-world systems, the worst effects you see will probably be missing or double increments or decrements, but it could be much worse (memory corruption, crashing, file corruption, etc.). So just don't do it.
The upcoming C and C++ standards (C11 and C++11) will include atomic types which can be safely accessed from multiple threads without any synchronization API.
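For example, here is a minimal sketch of the same loop written with C11's <stdatomic.h>, assuming a compiler that supports it (the thread signature and parameter name are just for illustration):

#include <stdatomic.h>

static atomic_int num = ATOMIC_VAR_INIT(0);

void Thread( void *unused )
{
    int i;
    for ( i = 0; i < 100000000; i++ )
    {
        atomic_fetch_add( &num, 1 );   /* atomic num++ */
        atomic_fetch_sub( &num, 1 );   /* atomic num-- */
    }
}

With both threads written this way, num is guaranteed to be 0 once they have finished.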
You speak of threads running simultaneously, which actually might not be the case if you only have one core in your system. Let's assume that you have more than one.
In the case of multiple devices having access to main memory, whether CPUs, bus-mastering devices, or DMA, they must be synchronized. This is handled by the lock prefix (implicit for the instruction xchg). It asserts a physical wire on the system bus which essentially signals all devices present to stay away. It is, for example, part of the Win32 function EnterCriticalSection.
So in the case of two cores on the same chip accessing the same memory location, the result would still be undefined, which may seem strange considering that some synchronization ought to occur since they share the same L3 cache (if there is one). Seems logical, but it doesn't work that way. Why? Because a similar case occurs when the two cores are on different chips (i.e. without a shared L3 cache): you can't expect them to be synchronized. Well, you can, but then consider all the other devices having access to main memory. If you plan to synchronize between two CPU chips, you can't stop there: you have to perform a full-blown synchronization that blocks out all devices with memory access, and for that synchronization to succeed, all the other devices need time to recognize that it has been requested, which takes a long time, especially if a device has been granted the bus and is performing a bus-mastering operation which must be allowed to complete. The PCI bus will perform an operation every 0.125 us (8 MHz), and considering that your CPUs run at roughly 400 times that frequency, you're looking at A LOT of wait states. Then consider that several PCI clock cycles might be required.
You could argue that a medium type of lock (memory bus only) should exist, but this would mean an additional pin on every processor and additional logic in every chipset, just to handle a case which is really a misunderstanding on the programmer's part. So it's not implemented.
To sum it up: a generic synchronization that would handle your situation would render your PC useless because it would always have to wait for the last device to check in and OK the synchronization. It is a better solution to leave it optional and only insert wait states when the developer has determined that it is absolutely necessary.
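For completeness, this is roughly what inserting those wait states yourself looks like in Win32 terms, a minimal sketch using the EnterCriticalSection call mentioned above (the thread signature and parameter name are illustrative):

#include <windows.h>

static CRITICAL_SECTION cs;   /* call InitializeCriticalSection(&cs) in main() before starting the threads */
static int num = 0;

void Thread( void *unused )
{
    int i;
    for ( i = 0; i < 100000000; i++ )
    {
        EnterCriticalSection( &cs );   /* uses lock-prefixed operations internally */
        num++;
        num--;
        LeaveCriticalSection( &cs );
    }
}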
This was so much fun that I played a little with the example code and added spinlocks to see what would happen. The spinlock components were:

// prototypes
char spinlock_failed (spinlock *);
void spinlock_leave (spinlock *);

// application code
while (spinlock_failed (&sl)) ++n;
++num;
spinlock_leave (&sl);
while (spinlock_failed (&sl)) ++n;
--num;
spinlock_leave (&sl);
spinlock_failed was constructed around the "xchg mem,eax" instruction. Once it failed at not setting the spinlock (i.e. succeeded at setting it, so the while loop exits), spinlock_leave would simply release it with "mov mem,0". The "++n" counts the total number of retries.
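The implementations themselves are not shown above; here is a minimal sketch of what they might look like, using the MSVC _InterlockedExchange intrinsic in place of hand-written assembly (the spinlock type and its field name are assumptions):

#include <intrin.h>

typedef struct { volatile long lock; } spinlock;   /* assumed layout: 0 = free, 1 = taken */

/* Returns nonzero if the lock was already taken, i.e. the caller must retry. */
char spinlock_failed (spinlock *sl)
{
    /* xchg with a memory operand is implicitly locked on x86 */
    return (char) _InterlockedExchange (&sl->lock, 1);
}

void spinlock_leave (spinlock *sl)
{
    sl->lock = 0;   /* the "mov mem,0" release described above */
}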
I changed the loop count to 2.5 million (because with two threads and two spinlocks per loop iteration I get 10 million spinlocks, nice and easy to round with) and timed the sequences with the "rdtsc" count on a dual-core Athlon II M300 @ 2 GHz. This is what I found:
- Running one thread without per-iteration timing (only the main loop was timed) and without locks (as in the original example): 33748884 cycles <=> 16.9 ms => 13.5 cycles/loop.
- Running one thread with spinlocks, i.e. no other core trying to take them, took 210917969 cycles <=> 105.5 ms => 84.4 cycles/loop <=> 0.042 us/loop. The spinlocks required 112581340 cycles <=> 22.5 cycles per spinlocked sequence. Still, the slowest spinlock required 1334208 cycles: that's 667 us, or only 1500 per second.
So, the addition of spinlocks, even unaffected by another CPU, added several hundred percent to the total execution time. The final value in num was 0.
- Running two threads without spinlocks took 171157957 cycles <=> 85.6 ms => 68.5 cycles/loop. num contained 10176.
- Two threads with spinlocks took 4099370103 cycles <=> 2049 ms => 1640 cycles/loop <=> 0.82 us/loop. The spinlocks required 3930091465 cycles => 786 cycles per spinlocked sequence. The slowest spinlock required 27038623 cycles: that's 13.52 ms, or only 74 per second. num contained 0.
Incidentally, the 171157957 cycles for two threads without spinlocks compare very favorably with two threads with spinlocks once the spinlock time is removed: 4099370103 - 3930091465 = 169278638 cycles.
For my sequence the spinlock competition caused 21-29 million retries per thread, which comes out to 4.2-5.8 retries per spinlock, or 5.2-6.8 tries per spinlock. The addition of spinlocks caused an execution-time penalty of 1927% (1500/74 - 1). The slowest spinlock required 5-8% of all tries.
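The timing harness itself isn't shown; a minimal sketch of the rdtsc-based measurement described above, using the MSVC __rdtsc() intrinsic (the function name and the loop placeholder are assumptions):

#include <intrin.h>
#include <stdio.h>

void timed_run (void)
{
    unsigned __int64 start = __rdtsc ();   /* read the time-stamp counter before the work */

    /* ... run the 2.5 million loop iterations under test here ... */

    unsigned __int64 stop = __rdtsc ();
    printf ("%llu cycles\n", (unsigned long long) (stop - start));
}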
As Thomas said, the results are unpredictable because your increment and decrement are non-atomic. You can use InterlockedIncrement and InterlockedDecrement -- which are atomic -- to see a predictable result.
- Interlocked Variable Access (MSDN)
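A minimal sketch of the loop body rewritten with those functions (note that the Interlocked APIs require num to be a LONG; the thread signature and parameter name are illustrative):

#include <windows.h>

static volatile LONG num = 0;

void Thread( void *unused )
{
    int i;
    for ( i = 0; i < 100000000; i++ )
    {
        InterlockedIncrement( &num );   /* atomic num++ */
        InterlockedDecrement( &num );   /* atomic num-- */
    }
}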