Out of Order Execution and Memory Fences_问答_开发者

I know开发者_开发百科 that modern CPUs can execute out of order, However they always retire the results in-order, as described by wikipedia.

"Out of Oder processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal."

Now memory fences are said to be required when using multicore platforms, because owing to Out of Order execution, wrong value of x can be printed here.

Processor #1:
 while f == 0
  ;
 print x; // x might not be 42 here

Processor #2:
 x = 42;
 // Memory fence required here
 f = 1

Now my question is, since Out of Order Processors (Cores in case of MultiCore Processors I assume) always retire the results In-Order, then what is the necessity of Memory fences. Don't the cores of a multicore processor sees results retired from other cores only or they also see results which are in-flight?

I mean in the example I gave above, when Processor 2 will eventually retire the results, the result of x should come before f, right? I know that during out of order execution it might have modified f before x but it must have not retired it before x, right?

Now with In-Order retiring of results and cache coherence mechanism in place, why would you ever need memory fences in x86?

This tutorial explains the issues: http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf

FWIW, where memory ordering issues happen on modern x86 processors, the reason is that while the x86 memory consistency model offers quite strong consistency, explicit barriers are needed to handle read-after-write consistency. This is due to something called the "store buffer".

That is, x86 is sequentially consistent (nice and easy to reason about) except that loads may be reordered wrt earlier stores. That is, if the processor executes the sequence

store x
load y

then on the processor bus this may be seen as

load y
store x

The reason for this behavior is the afore-mentioned store buffer, which is a small buffer for writes before they go out on the system bus. Load latency is, OTOH, a critical issue for performance, and hence loads are permitted to "jump the queue".

See Section 8.2 in http://download.intel.com/design/processor/manuals/253668.pdf

The memory fence ensures that all changes to variables before the fence are visible to all other cores, so that all cores have an up to date view of the data.

If you don't put a memory fence, the cores might be working with wrong data, this can be seen especially in scenario's, where multiple cores would be working on the same datasets. In this case you can ensure that when CPU 0 has done some action, that all changes done to the dataset are now visible to all other cores, whom can then work with up to date information.

Some architectures, including the ubiquitous x86/x64, provide several memory barrier instructions including an instruction sometimes called "full fence". A full fence ensures that all load and store operations prior to the fence will have been committed prior to any loads and stores issued following the fence.

If a core were to start working with outdated data on the dataset, how could it ever get the correct results? It couldn't no matter if the end result were to be presented as-if all was done in the right order.

The key is in the store buffer, which sits between the cache and the CPU, and does this:

Store buffer invisible to remote CPUs

Store buffer allows writes to memory and/or caches to be saved to optimize interconnect accesses

That means that things will be written to this buffer, and then at some point will the buffer be written to the cache. So the cache could contain a view of data that is not the most recent, and therefore another CPU, through cache coherency, will also not have the latest data. A store buffer flush is necessary for the latest data to be visible, this, I think is essentially what the memory fence will cause to happen at hardware level.

EDIT:

For the code you used as an example, Wikipedia says this:

A memory barrier can be inserted before processor #2's assignment to f to ensure that the new value of x is visible to other processors at or prior to the change in the value of f.

Just to make explicit what is implicit in the previous answers, this is correct, but is distinct from memory accesses:

CPUs can execute out of order, However they always retire the results in-order

Retirement of the instruction is separate from performing the memory access, the memory access may complete at a different time to instruction retirement.

Each core will act as if it's own memory accesses occur at retirement, but other cores may see those accesses at different times.

(On x86 and ARM, I think only stores are observably subject to this, but e.g., Alpha may load an old value from memory. x86 SSE2 has instructions with weaker guarentees than normal x86 behaviour).

PS. From memory the abandoned Sparc ROCK could in fact retire out-of-order, it spent power and transistors determining when this was harmless. It got abandoned because of power consumption and transistor count... I don't believe any general purpose CPU has been bought to market with out-of-order retirement.