Edit: My original example had a silly mistake. After fixing it I still get weird results, though.
In my naive attempt to measure my CPU speed the "brute-force" way, I made the program below:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#pragma comment(linker, "/en开发者_StackOverflow社区try:mainCRTStartup")
#pragma comment(linker, "/Subsystem:Console")
int mainCRTStartup()
{
char buf[20];
clock_t start, elapsed;
unsigned long count = 0;
start = clock();
__asm
{
mov EAX, 0;
_loop:
add EAX, 3; // accounts for itself and next 2 instructions
cmp EAX, 0xFFFFFFFF - 0x400;
jb _loop;
mov count, EAX;
}
elapsed = clock() - start;
_gcvt(count * (long long)CLOCKS_PER_SEC / (elapsed * 1000000000.0), 3, buf);
puts(buf);
}
Which disassembles into something like:
mainCRTStartup:
push ebp
mov ebp,esp
sub esp,28h
mov dword ptr [count],0
call dword ptr [_clock]
mov dword ptr [start],eax
mov eax,0
_loop:
add eax,03h
cmp eax,0FFFFFBFFh
jb _loop
mov dword ptr [count],eax
call dword ptr [_clock]
sub eax,dword ptr [start]
... // call _gcvt, _puts, etc.
mov esp,ebp
pop ebp
ret
Notice that the loop is 3 instructions, so the final value of eax
should be the total number of instructions.
Why do I get 4.2 when I run this?
Because instruction-level parallelism and superscalar architecture allow multiple instructions to execute in a single pipelined clock cycle.
For example, in your code, branch prediction effectively eliminates the cmp
instruction for all but the last _loop
iteration, by:
- executing
cmp
andjb
in parallel, and - always taking the
jb
branch.
Of course, (2) is thrown out on the last iteration, which causes the pipeline to be cleared. The extra ~20 cycles (for a 20-stage pipeline) are negligible since your loop is on the order of 10^9 instructions.
the compiler shouldn't be optimizing this
The processor hardware is always looking for optimization opportunities in the datapath; compilers just try to organize instructions to exploit a given architecture's patterns. E.g., hardware pipelining can increase IPC without software pipelining, especially for relatively hazard-free code such as your example.
Because CPU speed is not measured in bytes per sec, but instruction cycles per sec, and especially on x86, some instructions take more than 1 cycle.
See this page for instruction timing. (Actually, this only goes up to 486 - still looking for a good reference for modern processors).
How many cycles an instruction takes to execute has no direct relation to it's size in bytes. In addition with the modern CPU features like multiple execution units and speculative execution it's not really possible to determine before hand how long a given chunk of code will take to execute with much accuracy.
You could measure the loop using the rdtsc instruction which counts the cycles of a CPU's internal frequency. The difference between two readings is the number of cycles elapsed. Let your code execute 1000 loops,multiply by three (instructions in a loop) and divide by the elapsed cycles. This will give you instructions per cycle. Which you then can scale to your CPU's own frequency.
Keep in mind that since your code is so short it will most likely execute from cache level 1 (or inside the prefetcher?) which makes it valid for that case only and not for the CPU in general. It may be too short for the pipelining to make something worthwhile out of.
As to instruction timing this page appears more up to date than the one AShelly suggested. It is regularly recomputed by Torbjörn Granlund of the Swedish Royal Institute of Technology.
精彩评论