I found the following question: Is fastcall really faster?
No clear answers for x86 were given, so I decided to create a benchmark.
Here is the code:
#include <stdio.h>
#include <tchar.h>
#include <time.h>
int __fastcall func(int i)
{
    return i + 5;
}
int __stdcall func2(int i)
{
    return i + 5;
}
int _tmain(int argc, _TCHAR* argv[])
{
    int iter = 100;
    int x = 0;
    clock_t t = clock();
    for (int j = 0; j <= iter; j++)
        for (int i = 0; i <= 1000000; i++)
            x = func(x & 0xFF);
    printf("%ld\n", (long)(clock() - t)); // elapsed clock ticks, fastcall
    t = clock();
    for (int j = 0; j <= iter; j++)
        for (int i = 0; i <= 1000000; i++)
            x = func2(x & 0xFF);
    printf("%ld\n", (long)(clock() - t)); // elapsed clock ticks, stdcall
    printf("%d", x); // use x so the loops are not discarded entirely
    return 0;
}
Without optimization, the result in MSVC 10 is:
4671
4414
With max optimization, fastcall is sometimes faster, but I guess that is multitasking noise. Here is the average result (with iter = 5000):
6638
6487
stdcall looks faster!
Here are the results for GCC: http://ideone.com/hHcfP
Again, fastcall lost the race.
Here is part of the disassembly for fastcall:
011917EF pop ecx
011917F0 mov dword ptr [ebp-8],ecx
return i + 5;
011917F3 mov eax,dword ptr [i]
011917F6 add eax,5
This is for stdcall:
return i + 5;
0119184E mov eax,dword ptr [i]
01191851 add eax,5
i is passed via ECX instead of the stack, but it is saved onto the stack in the function body! So the whole effect is negated! This simple function could be calculated using only registers, yet there is no real difference between the two versions.
Can anyone explain what the reason for fastcall is? Why doesn't it give a speedup?
Edit: With optimization, it turned out that both functions were inlined. When I turned inlining off, they were both compiled to:
00B71000 add eax,5
00B71003 ret
This looks like a great optimization, indeed, but it doesn't respect the calling conventions at all, so the test is not fair.
__fastcall was introduced a long time ago. At the time, Watcom C++ was beating Microsoft on optimization, and a number of reviewers picked out its register-based calling convention as one (possible) reason why.
Microsoft responded by adding __fastcall, and they've retained it ever since, but I don't think they ever did much more than enough to be able to say "we have a register-based calling convention too..." Their preference (especially since the 32-bit migration) seems to be for __stdcall. They've put quite a bit of work into improving their code generation with it, but (apparently) not nearly so much with __fastcall. With on-chip caching, the gain from passing things in registers isn't nearly as great as it was then anyway.
Your micro-benchmark produces irrelevant results. __fastcall has specific uses with SSE instructions (see XNAMath), clock() is not even remotely a suitable timer for benchmarking, and __fastcall exists for multiple platforms such as Itanium and some others, not just for x86. In addition, your whole program can effectively be optimized down to nothing but the printf statements, making the relative performance of __fastcall and __stdcall very, very irrelevant.
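As an illustration of the timer point, here is a minimal sketch of a higher-resolution, monotonic alternative to clock(). The helper name now_seconds is hypothetical (it is not part of the question's code), and the choice of QueryPerformanceCounter on Windows and clock_gettime(CLOCK_MONOTONIC) elsewhere is an assumption about which timers are available on the target system.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 199309L /* ask for clock_gettime on POSIX systems */
#endif

#include <assert.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: seconds from an arbitrary starting point (Windows). */
double now_seconds(void)
{
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);   /* ticks per second */
    QueryPerformanceCounter(&count);    /* current tick count */
    return (double)count.QuadPart / (double)freq.QuadPart;
}
#else
#include <time.h>

/* Hypothetical helper: seconds from an arbitrary starting point (POSIX). */
double now_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);                    /* the monotonic clock is assumed to exist */
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
#endif

/* Elapsed time between two readings, in milliseconds. */
double elapsed_ms(double start, double end)
{
    return (end - start) * 1000.0;
}
```

To use it, take one reading before the loop and one after it, then pass both to elapsed_ms. Repeating the measurement several times and keeping the minimum is a common way to reduce multitasking noise.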
Finally, you've overlooked the main reason that a lot of things are done the way they are: legacy. __fastcall may well have been significant before compiler inlining became as aggressive and effective as it is today, and no compiler will remove __fastcall, as there will be programs that depend on it. That makes __fastcall a fact of life.
Several reasons:
- At least in most decent x86 implementations, register renaming is in effect, so the effort that looks like it is being saved by using a register instead of memory might not be doing anything at the hardware level.
- Sure, you save some stack movement effort with __fastcall, but you reduce the number of registers available for use in the function without modifying the stack.
Most of the time, where __fastcall would be faster, the function is simple enough to be inlined anyway, which means that it really doesn't matter in real software. (This is one of the main reasons why __fastcall is not often used.)
Side note: What was wrong with Anon's answer?
Fastcall is really only meaningful if you use full optimization (otherwise its effects will be buried by other artifacts), but as you note, with full optimization, the functions will be inlined and you won't see the effect of calling conventions at all.
So to actually test this, you need to make the functions extern declarations, with the actual definitions in a separate source file that you compile separately and link with your main routine. When you do that, you'll see that __fastcall is consistently ~25% faster with small functions like this.
The upshot is that __fastcall is really only useful if you have a lot of calls to tiny functions that can't be inlined because they need to be separately compiled.
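As a rough single-file approximation of that setup, the sketch below marks both functions as non-inlinable instead of moving them to a separate source file. This is a substitute technique, not what was measured above. The macro names (NOINLINE, FASTCALL, STDCALL) and the helpers run_fast and run_std are hypothetical. On targets other than 32-bit x86 the convention macros expand to nothing, so the two functions become identical there.

```c
#include <assert.h>

/* Map the conventions onto what the compiler supports (assumed spellings:
   MSVC keywords, GCC/Clang attributes). Empty on non-x86-32 targets. */
#if defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#  if defined(_M_IX86)
#    define FASTCALL __fastcall
#    define STDCALL  __stdcall
#  else
#    define FASTCALL
#    define STDCALL
#  endif
#elif defined(__GNUC__)
#  define NOINLINE __attribute__((noinline))
#  if defined(__i386__)
#    define FASTCALL __attribute__((fastcall))
#    define STDCALL  __attribute__((stdcall))
#  else
#    define FASTCALL
#    define STDCALL
#  endif
#else
#  define NOINLINE
#  define FASTCALL
#  define STDCALL
#endif

/* External linkage plus noinline: the compiler has to emit real calls. */
NOINLINE int FASTCALL func_fast(int i) { return i + 5; }
NOINLINE int STDCALL  func_std(int i)  { return i + 5; }

/* Hypothetical driver: `calls` chained calls through the fastcall function. */
int run_fast(long calls)
{
    int x = 0;
    assert(calls >= 0);
    for (long n = 0; n < calls; n++)
        x = func_fast(x & 0xFF);   /* each call depends on the previous result */
    return x;
}

/* Hypothetical driver: the same loop through the stdcall function. */
int run_std(long calls)
{
    int x = 0;
    assert(calls >= 0);
    for (long n = 0; n < calls; n++)
        x = func_std(x & 0xFF);
    return x;
}
```

Timing run_fast and run_std with the same call count gives the comparison. Whole-program or link-time optimization may still see through noinline, so separate compilation as described above remains the more robust setup.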
Edit
So with separate compilation and gcc -O3 -fomit-frame-pointer -m32, I see quite different code for the two functions:
func:
leal 5(%ecx), %eax
ret
func2:
movl 4(%esp), %eax
addl $5, %eax
ret
Running that with iter=5000 consistently gives me results close to
9990000
14160000
indicating that the fastcall version is a shade over 40% faster.
I compiled the two functions with i686-w64-mingw32-gcc -O2 -fno-inline fastcall.c. This is the assembly generated for func and func2:
@func@4:
leal 5(%ecx), %eax
ret
_func2@4:
movl 4(%esp), %eax
addl $5, %eax
ret $4
__fastcall really looks faster to me. func2 needs to load the input parameter from the stack, whereas func can simply perform %eax := %ecx + 5 and then return to the caller.
Furthermore, the output of your program is typically like this on my system:
2560
3250
154
So __fastcall does not only look faster, it is faster.
Also note that on x86_64 (or x64, as Microsoft calls it), __fastcall is the default and the old non-fastcall convention does not exist anymore. http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions
By making __fastcall the default, x86_64 catches up with other architectures (such as ARM), where passing arguments in registers is also default.
Fastcall itself, as a register-based calling convention, isn't great on x86 because there aren't that many named registers available. By using key registers for passing the values, all you're doing is potentially forcing the calling code to push other values onto the stack, and forcing the called function, if it is of sufficient complexity, to do the same. Essentially, from an assembly language perspective, you're increasing the pressure on those named registers and explicitly using stack operations to compensate. So even if the CPU has far more registers available for renaming, it isn't going to refactor the explicit stack operations that have to be inserted.
On the other hand, on more "register rich" architectures like x86-64, register-based calling conventions (not exactly the same as the fastcall of old, but the same concept) are the norm and are used across the board. In other words, once we moved from an architecture with only a few named registers, like x86, to something with more register space, fastcall was back in a big way and became the default, and really the only convention used today.
Note: even though it was edited in May 2017 by the OP, this question and its answers are likely to be way out of date and no longer relevant by 2019 (if not a few years earlier).
A) As of at least MSVC 2017 (and the recently released 2019), most of the code is going to be inlined in optimized release builds anyhow. Probably the only function body you will see in the entire example now is "_tmain()".
That is, unless you specifically do some tricks like declaring the functions as "volatile" and/or wrapping the test functions in pragmas that turn off some optimizations.
B) The latest generation of desktop CPUs (the assumption here) is much improved over the circa 2010 generation. They are much better at caching the stack, memory alignment matters less, etc.
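One way to read the "volatile" trick from point A is to call the test function through a volatile function pointer, so the optimizer cannot assume which function is being called and therefore cannot inline it. The sketch below is an interpretation, not necessarily what was meant above. The names add_five, add_five_ptr and chained_calls are hypothetical.

```c
#include <assert.h>

/* The function under test (same body as in the question). */
static int add_five(int i)
{
    return i + 5;
}

/* A volatile pointer must be re-read before every call, so the compiler
   cannot prove the target is add_five and cannot inline the call. */
static int (*volatile add_five_ptr)(int) = add_five;

/* Hypothetical driver: `calls` chained indirect calls, as in the benchmark. */
int chained_calls(long calls)
{
    int x = 0;
    assert(calls >= 0);
    for (long n = 0; n < calls; n++)
        x = add_five_ptr(x & 0xFF);   /* indirect call on every iteration */
    return x;
}
```

Note that this measures an indirect call, which carries its own small cost, so both conventions should be tested through the same kind of pointer to keep the comparison even.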
But don't take my word for it. Load up your executable in a disassembler (IDA Pro, MSVC debugger, etc.) and look for yourself (a good way to learn).
Now it would be interesting to see what the performance would be over a large 32-bit application. For example, take the last open-sourced DOOM game release, make builds with __stdcall and __fastcall, and look for framerate differences. Also gather metrics from any built-in performance reporting features it has.
It does not appear that __fastcall actually guarantees that the code will be faster. It seems like all you're doing is moving the first few variables into registers before making the call to the function. This most likely makes your function call slower, since it must move the variables into those registers first. Wikipedia has a pretty good write-up about what exactly fastcall is and how it is implemented.