Why is fastcall slower than stdcall?

Developer https://www.devze.com 2023-02-20 11:27 Source: Internet

I found the following question: Is fastcall really faster?

No clear answers for x86 were given, so I decided to create a benchmark.

Here is the code:

#include <stdio.h>
#include <tchar.h>
#include <time.h>

int __fastcall func(int i)
{   
    return i + 5;
}

int __stdcall func2(int i)
{   
    return i + 5;
}

int _tmain(int argc, _TCHAR* argv[])
{
    int iter = 100;
    int x = 0;
    clock_t t = clock();
    for (int j = 0; j <= iter;j++)
        for (int i = 0; i <= 1000000;i++)
            x = func(x & 0xFF);
    printf("%ld\n", (long)(clock() - t));
    t = clock();
    for (int j = 0; j <= iter;j++)
        for (int i = 0; i <= 1000000;i++)
            x = func2(x & 0xFF);
    printf("%ld\n", (long)(clock() - t));
    printf("%d", x);
    return 0;
}

With no optimization, the result in MSVC 10 is:

4671
4414

With maximum optimization, fastcall is sometimes faster, but I guess that is multitasking noise. Here is the average result (with iter = 5000):

6638
6487

stdcall looks faster!

Here are the results for GCC: http://ideone.com/hHcfP Again, fastcall lost the race.

Here is part of the disassembly for fastcall:

011917EF  pop         ecx  
011917F0  mov         dword ptr [ebp-8],ecx  
    return i + 5;
011917F3  mov         eax,dword ptr [i]  
011917F6  add         eax,5

And this is for stdcall:

    return i + 5;
0119184E  mov         eax,dword ptr [i]  
01191851  add         eax,5  

i is passed via ECX instead of the stack, but it is saved to the stack in the function body! So the whole benefit is negated. This simple function could be computed using only registers, yet there is no real difference between the two versions.

Can anyone explain what the reason for fastcall is? Why doesn't it give a speedup?

Edit: With optimization, it turned out that both functions are inlined. When I turned inlining off, they were both compiled to:

00B71000  add         eax,5  
00B71003  ret  

This looks like a great optimization indeed, but it doesn't respect the calling conventions at all, so the test is not fair.


__fastcall was introduced a long time ago. At the time, Watcom C++ was beating Microsoft for optimization, and a number of reviewers picked out its register-based calling convention as one (possible) reason why.

Microsoft responded by adding __fastcall, and they've retained it ever since -- but I don't think they ever did much more than enough to be able to say "we have a register-based calling convention too..." Their preference (especially since the 32-bit migration) seems to be for __stdcall. They've put quite a bit of work into improving their code generation with it, but (apparently) not nearly so much with __fastcall. With on-chip caching, the gain from passing things in registers isn't nearly as great as it was then anyway.


Your micro-benchmark produces irrelevant results. __fastcall has specific uses with SSE instructions (see XNAMath), and clock() is not even remotely a suitable timer for benchmarking. __fastcall also exists for multiple platforms, such as Itanium and some others, not just for x86. In addition, your whole program can effectively be optimized down to nothing but the printf statements, which makes the relative performance of __fastcall and __stdcall very, very irrelevant.

Finally, you've overlooked the main reason that a lot of things are done the way they are: legacy. __fastcall may well have been significant before compiler inlining became as aggressive and effective as it is today, and no compiler will remove __fastcall, as there will be programs that depend on it. That makes __fastcall a fact of life.


Several reasons:

  1. At least in most decent x86 implementations, register renaming is in effect -- the effort that looks like it's being saved by using a register instead of memory might not be doing anything at the hardware level.
  2. Sure, you save some stack movement effort with __fastcall, but you reduce the number of registers available for use in the function without modifying the stack.

Most of the time, where __fastcall would be faster, the function is simple enough to be inlined in any case, which means that it really doesn't matter in real software. (This is one of the main reasons why __fastcall is not often used.)

Side note: What was wrong with Anon's answer?


Fastcall is really only meaningful if you use full optimization (otherwise its effects will be buried by other artifacts), but as you note, with full optimization, the functions will be inlined and you won't see the effect of calling conventions at all.

So to actually test this, you need to make the functions extern declarations with the actual definitions in a separate source file that you compile separately and link with your main routine. When you do that, you'll see that __fastcall is consistently ~25% faster with small functions like this.

The upshot is that __fastcall is really only useful if you have a lot of calls to tiny functions that can't be inlined because they need to be separately compiled.

Edit

So with separate compilation and gcc -O3 -fomit-frame-pointer -m32 I see quite different code for the two functions:

func:
    leal    5(%ecx), %eax
    ret
func2:
    movl    4(%esp), %eax
    addl    $5, %eax
    ret

Running that with iter=5000 consistently gives me results close to

9990000
14160000

indicating that the fastcall version is a shade over 40% faster.


I compiled the two functions with i686-w64-mingw32-gcc -O2 -fno-inline fastcall.c. This is the assembly generated for func and func2:

@func@4:
    leal    5(%ecx), %eax
    ret
_func2@4:
    movl    4(%esp), %eax
    addl    $5, %eax
    ret $4

__fastcall really looks faster to me. func2 needs to load the input parameter from the stack. func can simply perform a %eax := %ecx + 5 and then returns to the caller.

Furthermore, the output of your program is typically like this on my system:

2560
3250
154

So __fastcall does not only look faster, it is faster.

Also note that on x86_64 (or x64 as Microsoft calls it), __fastcall is the default and the old non-fastcall convention does not exist anymore. http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions

By making __fastcall the default, x86_64 catches up with other architectures (such as ARM), where passing arguments in registers is also default.


Fastcall itself, as a register-based calling convention, isn't great on x86 because there aren't that many named registers available. By using key registers for passing the values, all you're doing is potentially forcing the calling code to push other values onto the stack, and forcing the called function, if it is of sufficient complexity, to do the same. Essentially, from an assembly language perspective, you're increasing the pressure on those named registers and explicitly using stack operations to compensate. So even if the CPU has far more registers available for renaming, it isn't going to refactor the explicit stack operations that have to be inserted.

On the other hand, on more "register rich" architectures like x86-64, register-based calling conventions (not exactly the same as the fastcall of old, but the same concept) are the norm and are used across the board. In other words, once we moved from an architecture with few named registers, like x86, to something with more register space, fastcall was back in a big way and became the default, and really the only convention used today.


Note: even though it was edited in May 2017 by the OP, this question and its answers are likely to be way out of date and no longer relevant by 2019 (if not a few years earlier).

A) As of at least MSVC 2017 (and 2019, released recently), most of the code is going to be inlined in optimized release builds anyhow. Probably the only function body you will see in the entire example now is "_tmain()".

That is unless you specifically do some tricks like declaring the functions as "volatile" and/or wrapping the test functions in pragmas that turn off some optimizations.

B) The latest generation of desktop CPUs (the assumption here) is much improved over the circa-2010 generation. They are much better at caching the stack, memory alignment matters less, etc.

But don't take my word for it. Load up your executable in a disassembler (IDA Pro, MSVC debugger, etc.) and look for yourself (a good way to learn).

Now it would be interesting to see what the performance would be over a large 32-bit application. For example, take the last open-sourced DOOM game release, make builds with __stdcall and __fastcall, and look for frame-rate differences. Also get metrics from any built-in performance reporting features it has.


It does not appear that __fastcall actually guarantees that the call will be faster. It seems that all you're doing is moving the first few arguments into registers before making the call to the function. This most likely makes your function call slower, since it must move the variables into those registers first. Wikipedia has a pretty good write-up about what exactly fastcall is and how it is implemented.
