开发者

Does gcc automatically "unroll" if-statements?

开发者 https://www.devze.com 2023-02-11 01:27 出处:网络
Say I have a loop that looks like this: for(int i = 0; i < 10000; i++) { /* Do something computationally expensive */

Say I have a loop that looks like this:

for(int i = 0; i < 10000; i++) {
    /* Do something computationally expensive */
    if (i < 200 && !(i%20)) {
        /* Do something else */
    }
}

wherein some trivial task gets stuck behind an if-statement that only runs a handful of times. I've always heard that "if-statements in loops are slow!" So, in the hopes of (marginally) increased performance, I split the loops apart into:

for(int i = 0; i < 200; i++) {
    /* Do something computationally expensive */
    if (!(i%20)) {
        /* Do something else */
    }
}

for(int i = 200; i < 10000; i++) {
    /* Do something computationally expensive */
}

Will gcc (with the appropriate flags, like -O3) automatically break开发者_如何学运维 the one loop into two, or does it only unroll to decrease the number of iterations?


Why not just disassemble the program and see for yourself? But here we go. This is the testprogram:

int main() {
    int sum = 0;
    int i;
    for(i = 0; i < 10000; i++) {
        if (i < 200 && !(i%20)) {
            sum += 0xC0DE;
        }
        sum += 0xCAFE;
    }
    printf("%d\n", sum);
    return 0;
}

and this is the interesting part of the disassembled code compiled with gcc 4.3.3 and -o3:

0x08048404 <main+20>:   xor    ebx,ebx
0x08048406 <main+22>:   push   ecx
0x08048407 <main+23>:   xor    ecx,ecx
0x08048409 <main+25>:   sub    esp,0xc
0x0804840c <main+28>:   lea    esi,[esi+eiz*1+0x0]
0x08048410 <main+32>:   cmp    ecx,0xc7
0x08048416 <main+38>:   jg     0x8048436 <main+70>
0x08048418 <main+40>:   mov    eax,ecx
0x0804841a <main+42>:   imul   esi
0x0804841c <main+44>:   mov    eax,ecx
0x0804841e <main+46>:   sar    eax,0x1f
0x08048421 <main+49>:   sar    edx,0x3
0x08048424 <main+52>:   sub    edx,eax
0x08048426 <main+54>:   lea    edx,[edx+edx*4]
0x08048429 <main+57>:   shl    edx,0x2
0x0804842c <main+60>:   cmp    ecx,edx
0x0804842e <main+62>:   jne    0x8048436 <main+70>
0x08048430 <main+64>:   add    ebx,0xc0de
0x08048436 <main+70>:   add    ecx,0x1
0x08048439 <main+73>:   add    ebx,0xcafe
0x0804843f <main+79>:   cmp    ecx,0x2710
0x08048445 <main+85>:   jne    0x8048410 <main+32>
0x08048447 <main+87>:   mov    DWORD PTR [esp+0x8],ebx
0x0804844b <main+91>:   mov    DWORD PTR [esp+0x4],0x8048530
0x08048453 <main+99>:   mov    DWORD PTR [esp],0x1
0x0804845a <main+106>:  call   0x8048308 <__printf_chk@plt>

So as we see, for this particular example, no it does not. We have only one loop starting at main+32 and ending at main+85. If you've got problems reading the assembly code ecx = i; ebx = sum.

But still your mileage may vary - who knows what heuristics are used for this particular case, so you'll have to compile the code you've got in mind and see how longer/more complicated computations influence the optimizer.

Though on any modern CPU the branch predictor will do pretty good on such easy code, so you won't see much performance losses in either case. What's the performance loss of maybe a handful mispredictions if your computation intense code needs billions of cycles?

0

精彩评论

暂无评论...
验证码 换一张
取 消