In a program I wrote, 20% of the time is being spent on finding out the minimum of 3 numbers in an inner loop, in this routine:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
unsigned int m = a;
if (m > b) m = b;
if (m > c) m = c;
return m;
}
Is there any way to speed this up? I am ok with assembly code too for x86/x86_64.
Edit: In reply to some of the comments:
* Compiler being used is gcc 4.3.3 * As far as assembly is concerned, I am only a beginner there. I asked for assembly here, to learn how to do this. :) * I have a quad-core Intel 64 running, so MMX/SSE etc. are supported. * It's hard to post the loop here, but I can tell you it's a heavily optimized implementation of the levenshtein algorithm.This is what the compiler is giving me for the non-inlined version of min:
.globl min
.type min, @function
min:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %eax
movl 16(%ebp), %ecx
cmpl %edx, %eax
jbe .L2
movl %edx, %eax
.L2:
cmpl %ecx, %eax
jbe .L3
movl %ecx, %eax
.L3:
popl %ebp
ret
.size min, .-min
.ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
.section .note.GNU-stack,"",@progbits
The inlined version is within -O2 optimized code (even my markers mrk = 0xfefefefe, before and after the call to min()) are getting optimized away by gcc, so I couldn't get hold of it.
Update: I tested the changes suggested by Nils, ephemient, however there's no perceptible performance boost I get by using the assembly versions of min(). However, I get a 12.5% boost by compiling the program with -march=i686, which I guess is because the whole program is getting the benefits of the new faster instructions that gcc is generating with this option. Thanks for your help guys.
P.S. - I used the ruby profiler to measure performance (my C program is a shared library lo开发者_Python百科aded by a ruby program), so I could get time spent only for the top-level C function called by the ruby program, which ends up calling min() down the stack. Please see this question.
Make sure you are using an appropriate -march
setting, first off. GCC defaults to not using any instructions that were not supported on the original i386 - allowing it to use newer instruction sets can make a BIG difference at times! On -march=core2 -O2
I get:
min:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %ecx
movl 16(%ebp), %eax
cmpl %edx, %ecx
leave
cmovbe %ecx, %edx
cmpl %eax, %edx
cmovbe %edx, %eax
ret
The use of cmov here may help you avoid branch delays - and you get it without any inline asm just by passing in -march
. When inlined into a larger function this is likely to be even more efficient, possibly just four assembly operations. If you need something faster than this, see if you can get the SSE vector operations to work in the context of your overall algorithm.
Assuming your compiler isn't out to lunch, this should compile down to two compares and two conditional moves. It isn't possible to do much better than that.
If you post the assembly that your compiler is actually generating, we can see if there's anything unnecessary that's slowing it down.
The number one thing to check is that the routine is actually getting inlined. The compiler isn't obligated to do so, and if it's generating a function call, that will be hugely expensive for such a simple operation.
If the call really is getting inlined, then loop unrolling may be beneficial, as DigitalRoss said, or vectorization may be possible.
Edit: If you want to vectorize the code, and are using a recent x86 processor, you will want to use the SSE4.1 pminud
instruction (intrinsic: _mm_min_epu32
), which takes two vectors of four unsigned ints each, and produces a vector of four unsigned ints. Each element of the result is the minimum of the corresponding elements in the two inputs.
I also note that your compiler used branches instead of conditional moves; you should probably try a version that uses conditional moves first and see if that gets you any speedup before you go off to the races on a vector implementation.
This drop-in replacement clocks in about 1.5% faster on my AMD Phenom:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
asm("cmp %1,%0\n"
"cmova %1,%0\n"
"cmp %2,%0\n"
"cmova %2,%0\n"
: "+r" (a) : "r" (b), "r" (c));
return a;
}
Results may vary; some x86 processors don't handle CMOV very well.
My take on an x86 assembler implementation, GCC syntax. Should be trivial to translate to another inline assembler syntax:
int inline least (int a, int b, int c)
{
int result;
__asm__ ("mov %1, %0\n\t"
"cmp %0, %2\n\t"
"cmovle %2, %0\n\t"
"cmp %0, %3\n\t"
"cmovle %3, %0\n\t"
: "=r"(result) :
"r"(a), "r"(b), "r"(c)
);
return result;
}
New and improved version:
int inline least (int a, int b, int c)
{
__asm__ (
"cmp %0, %1\n\t"
"cmovle %1, %0\n\t"
"cmp %0, %2\n\t"
"cmovle %2, %0\n\t"
: "+r"(a) :
"%r"(b), "r"(c)
);
return a;
}
NOTE: It may or may not be faster than C code.
This depends on a lot of factors. Usually cmov wins if the branches are not predictable (on some x86 architectures) OTOH inline assembler is always a problem for the optimizer, so the optimization penalty for the surrounding code may outweight all gains..
Btw Sudhanshu, it would be interesting to hear how this code performs with your testdata.
The SSE2 instruction extensions contain an integer min
instruction that can choose 8 minimums at a time. See _mm_mulhi_epu16
in http://www.intel.com/software/products/compilers/clin/docs/ug_cpp/comm1046.htm
First, look at the disassembly. That'll tell you a lot. For example, as written, there are 2 if-statements (which means there are 2 possible branch mispredictions), but my guess is that a decent modern C compiler will have some clever optimization that can do it without branching. I'd be curious to find out.
Second, if your libc has special built-in min/max functions, use them. GNU libc has fmin/fmax for floating-point, for example, and they claim that "On some processors these functions can use special machine instructions to perform these operations faster than the equivalent C code". Maybe there's something similar for uints.
Finally, if you're doing this to a bunch of numbers in parallel, there are probably vector instructions to do this, which could provide significant speedup. But I've even seen non-vector code be faster when using vector units. Something like "load one uint into a vector register, call vector min function, get result out" looks dumb but might actually be faster.
If you are only doing one comparison you might want to unroll the loop manually.
First, see if you can get the compiler to unroll the loop for you, and if you can't, do it yourself. This will at least reduce the loop control overhead...
You could try something like this to save on declaration and unnecessary comparisons:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
if (a < b)
{
if (a < c)
return a;
else
return c;
}
if (b < c)
return b;
else return c;
}
These are all good answers. At the risk of being accused of not answering the question, I would also look at the other 80% of the time. Stackshots are my favorite way to find code worth optimizing, especially if it is function calls that you find out you don't absolutely need.
Yes, post assembly, but my naive optimization is:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
unsigned int m = a;
if (m > b) m = b;
if (m > c) return c;
return m;
}
精彩评论