When can I get better performance using memcpy, or how do I benefit from using it?
For example:
float a[3]; float b[3];
Is this code:
memcpy(a, b, 3*sizeof(float));
faster than this one?
a[0] = b[0];
a[1] = b[1];
a[2] = b[2];
Efficiency should not be your concern.
Write clean maintainable code.
It bothers me that so many answers indicate that memcpy() is inefficient. It is designed to be the most efficient way to copy blocks of memory (for C programs).
So I wrote the following as a test:
#include <algorithm> // std::copy
#include <cstring>   // memcpy
extern float a[3];
extern float b[3];
extern void base();
int main()
{
base();
#if defined(M1)
a[0] = b[0];
a[1] = b[1];
a[2] = b[2];
#elif defined(M2)
memcpy(a, b, 3*sizeof(float));
#elif defined(M3)
std::copy(&a[0], &a[3], &b[0]); // note: copies a into b, the reverse of M1 and M2
#endif
base();
}
Then, to compare the code produced:
g++ -O3 -S xr.cpp -o s0.s
g++ -O3 -S xr.cpp -o s1.s -DM1
g++ -O3 -S xr.cpp -o s2.s -DM2
g++ -O3 -S xr.cpp -o s3.s -DM3
echo "=======" > D
diff s0.s s1.s >> D
echo "=======" >> D
diff s0.s s2.s >> D
echo "=======" >> D
diff s0.s s3.s >> D
This resulted in: (comments added by hand)
======= // Copy by hand
10a11,18
> movq _a@GOTPCREL(%rip), %rcx
> movq _b@GOTPCREL(%rip), %rdx
> movl (%rdx), %eax
> movl %eax, (%rcx)
> movl 4(%rdx), %eax
> movl %eax, 4(%rcx)
> movl 8(%rdx), %eax
> movl %eax, 8(%rcx)
======= // memcpy()
10a11,16
> movq _a@GOTPCREL(%rip), %rcx
> movq _b@GOTPCREL(%rip), %rdx
> movq (%rdx), %rax
> movq %rax, (%rcx)
> movl 8(%rdx), %eax
> movl %eax, 8(%rcx)
======= // std::copy()
10a11,14
> movq _a@GOTPCREL(%rip), %rsi
> movl $12, %edx
> movq _b@GOTPCREL(%rip), %rdi
> call _memmove
Added: timing results for running the above inside a loop of 1,000,000,000 iterations.
g++ -c -O3 -DM1 X.cpp
g++ -O3 X.o base.o -o m1
g++ -c -O3 -DM2 X.cpp
g++ -O3 X.o base.o -o m2
g++ -c -O3 -DM3 X.cpp
g++ -O3 X.o base.o -o m3
time ./m1
real 0m2.486s
user 0m2.478s
sys 0m0.005s
time ./m2
real 0m1.859s
user 0m1.853s
sys 0m0.004s
time ./m3
real 0m1.858s
user 0m1.851s
sys 0m0.006s
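The loop used for these timings is not shown, so the following is only a sketch of how such a measurement could be set up inside the program itself rather than with the shell's time command. The name time_loop is hypothetical. Note that the original test calls an external base() around the copy; without some opaque call like that, an optimizer is free to hoist or drop the copy, so numbers from a harness like this need to be checked against the generated assembly.

```cpp
#include <cassert>
#include <chrono>
#include <cstring>

// Hypothetical helper (not from the answer): runs body() `iterations` times
// and returns the elapsed wall-clock time in seconds.
template <typename Body>
double time_loop(long iterations, Body body)
{
    assert(iterations >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        body(); // the copy under test, e.g. memcpy or element-wise assignment
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

It could be called as time_loop(1000000000L, [&] { std::memcpy(a, b, sizeof a); }) for the memcpy variant, with the assignment variant passed the same way.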
You can use memcpy only if the objects you're copying have no explicit constructors, and neither do their members (so-called POD, "Plain Old Data"; in current C++ terms, trivially copyable types). So it is OK to call memcpy for float, but it is wrong for, e.g., std::string.
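One way to enforce that rule at compile time is sketched below. The helper name copy_trivial is made up for illustration; the check itself uses std::is_trivially_copyable from <type_traits>, which is the standard trait for "safe to memcpy".

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical helper (not from the answer): copies n elements with memcpy,
// but refuses to compile for element types that are not trivially copyable.
template <typename T>
void copy_trivial(T* dst, const T* src, std::size_t n)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    assert(n == 0 || (dst != nullptr && src != nullptr));
    std::memcpy(dst, src, n * sizeof(T));
}

// float qualifies; std::string does not, because it owns its buffer.
static_assert(std::is_trivially_copyable<float>::value,
              "float can be copied with memcpy");
static_assert(!std::is_trivially_copyable<std::string>::value,
              "std::string must not be copied with memcpy");
```

With this in place, copy_trivial(a, b, 3) works for the float arrays from the question, while the same call on arrays of std::string fails to compile instead of corrupting memory at run time.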
But part of the work has already been done for you: std::copy from <algorithm> is specialized for built-in types (and possibly for every other POD type, depending on the STL implementation). So writing std::copy(a, a + 3, b) is as fast (after compiler optimization) as memcpy, but is less error-prone.
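To illustrate the "less error-prone" part, here is a small sketch in which the same std::copy call handles both a float array and a std::string array. The names copy_array and demo_copy are hypothetical; whether the float case is lowered to memmove depends on the library and optimization level.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copies a fixed-size array with std::copy.
// The element count comes from the array type, so there is no sizeof to get wrong.
template <typename T, std::size_t N>
void copy_array(const T (&src)[N], T (&dst)[N])
{
    std::copy(src, src + N, dst);
}

// Small demo: the same call is correct for a built-in type and for a class type.
void demo_copy()
{
    const float fb[3] = {1.0f, 2.0f, 3.0f};
    float fa[3] = {};
    copy_array(fb, fa);                  // eligible for a memmove-style copy
    assert(std::equal(fb, fb + 3, fa));

    const std::string sb[3] = {"x", "yy", "zzz"};
    std::string sa[3];
    copy_array(sb, sa);                  // uses std::string copy assignment
    assert(std::equal(sb, sb + 3, sa));
}
```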
Compilers specifically optimize memcpy calls; at least clang and gcc do. So you should prefer it wherever you can.
Use std::copy(). As the header file for g++ notes:
"This inline function will boil down to a call to @c memmove whenever possible."
Visual Studio's implementation is probably not much different. Go with the normal way, and optimize once you're aware of a bottleneck. In the case of a simple copy, the compiler is probably already optimizing for you.
Don't go for premature micro-optimisations such as using memcpy like this. Using assignment is clearer and less error-prone and any decent compiler will generate suitably efficient code. If, and only if, you have profiled the code and found the assignments to be a significant bottleneck then you can consider some kind of micro-optimisation, but in general you should always write clear, robust code in the first instance.
The benefits of memcpy? Probably readability. Otherwise, you would have to either do a number of assignments or have a for loop for copying, neither of which is as simple and clear as just doing memcpy (of course, as long as your types are simple and don't require construction/destruction).
Also, memcpy is generally relatively optimized for specific platforms, to the point that it won't be all that much slower than simple assignment, and may even be faster.
Supposedly, as Nawaz said, the assignment version should be faster on most platforms. That's because memcpy() will copy byte by byte while the second version could copy 4 bytes at a time.
As is always the case, you should profile the application to be sure that what you expect to be the bottleneck matches reality.
Edit
The same applies to dynamic arrays. Since you mention C++, you should use the std::copy() algorithm in that case.
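For the dynamic case, a minimal sketch could look like the following. The function name copy_dynamic and the use of std::vector for the destination are assumptions made for the example, not something the question specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies n floats from a dynamically sized buffer
// into a freshly allocated std::vector using std::copy.
std::vector<float> copy_dynamic(const float* src, std::size_t n)
{
    assert(n == 0 || src != nullptr);
    std::vector<float> dst(n);
    std::copy(src, src + n, dst.begin());
    return dst;
}
```

The size is passed in elements rather than bytes, which removes the sizeof(float) * n arithmetic that memcpy requires.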
Edit
This is the code and its output on Windows XP with GCC 4.5.0, compiled with the -O3 flag:
extern "C" void cpy(float* d, float* s, size_t n)
{
memcpy(d, s, sizeof(float)*n);
}
I wrote this function because the OP mentioned dynamic arrays too.
Output assembly is the following:
_cpy:
LFB393:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
pushl %edi
LCFI2:
pushl %esi
LCFI3:
movl 8(%ebp), %eax
movl 12(%ebp), %esi
movl 16(%ebp), %ecx
sall $2, %ecx
movl %eax, %edi
rep movsb
popl %esi
LCFI4:
popl %edi
LCFI5:
leave
LCFI6:
ret
Of course, I assume all of the experts here know what rep movsb means.
This is the assignment version:
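extern "C" void cpy2(float* d, float* s, size_t n)
{
// Caution: this indexes n down to 1, so d[0] is never written and d[n] is
// one past the end. A correct loop decrements first:
//   while (n-- > 0) d[n] = s[n];
while (n > 0) {
d[n] = s[n];
n--;
}
}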
which yields the following code:
_cpy2:
LFB394:
pushl %ebp
LCFI7:
movl %esp, %ebp
LCFI8:
pushl %ebx
LCFI9:
movl 8(%ebp), %ebx
movl 12(%ebp), %ecx
movl 16(%ebp), %eax
testl %eax, %eax
je L2
.p2align 2,,3
L5:
movl (%ecx,%eax,4), %edx
movl %edx, (%ebx,%eax,4)
decl %eax
jne L5
L2:
popl %ebx
LCFI10:
leave
LCFI11:
ret
Which moves 4 bytes at a time.
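As a side note on the loop itself, the sketch below shows an element-wise copy that covers indices n-1 down to 0, together with a self-check that it produces the same bytes as memcpy. The names cpy2_fixed and matches_memcpy are hypothetical, and the assembly a compiler generates for it has not been measured here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical variant of the assignment loop: decrement before indexing,
// so elements n-1 .. 0 are copied and nothing out of range is touched.
extern "C" void cpy2_fixed(float* d, const float* s, std::size_t n)
{
    assert(n == 0 || (d != nullptr && s != nullptr));
    while (n > 0) {
        --n;
        d[n] = s[n];
    }
}

// Self-check: the loop and memcpy should leave identical bytes in the destination.
bool matches_memcpy()
{
    const float src[4] = {1.5f, -2.0f, 3.25f, 0.0f};
    float by_loop[4] = {};
    float by_memcpy[4] = {};
    cpy2_fixed(by_loop, src, 4);
    std::memcpy(by_memcpy, src, sizeof src);
    return std::memcmp(by_loop, by_memcpy, sizeof src) == 0;
}
```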