I am reading the source code of glibc 2.9. Looking at the implementation of strcpy, I expected it to perform better than it does. The following is the source code of strcpy in glibc 2.9:
char * strcpy (char *dest, const char* src)
{
    reg_char c;
    char *__unbounded s = (char *__unbounded) CHECK_BOUNDS_LOW (src);
    const ptrdiff_t off = CHECK_BOUNDS_LOW (dest) - s - 1;
    size_t n;

    do {
        c = *s++;
        s[off] = c;
    }
    while (c != '\0');

    n = s - src;
    (void) CHECK_BOUNDS_HIGH (src + n);
    (void) CHECK_BOUNDS_HIGH (dest + n);

    return dest;
}
Because I don't know the reason for using the offset, I did some performance tests by comparing the above code with the following code:
char* my_strcpy(char *dest, const char *src)
{
    char *d = dest;
    register char c;

    do {
        c = *src++;
        *d++ = c;
    } while ('\0' != c);

    return dest;
}
In my tests, the glibc version of strcpy turned out to be slower. (I removed the bounded-pointer code before testing.)
Why does the glibc version use an offset?
Details of the test setup:
- Platform: x86 (Intel(R) Pentium(R) 4), gcc version 4.4.2
- Compile flags: none, because I don't want any optimisation. The command is gcc test.c.
The test code I used is the following:
#include <stdio.h>
#include <stdlib.h>

char* my_strcpy1(char *dest, const char *src)
{
    char *d = dest;
    register char c;

    do {
        c = *src++;
        *d++ = c;
    } while ('\0' != c);

    return dest;
}

/* Copy SRC to DEST. */
char *
my_strcpy2 (char *dest, const char *src)
{
    register char c;
    char * s = (char *)src;
    const int off = dest - s - 1;

    do
    {
        c = *s++;
        s[off] = c;
    }
    while (c != '\0');

    return dest;
}

int main()
{
    const char str1[] = "test1";
    const char str2[] = "test2";
    char buf[100];
    int i;

    for (i = 0; i < 10000000; ++i) {
        my_strcpy1(buf, str1);
        my_strcpy1(buf, str2);
    }

    return 0;
}
When using the my_strcpy1
function, the outputs are:
[root@Lnx99 test]#time ./a.out
real 0m0.519s
user 0m0.517s
sys 0m0.001s
[root@Lnx99 test]#time ./a.out
real 0m0.520s
user 0m0.520s
sys 0m0.001s
[root@Lnx99 test]#time ./a.out
real 0m0.519s
user 0m0.516s
sys 0m0.002s
When using the my_strcpy2 function, the outputs are:
[root@Lnx99 test]#time ./a.out
real 0m0.647s
user 0m0.647s
sys 0m0.000s
[root@Lnx99 test]#time ./a.out
real 0m0.642s
user 0m0.638s
sys 0m0.001s
[root@Lnx99 test]#time ./a.out
real 0m0.639s
user 0m0.638s
sys 0m0.002s
I know that measuring with the time command is not very accurate, but the user time is enough to show the difference.
Update:
To remove the cost used to calculate the offset, I removed some code and added a global variable.
#include <stdio.h>
#include <stdlib.h>

char* my_strcpy1(char *dest, const char *src)
{
    char *d = dest;
    register char c;

    do {
        c = *src++;
        *d++ = c;
    } while ('\0' != c);

    return dest;
}

int off;

/* Copy SRC to DEST. */
char *
my_strcpy2 (char *dest, const char *src)
{
    register char c;
    char * s = (char *)src;

    do
    {
        c = *s++;
        s[off] = c;
    }
    while (c != '\0');

    return dest;
}

int main()
{
    const char str1[] = "test1test1test1test1test1test1test1test1";
    char buf[100];
    off = buf-str1-1;
    int i;

    for (i = 0; i < 10000000; ++i) {
        my_strcpy2(buf, str1);
    }

    return 0;
}
But the performance of my_strcpy2 is still worse than that of my_strcpy1. I then checked the generated assembly but could not find the reason there either.
I also increased the length of the string, and my_strcpy1 still performs better than my_strcpy2.
It uses the offset method because this eliminates one increment from the loop: the glibc code only has to increment s, whereas your code has to increment both src and d.
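To make the difference concrete, here is a minimal sketch of the two loop shapes side by side. The names copy_two_pointers, copy_with_offset and copies_agree are made up for this illustration and are not glibc functions. One assumption worth stating loudly: ISO C only defines pointer subtraction within a single array, so the sketch keeps source and destination in one buffer; the glibc code subtracts unrelated pointers and relies on the platform tolerating that.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two-pointer copy: src and d are both incremented on every iteration. */
static char *copy_two_pointers(char *dest, const char *src)
{
    char *d = dest;
    char c;
    do {
        c = *src++;
        *d++ = c;
    } while (c != '\0');
    return dest;
}

/* Offset copy: only s is incremented; each store goes to s + off.
   off never changes inside the loop. */
static char *copy_with_offset(char *dest, const char *src)
{
    char *s = (char *)src;
    const ptrdiff_t off = dest - s - 1;  /* -1: s is bumped before the store */
    char c;
    do {
        c = *s++;
        s[off] = c;
    } while (c != '\0');
    return dest;
}

/* Hypothetical helper: copies text with both loops and reports whether
   both results equal the input. Source and destinations share one array
   so the pointer subtraction above stays inside a single object. */
static int copies_agree(const char *text)
{
    char pool[192];
    char *src = pool, *dst_a = pool + 64, *dst_b = pool + 128;

    assert(strlen(text) < 64);           /* must fit in a 64-byte slot */
    strcpy(src, text);
    copy_two_pointers(dst_a, src);
    copy_with_offset(dst_b, src);
    return strcmp(dst_a, text) == 0 && strcmp(dst_b, text) == 0;
}
```

Both loops copy the same bytes; the only difference is that the second one does a single pointer update per character and pays for computing off once before the loop starts.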
Note that the code you're looking at is the architecture-independent fallback implementation. glibc has overriding assembly implementations for many architectures (e.g. the x86-64 strcpy()).
Based on what I'm seeing, I'm not at all surprised that your code is faster.
Look at the loops: your loop and glibc's loop are virtually identical, but glibc's has extra code before and after it.
In general, simple offsets do not slow down performance because x86 allows a fairly complicated indirect-addressing scheme. So both loops here will probably run at identical speeds.
EDIT: Here's my update with the added info you gave.
Your string size is only 5 characters. Even though the offset method "may" be slightly faster in the long run, the fact that it needs several operations to compute the offset before starting the loop slows it down for short strings. If you try larger strings, the gap may narrow and possibly vanish altogether.
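One way to check whether setup cost is really what you are seeing is to time the same function at several string lengths from inside the program instead of with the shell's time command. The following is only a sketch: seconds_for, copy_fn and SLOT are names invented here, and the buffer size is an arbitrary assumption. It uses the standard clock() function, which reports CPU time at fairly coarse resolution, so the repetition count needs to be large.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by strcpy and the my_strcpy1/my_strcpy2 variants. */
typedef char *(*copy_fn)(char *, const char *);

enum { SLOT = 4096 };  /* assumed upper bound on string length + 1 */

/* Hypothetical helper: CPU seconds spent on reps copies of a string of
   len characters. Source and destination live in one array so that
   offset-based variants subtract pointers within a single object. */
static double seconds_for(copy_fn fn, size_t len, long reps)
{
    static char pool[2 * SLOT];
    char *src = pool, *dst = pool + SLOT;
    volatile char sink = 0;  /* reads each result so the copies stay observable */
    clock_t start, stop;
    long i;

    assert(len < SLOT);
    memset(src, 'x', len);
    src[len] = '\0';

    start = clock();
    for (i = 0; i < reps; ++i) {
        fn(dst, src);
        sink ^= dst[len ? len - 1 : 0];
    }
    stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling seconds_for(my_strcpy1, n, reps) and seconds_for(my_strcpy2, n, reps) for, say, n = 5, 40 and 4000 and dividing each result by n * reps gives a rough per-byte cost. If the one-off offset computation is the cause, the two figures should move closer together as n grows; if they do not, the difference is coming from the loop body itself.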
Here is my own optimization of strcpy. I think it achieved a 2x-3x speedup over a naive implementation, but it still needs to be benchmarked.
https://codereview.stackexchange.com/questions/30337/x86-strcpy-can-this-be-shortened/30348#30348