Why is the performance of strcpy in glibc worse?

I am reading the source code of glibc 2.9. Looking at the strcpy function, I found that its performance is not as good as I expected.

The following is the source code of strcpy in glibc2.9:

    char * strcpy (char *dest, const char* src)
    {
        reg_char c;
        char *__unbounded s = (char *__unbounded) CHECK_BOUNDS_LOW (src);
        const ptrdiff_t off = CHECK_BOUNDS_LOW (dest) - s - 1;
        size_t n;

        do {
            c = *s++;
            s[off] = c;
        }
        while (c != '\0');

        n = s - src;
        (void) CHECK_BOUNDS_HIGH (src + n);
        (void) CHECK_BOUNDS_HIGH (dest + n);

        return dest;
    }

Since I did not understand why the offset is used, I ran some performance tests comparing the code above with the following code:

char* my_strcpy(char *dest, const char *src)
{
    char *d = dest;
    register char c;

    do {
        c = *src++;
        *d++ = c;
    } while ('\0' != c);

    return dest;
}

In my tests, the glibc version of strcpy was slower. (I removed the bounded-pointer code before testing.)

Why does the glibc version use an offset?

Here are the details of the tests.

Platform: x86 (Intel(R) Pentium(R) 4), gcc version 4.4.2
Compile flags: none, because I did not want any optimisation. The command was gcc test.c.

The test code I used is the following:

#include <stdio.h>
#include <stdlib.h>

char* my_strcpy1(char *dest, const char *src)
{
    char *d = dest;
    register char c;

    do {
        c = *src++;
        *d++ = c;
    } while ('\0' != c);

    return dest;
}

/* Copy SRC to DEST. */
char *my_strcpy2(char *dest, const char *src)
{
  register char c;
  char * s = (char *)src;
  const int off = dest - s - 1;

  do
    {
      c = *s++;
      s[off] = c;
    }
  while (c != '\0');

  return dest;
}

int main()
{
    const char str1[] = "test1";
    const char str2[] = "test2";
    char buf[100];

    int i;
    for (i = 0; i < 10000000; ++i) {
        my_strcpy1(buf, str1);
        my_strcpy1(buf, str2);
    }

    return 0;
}

When using my_strcpy1, the output is:

[root@Lnx99 test]#time ./a.out

real    0m0.519s
user    0m0.517s
sys     0m0.001s
[root@Lnx99 test]#time ./a.out

real    0m0.520s
user    0m0.520s
sys     0m0.001s
[root@Lnx99 test]#time ./a.out

real    0m0.519s
user    0m0.516s
sys     0m0.002s

When using my_strcpy2, the output is:

[root@Lnx99 test]#time ./a.out

real    0m0.647s
user    0m0.647s
sys     0m0.000s
[root@Lnx99 test]#time ./a.out

real    0m0.642s
user    0m0.638s
sys     0m0.001s
[root@Lnx99 test]#time ./a.out

real    0m0.639s
user    0m0.638s
sys     0m0.002s

I know the time command is not very accurate, but the user time is enough to show the difference.

Update:

To remove the cost of calculating the offset, I moved that calculation out of the function and into a global variable.

#include <stdio.h>
#include <stdlib.h>

char* my_strcpy1(char *dest, const char *src)
{
    char *d = dest;
    register char c;

    do {
        c = *src++;
        *d++ = c;
    } while ('\0' != c);

    return dest;
}


int off;

/* Copy SRC to DEST. */
char *my_strcpy2(char *dest, const char *src)
{
  register char c;
  char * s = (char *)src;

  do
    {
      c = *s++;
      s[off] = c;
    }
  while (c != '\0');

  return dest;
}

int main()
{
    const char str1[] = "test1test1test1test1test1test1test1test1";
    char buf[100];

    off = buf-str1-1;

    int i;
    for (i = 0; i < 10000000; ++i) {
        my_strcpy2(buf, str1);
    }

    return 0;
}

But my_strcpy2 is still slower than my_strcpy1. I then looked at the generated assembly code, but could not find the reason there either.

I also increased the length of the string, and my_strcpy1 is still faster than my_strcpy2.

It uses the offset method because this eliminates one increment from the loop: the glibc code only has to increment s, whereas your code has to increment both src and d.
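
To make the difference concrete, here is a minimal sketch of the two loop shapes side by side. The names copy_two_pointers and copy_with_offset are made up for illustration and are not glibc identifiers. Note that the offset version subtracts pointers that may point into different objects, which ISO C does not define; the glibc fallback assumes a flat address space where this works.

```c
#include <stddef.h>

/* Hypothetical name: two-pointer loop, same shape as my_strcpy1. */
char *copy_two_pointers(char *dest, const char *src)
{
    char *d = dest;
    char c;

    do {
        c = *src++;     /* first pointer increment per iteration */
        *d++ = c;       /* second pointer increment per iteration */
    } while (c != '\0');

    return dest;
}

/* Hypothetical name: one-pointer loop with a fixed offset, same shape
   as the glibc fallback (bounds-checking macros left out). */
char *copy_with_offset(char *dest, const char *src)
{
    char *s = (char *)src;
    /* The -1 compensates for s already being incremented when the
       store happens, so s[off] addresses the matching byte of dest. */
    const ptrdiff_t off = dest - s - 1;
    char c;

    do {
        c = *s++;       /* the only pointer increment per iteration */
        s[off] = c;     /* store through base + constant offset */
    } while (c != '\0');

    return dest;
}
```

Both functions produce the same result. Whether the single-increment loop is actually faster depends on the compiler, the optimisation level, and the CPU.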

Note that the code you're looking at is the architecture-independent fallback implementation. glibc has overriding assembly implementations for many architectures (e.g. the x86-64 strcpy()).

Based on what I'm seeing, I'm not at all surprised that your code is faster.

Look at the loops: yours and glibc's are virtually identical, but glibc's has extra code before and after the loop.

In general, simple offsets do not slow down performance because x86 allows a fairly complicated indirect-addressing scheme. So both loops here will probably run at identical speeds.

EDIT: Here's my update with the added info you gave.

Your string is only 5 characters long. Even though the offset method "may" be slightly faster in the long run, it needs several operations to compute the offset before the loop starts, and that slows it down for short strings. Perhaps if you tried larger strings the gap would narrow and possibly vanish altogether.
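
One way to check this would be to time each function over a range of string lengths from inside the program, instead of relying on the time command. The following is only a sketch: copy_fn, fill_string, and time_copies are hypothetical helper names, and clock() reports CPU time with limited resolution, so the repetition count should be large.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical type: any function with the strcpy signature. */
typedef char *(*copy_fn)(char *dest, const char *src);

/* Fill buf with len 'x' characters plus a terminator.
   buf must have room for len + 1 bytes. */
void fill_string(char *buf, size_t len)
{
    memset(buf, 'x', len);
    buf[len] = '\0';
}

/* Return the CPU time in seconds used by reps calls of copy(dest, src). */
double time_copies(copy_fn copy, char *dest, const char *src, long reps)
{
    volatile char sink = 0;   /* keeps the copies from being optimised away */
    clock_t start = clock();

    for (long i = 0; i < reps; ++i) {
        copy(dest, src);
        sink ^= dest[0];
    }

    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_copies with my_strcpy1 and my_strcpy2 for lengths such as 5, 40, and 1024 would show whether the gap shrinks as the fixed setup cost is spread over more bytes.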

Here is my own optimization of strcpy. I think it achieved a 2x-3x speedup over the naive implementation, but it still needs to be benchmarked.

https://codereview.stackexchange.com/questions/30337/x86-strcpy-can-this-be-shortened/30348#30348