开发者

How can I know where the segment of memory is all Zero

开发者 https://www.devze.com 2022-12-13 18:38 出处:网络
I mean, I malloc a segment of memory, maybe 1k maybe 20bytes.. assume the pointer is pMem How can I know the content pMem refered is all Zero or \\0

I mean, I malloc a segment of memory, maybe 1k maybe 20bytes.. assume the pointer is pMem How can I know the content pMem refered is all Zero or \0 . I know memcmp but the second parameter should another m开发者_运维知识库emory address... thanx


As others have already suggested you probably want memset or calloc.

But if you actually do want to check if a memory area is all zero, you can compare it with itself but shifted by one.

bool allZero = pMem[0] == '\0' && !memcmp(pMem, pMem + 1, length - 1);

where length is the number of bytes you want to be zero.


Since Mark's answer has provoked some controversy:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#ifndef count
    #define count 1000*1000
#endif
#ifndef blocksize
    #define blocksize 1024
#endif

int checkzeros(char *first, char *last) {
    for (; first < last; ++first) {
        if (*first != 0) return 0;
    }
    return 1;
}

int main() {
    int i;
    int zeros = 0;

    #ifdef EMPTY
        /* empty test loop */
        for (i = 0; i < count; ++i) {
            char *p = malloc(blocksize);
            if (*p == 0) ++zeros;
            free(p);
        }
    #endif

    #ifdef LOOP
        /* simple check */
        for (i = 0; i < count; ++i) {
            char *p = malloc(blocksize);
            if (checkzeros(p, p + blocksize)) ++zeros;
            free(p);
        }
    #endif

    #ifdef MEMCMP
        /* memcmp check */
        for (i = 0; i < count; ++i) {
            char *p = malloc(blocksize);
            if (*p == 0 && !memcmp(p, p + 1, blocksize - 1)) ++zeros;
            free(p);
        }
    #endif

    printf("%d\n", zeros);
    return 0;
}

Results (cygwin, Windows XP, Core 2 Duo T7700 at 2.4 GHz):

$ gcc-4 cmploop.c -o cmploop -pedantic -Wall -O2 -DEMPTY && time ./cmploop
1000000

real    0m0.500s
user    0m0.405s
sys     0m0.000s

$ gcc-4 cmploop.c -o cmploop -pedantic -Wall -O2 -DLOOP && time ./cmploop
1000000

real    0m1.203s
user    0m1.233s
sys     0m0.000s

$ gcc-4 cmploop.c -o cmploop -pedantic -Wall -O2 -DMEMCMP && time ./cmploop
1000000

real    0m2.859s
user    0m2.874s
sys     0m0.015s

So, the memcmp is taking approximately (2.8 - 0.4) / (1.2 - 0.4) = 3 times as long, for me. It'd be interesting to see other people's results - all my malloced memory is zeroed, so I'm getting the worst-case time for each comparison, always.

With smaller blocks (and more of them) the comparison time is less significant, but memcmp is still slower:

$ gcc-4 cmploop.c -o cmploop -pedantic -Wall -O2 -DEMPTY -Dblocksize=20 -Dcount=10000000 && time ./cmploop
10000000

real    0m3.969s
user    0m3.780s
sys     0m0.030s

$ gcc-4 cmploop.c -o cmploop -pedantic -Wall -O2 -DLOOP -Dblocksize=20 -Dcount=10000000 && time ./cmploop
10000000

real    0m4.062s
user    0m3.968s
sys     0m0.015s

$ gcc-4 cmploop.c -o cmploop -pedantic -Wall -O2 -DMEMCMP -Dblocksize=20 -Dcount=10000000 && time ./cmploop
10000000

real    0m4.391s
user    0m4.296s
sys     0m0.015s

I am slightly surprised by this. I expected memcmp to at least compete, since I expect it to be inlined and to be optimised for small sizes known at compile time. Even changing it so that it tests an int at the start and then 16 bytes of memcmp, to avoid the unaligned worst case, doesn't speed it up much.


If you're testing it, and then only going to use it if it's zero, then be aware you have a race-condition, because the method suggested by @Mark Byers doesn't have an atomic test/set operation. It'll be hard to get the logic correct in this case.

If you want to zero it if it's not already zero, then just set it to zero as that will be faster.


C++ solution:

bool all_zeroes = 
  (find_if( pMem, pMem+len, bind2nd(greater<unsigned char>(), 0)) == (pMem+len));


As you've noted, memcmp compares one block of memory to another. If you had another block of memory that you already knew was all zero, then you could use that reference block to compare with your candidate block to see whether they match.

It sounds like you don't have another block of memory, though. You just have one, and you want to know whether it's all zero. The standard libraries don't provide such a function, but it's easy to write your own:

bool is_all_zero(char const* mem, size_t size)
{
  while (size-- > 0)
    if (*mem++)
      return false;
  return true;
}

If you want to allocate a new block of memory and have it be all zero right away, then use calloc instead of malloc. If you have a block of memory that you want to set to all zero, then use memset or std::fill.


For large buffers:

typedef int tNativeInt;  // __int64 on 64 bit platforms with 32 bit ints

// test most of the buffer using CPU Word size
tNativeInt * ptr = reinterpret_cast<tNativeInt *>(buffer);
tNativeInt * end = ptr + bufSize / sizeof(tNativeInt);
for(;ptr < end;++ptr)
  if (*ptr != 0)
    return false;

// check remainder
char * ptrc = reinterpret_cast<char *>(ptr);
char * endc = ptrc + bufSize % sizeof(tNativeInt);
for(; ptrc<endc; ++ptrc)
  if (*ptrc != 0)
    return false;

Remarks:
The core optimizations testing full CPU words - reading one is generally as expensive as a single byte.

The code assumes the buffer is well aligned (i.e. the address is a multiple of the CPU Word size). If not, a block similar to "remainder" needs to be put in front of the block test.

The additional code will of course be slower for small buffers - however, in this case one commonly assumes that these short durations don't matter.

If you prefer you can of course replace the loops with std::find_if.


Performance: 1: 3.9

(VS 2008, /Ox /Ot, 2,47 +/- 0,11 vs. 0,63 +/- 0,19, for 10000 repeats over 256000 bytes, statistics over 15 repeats first removed)

Discussion: From my experience analyzing C/C++ to assembly, I wouldn't expect a compiler to do this optimization - because it is a pessimization for small size, and optimization possibilities of that type are few and far between. The factor of roughly 4 confirms the assumpotion - as does looking at the disassembly.

Also, the total time is negligible for most applications, a cache miss will be far worse and affect both versions the same.

[edit] for the fun of it:

Overlapping memcmp clocks in at 1:1.4, which is vastly better than the single byte check (surprised me a little).

Note that the misaligned read makes that strongly platform dependent.


You can use calloc if you want all 0s in the memory you allocate


If you need the memory to be zero, just memset() it to zero; it's a waste of time to check whether it's zero first, because it probably isn't and you'll end up doing more work overall, checking and then setting.

However, if you really want to efficiently check a region of memory to see if it's all zero, you can memcmp() it against something else that's known to be all zero. You could allocate and zero one chunk of memory, for example, and then keep a pointer to it for use in comparing against other chunks of memory.


Another C++ solution, slightly simpler than Kirill's. It is somewhat less efficient, though, as it traverses the entire sequence even when that sequence contains a non-zero element.

bool all_zeroes = std::count(pMem, pMem + length, 0) == length;


More benchmarks:

Roughly based on Steve Jessop's sample. I've tested the following:

  • Naive memcmp (allocate a separate block of zeros, and compare against that)
  • "clever" memcmp ( compare the block against itself, shifted once)
  • std::find_if
  • A simple loop checking each byte

None of these make any assumptions about the array, and simply accept and work on an array of chars.

Finally, I made a fifth version, which casts the array to ints, and compares those. (This one obviously assumes that the size of the array is divisible by sizeof(int), so it is less general. But I added it to demonstrate that working with a reasonable chunk size is a much more useful optimization than messing around with memcpy and byte-wise comparisons.

Oh, and note that I just slapped this test together quickly, and I used one of Windows' timers because I'm lazy. :)

#include <cstdlib>
#include <string>
#include <cstdio>
#include <algorithm>

#include <windows.h>

enum {
    count = 1000*1000,
    blocksize = 1024
};

bool test_simple_loop(char* p){
    for (int i = 0; i < blocksize; ++i) {
        if (p[i] != 0) { return false; }
    }
    return true;
}

bool test_memcmp_clever(char* p){
    return *p == 0 && memcmp(p, p + 1, blocksize - 1) == 0;
}
bool test_memcmp_naive(char* p, char* ref){
    return memcmp(p, ref, blocksize) == 0;
}

struct cmp {
    template <typename T>
    bool operator()(T& x) {
        return x != 0;
    }
};
bool test_find_if(char* p){
    return std::find_if(p, p+blocksize, cmp()) == p+blocksize;
}

bool test_find_if_big(int* p){
    return std::find_if(p, p+blocksize, cmp()) == p+blocksize;
}

int main() {

    bool res = true;

    char *p = new char[blocksize];
    char *ref = new char[blocksize];

    std::fill(ref, ref+blocksize, 0);
    std::fill(p, p+blocksize, 0); // ensure the worst-case scenario, that we have to check the entire buffer. This will also load the array into CPU cache so the first run isn't penalized

    DWORD times[5];
    DWORD start;

    start = GetTickCount();
    for (int i = 0; i != count; ++i) {
        res &= test_memcmp_naive(p, ref);
    }
    times[0] = GetTickCount() - start;
    start = GetTickCount();
    for (int i = 0; i != count; ++i) {
        res &= test_memcmp_clever(p);
    }
    times[1] = GetTickCount() - start;
    start = GetTickCount();
    for (int i = 0; i != count; ++i) {
        res &= test_find_if(p);
    }
    times[2] = GetTickCount() - start;
    start = GetTickCount();
    for (int i = 0; i != count; ++i) {
        res &= test_simple_loop(p);
    }
    times[3] = GetTickCount() - start;
    start = GetTickCount();
    for (int i = 0; i != count; ++i) {
        res &= test_find_if_big(reinterpret_cast<int*>(p));
    }
    times[4] = GetTickCount() - start;

    delete[] p;
    delete[] ref;

    printf("%d\n", res);
    printf("%d\n%d\n%d\n%d\n%d\n", times[0], times[1], times[2], times[3], times[4]);
}

My results are: (in milliseconds, for a million runs)

Naive memcmp: 546
"Clever" memcmp: 530
`find_if<char>`: 1466
Simple loop: 1358
`find_if<int`>: 343

I think the takeaway point is pretty clear: Anything that does a bytewise comparison is slow. Really slow. Memcmp does more or less ok, but it's far from perfect. It is too general to be optimal.

The most efficient way to solve this problem is to process as much data at a time as possible. char is just silly. int is a good start, but 64- or 128-bit reads would probably perform far better still.


Poking around with Steve Jessops code for fun I found this variant

int checkzeros(char *f, char *l) {
        char a = 0;
        while (f < l) {
                a |= *f++;
        }
        if (a) {
                return 0;
        }
        return 1;
}

to be about 50% faster (no branches in the core loop). All bugs are mine.

[Edit]: As Steve points out, this version has a good worst case and a terrible best case (since they are the same). Only use it if a completely zeroed buffer is the only case that needs to be fast.


repe.S:

.globl repe_scasb
repe_scasb:
#if defined(__i386__)
        push   %edi
        mov    $0x0,%al
        mov    0xc(%esp),%ecx
        mov    0x8(%esp),%edi
        sub    %edi,%ecx
        repe scasb
        pop    %edi
        sete   %al
        ret
#elif defined(__amd64__)
        mov    $0x0,%al
        mov    %rsi,%rcx
        sub    %rdi,%rcx
        repe scasb
        sete   %al
        ret
#else
#  error "repe_scasb not defined for current architecture"
#endif

.globl repe_scas
repe_scas:
#if defined(__i386__)
        push   %edi
        mov    $0x0,%eax
        mov    0xc(%esp),%edx
        mov    0x8(%esp),%edi
        sub    %edi,%edx
repe_scas4:
        test   $0x3,%di
        jnz    repe_scas2
        cmp    $0x4,%edx
        jl     repe_scas2
        mov    %edx,%ecx
        shr    $0x2,%ecx
        repe scasl
        jne    repe_scas0
        and    $0x3,%edx
repe_scas2:
        test   $0x1,%di
        jnz    repe_scas1
        cmp    $0x2,%edx
        jl     repe_scas1
        scasw
        jnz    repe_scas0
        sub    $0x2,%edx
        jmp    repe_scas4
repe_scas1:
        test   %edx,%edx
        jz     repe_scas0
        scasb
        jnz    repe_scas0
        sub    $0x1,%edx
        jmp    repe_scas4
repe_scas0:
        pop    %edi
        sete   %al
        ret
#elif defined(__amd64__)
        mov    $0x0,%eax
        sub    %rdi,%rsi
repe_scas8:
        test   $0x7,%di
        jnz    repe_scas4
        cmp    $0x8,%rsi
        jl     repe_scas4
        mov    %rsi,%rcx
        shr    $0x3,%rcx
        repe scasq
        jne    repe_scas0
        and    $0x7,%rsi
repe_scas4:
        test   $0x3,%di
        jnz    repe_scas2
        cmp    $0x4,%rsi
        jl     repe_scas2
        scasl
        jnz    repe_scas0
        sub    $0x4,%rsi
        jmp    repe_scas8
repe_scas2:
        test   $0x1,%di
        jnz    repe_scas1
        cmp    $0x2,%rsi
        jl     repe_scas1
        scasw
        jnz    repe_scas0
        sub    $0x2,%rsi
        jmp    repe_scas8
repe_scas1:
        test   %rsi,%rsi
        jz     repe_scas0
        scasb
        jnz    repe_scas0
        sub    $0x1,%rsi
        jmp    repe_scas8
repe_scas0:
        sete   %al
        ret
#else
#  error "repe_scas not defined for current architecture"
#endif

test.c:

#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static int compar_double(const void *a, const void *b) {
    double diff = *(const double *)a - *(const double *)b;
    if (diff < 0) return -1;
    if (diff > 0) return 1;
    return 0;
}

static bool bit_or(const void *first, const void *last) {
    uint8_t a;
    for (a = 0; first < last; first = (const uint8_t *)first + 1)
        a |= *(const uint8_t *)first;
    return !a;
}

static bool use_memcmp(const void *first, const void *last) {
    return first >= last
        || !(((uint8_t *)first)[0]
        || memcmp(first, (const uint8_t *)first + 1,
            (const uint8_t *)last - (const uint8_t *)first - 1));
}

static bool check_bytes(const void *first, const void *last) {
    while (first < last) {
        if (*(const uint8_t *)first) return false;
        first = (const uint8_t *)first + 1;
    }
    return true;
}

static bool check_aligned(const void *first, const void *last) {
    switch ((uintptr_t)first & 7) while (first < last) {
        case 0:
            if (last - first >= 8) {
                if (*(const uint64_t *)first) return false;
                first = (const uint64_t *)first + 1;
                continue;
            }
        case 4:
            if (last - first >= 4) {
                if (*(const uint32_t *)first) return false;
                first = (const uint32_t *)first + 1;
                continue;
            }
        case 2: case 6:
            if (last - first >= 2) {
                if (*(const uint16_t *)first) return false;
                first = (const uint16_t *)first + 1;
                continue;
            }
        case 1: case 3: case 5: case 7:
            if (*(const uint8_t *)first) return false;
            first = (const uint8_t *)first + 1;
    }
    return true;
}

bool repe_scasb(const void *, const void *);
bool repe_scas(const void *, const void *);

static const struct {
    const char name[16];
    bool (*fn)(const void *, const void *);
} functions[] = {
    { "bit_or", bit_or },
    { "use_memcmp", use_memcmp },
    { "check_bytes", check_bytes },
    { "check_aligned", check_aligned },
    { "repe_scasb", repe_scasb },
    { "repe_scas", repe_scas },
};

#define REPS 999
#define BYTECYCLES 0xFFFF
void time_functions(const char *s, const void *first, const void *last, bool expect) {
    unsigned i, cycles = BYTECYCLES / (last - first) + 1;
    char seps[sizeof(functions) / sizeof(*functions)];
    double times[sizeof(functions) / sizeof(*functions)][REPS];

    for (i = 0; i < sizeof(functions) / sizeof(*functions); i++) {
        unsigned j;

        seps[i] = '/';
        for (j = 0; j < REPS; j++) {
            unsigned k;
            struct timespec start, finish;
            clock_gettime(CLOCK_MONOTONIC, &start);
            for (k = 0; k < cycles; k++)
                if (functions[i].fn(first, last) != expect) seps[i] = '!';
            clock_gettime(CLOCK_MONOTONIC, &finish);
            times[i][j] = ((finish.tv_sec - start.tv_sec) +
                    (finish.tv_nsec - start.tv_nsec) / 100000000.) / cycles;
        }
    }

    fputs(s, stdout);
    for (i = 0; i < sizeof(functions) / sizeof(*functions); i++) {
        qsort(times[i], REPS, sizeof(double), compar_double);
        printf("|%8.2e%c%8.2e", times[i][REPS / 4], seps[i], times[i][3 * REPS / 4]);
    }
    putchar('\n');
}

static size_t sizes[] = {0x7, 0x7, 0x7, 0x7, 0x400, 0x1000, 0x100000};
static uint8_t storage[0x100000];

int main() {
    unsigned i, j, k;

    fputs("      ", stdout);
    for (i = 0; i < sizeof(functions) / sizeof(*functions); i++)
        printf(" |%16.16s", functions[i].name);
    fputs("\n-------", stdout);
    for (i = 0; i < sizeof(functions) / sizeof(*functions); i++)
        fputs("+-----------------", stdout);
    putc('\n', stdout);

    for (j = 0, k = 8; j < sizeof(sizes) / sizeof(*sizes); j++) {
        char test[8];
        if (k /= 2) snprintf(test, sizeof(test), "0x%zX+%u", sizes[j], k);
        else snprintf(test, sizeof(test), "0x%zX", sizes[j]);
        printf("%-7.7s|\n", test);

        memset(storage + k, 0, sizes[j]);
        time_functions("  zero ", storage + k, storage + k + sizes[j], true);

        storage[k + sizes[j] - 1] = 1;
        time_functions("  last ", storage + k, storage + k + sizes[j], false);

        storage[k + sizes[j] / 2] = 1;
        time_functions("  half ", storage + k, storage + k + sizes[j], false);

        memset(storage + k, ~0, sizes[j]);
        time_functions("  first", storage + k, storage + k + sizes[j], false);
    }

    return 0;
}

Makefile:

CFLAGS ?= -g -O3 -W -Wall -Wextra
LIBS = -lrt
SRC := test.c repe.S
all: test32 test64
test32: $(SRC)
        $(CC) -m32 $(CFLAGS) $(LDFLAGS) -o $@ $^ $(LIBS)
test64: $(SRC)
        $(CC) -m64 $(CFLAGS) $(LDFLAGS) -o $@ $^ $(LIBS)
time: time32 time64
time32: test32
        ./test32
time64: test64
        ./test64
.PHONY: all time time32 time64

Instead of just the average, this test shows the 1st quartile and 3rd quartile of time taken, but it seems the system was fairly consistent (runs varying within ±2%).

$ make time
cc -m32 -g -O3 -W -Wall -Wextra  -o test32 test.c repe.S -lrt
./test32
       |          bit_or |      use_memcmp |     check_bytes |   check_aligned |      repe_scasb |       repe_scas
-------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------
0x7+4  |
  zero |1.09e-07/1.09e-07|2.09e-07/2.11e-07|1.66e-07/1.68e-07|1.35e-07/1.74e-07|1.83e-07/1.86e-07|2.00e-07/2.06e-07
  last |1.09e-07/1.09e-07|2.09e-07/2.12e-07|1.00e-07/1.00e-07|1.18e-07/1.18e-07|1.83e-07/1.85e-07|1.83e-07/1.85e-07
  half |1.09e-07/1.09e-07|1.79e-07/1.84e-07|7.42e-08/7.42e-08|6.98e-08/6.98e-08|1.57e-07/1.59e-07|1.39e-07/1.40e-07
  first|1.09e-07/1.09e-07|6.11e-08/6.11e-08|4.81e-08/4.81e-08|6.98e-08/6.99e-08|1.22e-07/1.27e-07|1.39e-07/1.42e-07
0x7+2  |
  zero |1.09e-07/1.09e-07|2.09e-07/2.11e-07|1.66e-07/1.71e-07|1.31e-07/1.57e-07|1.83e-07/1.85e-07|2.00e-07/2.06e-07
  last |1.09e-07/1.09e-07|2.09e-07/2.14e-07|1.00e-07/1.00e-07|1.22e-07/1.24e-07|1.83e-07/1.88e-07|1.83e-07/1.83e-07
  half |1.09e-07/1.09e-07|1.79e-07/1.80e-07|7.42e-08/7.42e-08|8.72e-08/8.72e-08|1.57e-07/1.59e-07|1.61e-07/1.66e-07
  first|1.09e-07/1.09e-07|6.11e-08/6.11e-08|4.81e-08/4.81e-08|6.55e-08/6.55e-08|1.22e-07/1.22e-07|5.82e-08/5.82e-08
0x7+1  |
  zero |1.09e-07/1.09e-07|2.09e-07/2.14e-07|1.66e-07/1.66e-07|1.09e-07/1.42e-07|1.83e-07/1.88e-07|2.05e-07/2.07e-07
  last |1.09e-07/1.09e-07|2.09e-07/2.14e-07|1.00e-07/1.00e-07|1.00e-07/1.00e-07|1.83e-07/1.87e-07|1.92e-07/1.97e-07
  half |1.09e-07/1.09e-07|1.79e-07/1.81e-07|7.42e-08/7.42e-08|7.85e-08/7.86e-08|1.57e-07/1.61e-07|1.92e-07/1.97e-07
  first|1.09e-07/1.09e-07|6.11e-08/6.11e-08|4.81e-08/4.81e-08|5.24e-08/5.24e-08|1.22e-07/1.22e-07|6.55e-08/6.55e-08
0x7    |
  zero |1.09e-07/1.09e-07|2.09e-07/2.14e-07|1.66e-07/1.71e-07|1.52e-07/1.79e-07|1.83e-07/1.88e-07|2.00e-07/2.06e-07
  last |1.09e-07/1.09e-07|2.09e-07/2.14e-07|1.00e-07/1.00e-07|1.44e-07/1.70e-07|1.83e-07/1.88e-07|1.83e-07/1.85e-07
  half |1.09e-07/1.09e-07|1.79e-07/1.79e-07|7.42e-08/7.42e-08|7.85e-08/7.85e-08|1.57e-07/1.57e-07|1.39e-07/1.39e-07
  first|1.09e-07/1.09e-07|6.11e-08/6.11e-08|4.81e-08/4.81e-08|7.85e-08/7.85e-08|1.22e-07/1.22e-07|1.39e-07/1.39e-07
0x400  |
  zero |9.06e-06/9.06e-06|9.97e-06/9.97e-06|1.79e-05/1.81e-05|2.93e-06/2.93e-06|9.06e-06/9.07e-06|2.41e-06/2.41e-06
  last |9.06e-06/9.06e-06|9.97e-06/9.97e-06|1.80e-05/1.80e-05|2.39e-06/2.39e-06|9.06e-06/9.07e-06|2.40e-06/2.40e-06
  half |9.06e-06/9.06e-06|5.08e-06/5.08e-06|9.06e-06/9.06e-06|1.29e-06/1.29e-06|4.62e-06/4.62e-06|1.30e-06/1.30e-06
  first|9.06e-06/9.06e-06|9.25e-08/9.67e-08|8.34e-08/9.67e-08|1.05e-07/1.06e-07|1.58e-07/1.58e-07|1.75e-07/1.75e-07
0x1000 |
  zero |3.59e-05/3.59e-05|3.95e-05/3.95e-05|7.15e-05/7.15e-05|1.14e-05/1.14e-05|3.59e-05/3.59e-05|9.20e-06/9.20e-06
  last |3.59e-05/3.59e-05|3.95e-05/3.95e-05|3.59e-05/3.59e-05|9.18e-06/9.18e-06|3.59e-05/3.59e-05|9.19e-06/9.19e-06
  half |3.59e-05/3.59e-05|1.99e-05/1.99e-05|1.81e-05/1.81e-05|4.74e-06/4.74e-06|1.81e-05/1.81e-05|4.74e-06/4.75e-06
  first|3.59e-05/3.59e-05|2.04e-07/2.04e-07|2.04e-07/2.04e-07|2.13e-07/2.13e-07|2.65e-07/2.66e-07|2.82e-07/2.82e-07
0x10000|
  zero |9.52e-03/1.07e-02|1.14e-02/1.17e-02|1.94e-02/2.04e-02|3.43e-03/3.52e-03|9.59e-03/1.07e-02|3.10e-03/3.17e-03
  last |9.57e-03/1.07e-02|1.14e-02/1.17e-02|9.73e-03/1.08e-02|3.04e-03/3.13e-03|9.57e-03/1.05e-02|3.11e-03/3.22e-03
  half |9.54e-03/1.06e-02|5.06e-03/5.13e-03|4.69e-03/4.76e-03|1.17e-03/1.17e-03|4.60e-03/4.65e-03|1.18e-03/1.18e-03
  first|9.55e-03/1.07e-02|2.28e-06/2.29e-06|2.26e-06/2.27e-06|2.28e-06/2.29e-06|2.34e-06/2.35e-06|2.36e-06/2.36e-06
cc -m64 -g -O3 -W -Wall -Wextra  -o test64 test.c repe.S -lrt
./test64
       |          bit_or |      use_memcmp |     check_bytes |   check_aligned |      repe_scasb |       repe_scas
-------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------
0x7+4  |
  zero |9.14e-08/9.14e-08|1.65e-07/1.65e-07|1.17e-07/1.17e-07|1.26e-07/1.26e-07|1.52e-07/1.52e-07|2.57e-07/2.67e-07
  last |9.14e-08/9.14e-08|1.65e-07/1.65e-07|1.04e-07/1.17e-07|1.09e-07/1.09e-07|1.52e-07/1.52e-07|1.70e-07/1.70e-07
  half |9.14e-08/9.14e-08|1.35e-07/1.35e-07|7.83e-08/7.83e-08|5.66e-08/5.66e-08|1.26e-07/1.26e-07|7.83e-08/7.83e-08
  first|9.14e-08/9.14e-08|4.79e-08/4.79e-08|5.23e-08/5.23e-08|5.66e-08/5.66e-08|1.00e-07/1.00e-07|7.83e-08/7.83e-08
0x7+2  |
  zero |9.14e-08/9.14e-08|1.65e-07/1.65e-07|1.17e-07/1.17e-07|1.26e-07/1.26e-07|1.52e-07/1.52e-07|2.30e-07/2.57e-07
  last |9.14e-08/9.14e-08|1.65e-07/1.65e-07|1.04e-07/1.04e-07|1.09e-07/1.09e-07|1.52e-07/1.52e-07|2.22e-07/2.22e-07
  half |9.14e-08/9.14e-08|1.35e-07/1.35e-07|7.83e-08/7.83e-08|7.83e-08/7.83e-08|1.26e-07/1.26e-07|1.09e-07/1.09e-07
  first|9.14e-08/9.14e-08|4.79e-08/4.79e-08|5.23e-08/5.23e-08|5.66e-08/5.66e-08|1.00e-07/1.00e-07|7.40e-08/7.40e-08
0x7+1  |
  zero |9.14e-08/9.14e-08|1.65e-07/1.65e-07|1.17e-07/1.17e-07|1.17e-07/1.17e-07|1.52e-07/1.52e-07|2.30e-07/2.32e-07
  last |9.14e-08/9.14e-08|1.65e-07/1.65e-07|1.04e-07/1.04e-07|1.04e-07/1.13e-07|1.52e-07/1.52e-07|1.52e-07/1.52e-07
  half |9.14e-08/9.14e-08|1.35e-07/1.35e-07|7.83e-08/7.83e-08|7.40e-08/7.40e-08|1.26e-07/1.26e-07|1.52e-07/1.52e-07
  first|9.14e-08/9.14e-08|3.92e-08/3.92e-08|4.36e-08/4.36e-08|4.79e-08/4.79e-08|1.00e-07/1.00e-07|7.83e-08/7.83e-08
0x7    |
  zero |9.14e-08/9.14e-08|1.65e-07/1.65e-07|1.17e-07/1.17e-07|1.52e-07/1.52e-07|1.52e-07/1.52e-07|2.39e-07/2.65e-07
  last |9.14e-08/9.14e-08|1.65e-07/1.65e-07|1.04e-07/1.04e-07|1.26e-07/1.26e-07|1.52e-07/1.52e-07|1.70e-07/1.70e-07
  half |9.14e-08/9.14e-08|1.35e-07/1.35e-07|7.83e-08/7.83e-08|6.53e-08/6.53e-08|1.26e-07/1.26e-07|8.70e-08/8.70e-08
  first|9.14e-08/9.14e-08|4.79e-08/4.79e-08|5.23e-08/5.23e-08|6.53e-08/6.53e-08|1.00e-07/1.00e-07|8.70e-08/8.70e-08
0x400  |
  zero |9.04e-06/9.04e-06|9.90e-06/9.90e-06|9.03e-06/9.05e-06|2.36e-06/2.36e-06|9.01e-06/9.01e-06|1.25e-06/1.25e-06
  last |9.04e-06/9.04e-06|9.90e-06/9.90e-06|9.03e-06/9.03e-06|2.35e-06/2.35e-06|9.01e-06/9.01e-06|1.23e-06/1.23e-06
  half |9.04e-06/9.04e-06|5.02e-06/5.02e-06|4.59e-06/4.59e-06|1.25e-06/1.25e-06|4.57e-06/4.57e-06|6.84e-07/6.84e-07
  first|9.04e-06/9.04e-06|6.19e-08/7.47e-08|7.91e-08/7.92e-08|7.03e-08/7.05e-08|1.14e-07/1.15e-07|1.27e-07/1.28e-07
0x1000 |
  zero |3.61e-05/3.61e-05|3.93e-05/3.93e-05|3.58e-05/3.58e-05|9.08e-06/9.08e-06|3.58e-05/3.58e-05|4.64e-06/4.64e-06
  last |3.61e-05/3.61e-05|3.93e-05/3.93e-05|3.58e-05/3.58e-05|9.07e-06/9.07e-06|3.58e-05/3.58e-05|4.61e-06/4.61e-06
  half |3.61e-05/3.61e-05|1.97e-05/1.97e-05|1.80e-05/1.80e-05|4.63e-06/4.63e-06|1.80e-05/1.80e-05|2.40e-06/2.40e-06
  first|3.61e-05/3.61e-05|1.04e-07/1.17e-07|1.21e-07/1.21e-07|1.26e-07/1.26e-07|1.58e-07/1.58e-07|1.71e-07/1.71e-07
0x10000|
  zero |1.08e-02/1.09e-02|1.03e-02/1.04e-02|9.38e-03/9.50e-03|2.41e-03/2.49e-03|9.33e-03/9.50e-03|1.67e-03/1.73e-03
  last |1.08e-02/1.09e-02|1.03e-02/1.04e-02|9.38e-03/9.49e-03|2.44e-03/2.55e-03|9.33e-03/9.47e-03|1.62e-03/1.67e-03
  half |1.08e-02/1.10e-02|5.05e-03/5.12e-03|4.61e-03/4.69e-03|1.16e-03/1.16e-03|4.59e-03/4.66e-03|6.63e-04/6.65e-04
  first|1.08e-02/1.09e-02|8.70e-07/8.80e-07|8.70e-07/8.80e-07|8.90e-07/9.00e-07|9.60e-07/9.60e-07|9.70e-07/9.80e-07
$ uname -srvmp
Linux 2.6.32-gentoo #1 SMP Sun Dec 6 16:24:50 EST 2009 x86_64 AMD Phenom(tm) 9600 Quad-Core Processor

As you can see,

  • for short data, simple is better
  • memcmp is actually pretty good at crawling through memory (at least with whatever optimizations GCC 4.4.2 applies)
  • x86's string intrinsics help a little
  • the biggest gain on long data comes from taking larger strides — avoid byte-by-byte access if at all possible.


Another C++ solution:

bool all_zero( const char * buf, size_t len )
{
    return (buf == std::search_n( buf, buf+len, len, 0 ) );
}

search_n returns the first occurrence of a sequence of "len" consecutive occurrences of the value (here 0) or it returns "end". In this particular application it obviously either returns the start of the sequence or the end.

The advantage of this is I can also apply it to an array of ints, doubles, etc. which will be able to check word-by-word. (Obviously make all_zero a template to do that).


The function you want is called memset.

Call it like this:

memset(pMem, '\0', iMemSize);
0

精彩评论

暂无评论...
验证码 换一张
取 消