Using SuperFastHash instead of CRC32?

Note: I'm not trying to use SuperFastHash and expecting it to give the same output values as CRC32.

I'm writing a simple LZSS compression/decompression routine to provide very fast decompression with no memory overhead when decompressing. Input data is split into blocks of 4096 bytes and compressed sequentially.

My problem: I want to add some error detection for each compressed block (block size <= 4096 bytes). The time constraint is tight, and therefore the checksum routine must be very fast. I avoided the cryptographic algorithms (MD5, SHA1) because they involve a lot of computation, and I chose CRC32 (a simpler and more obvious algorithm).
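
For illustration, here is a minimal sketch of how the per-block checksum could be attached and verified; the block_header layout and verify_block name are hypothetical, and the checksum function is passed in so either CRC32 or a hash can be plugged in:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block layout: each compressed block is preceded by a small
   header carrying its compressed size and a 32-bit checksum of its payload. */
typedef struct
{
    uint16_t compressed_size ;   /* <= 4096 */
    uint32_t checksum ;          /* CRC32 or hash of the compressed payload */
} block_header ;

/* Returns 1 if the stored checksum matches the recomputed one, 0 otherwise. */
static int verify_block(const block_header * hdr, const uint8_t * payload,
                        uint32_t (*checksum_fn)(const uint8_t *, size_t))
{
    return checksum_fn(payload, hdr->compressed_size) == hdr->checksum ;
}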

After making some tests, I found CRC32 too slow for my project constraints. I used enwik9 (a 10^9-byte text dump of Wikipedia) from here. I compressed it using my LZSS routine and obtained a 570 MB file. I measured the following durations (single-threaded, disk I/O excluded, all data loaded in memory before processing, average of 10 trials):

|          Operation            |  Time (GCC4.4.5/Linux)   |  Time (MSVC2010/Win7)  |
|-------------------------------+--------------------------+------------------------|
|        Decompression          |        6.8 seconds       |      6.95 seconds      |
|  CRC32 on decompressed result |        4.9 seconds       |      4.62 seconds      |
|   CRC32 on compressed result  |        2.8 seconds       |      2.69 seconds      |
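
For reference, a minimal sketch of the kind of timing harness described above (data already in memory, single-threaded, result averaged over several trials); the time_checksum name is made up for this example, and crc32_hash is the routine shown further down:

#include <stddef.h>
#include <stdint.h>
#include <time.h>

uint32_t crc32_hash(const uint8_t * data, size_t len) ;   /* defined below */

/* Average the duration of one checksum pass over 'trials' runs. */
double time_checksum(const uint8_t * data, size_t len, int trials)
{
    volatile uint32_t sink = 0 ;        /* keeps the calls from being optimized away */
    clock_t start = clock() ;
    for (int t = 0 ; t < trials ; t++)
        sink ^= crc32_hash(data, len) ;
    clock_t end = clock() ;
    (void) sink ;
    return (double)(end - start) / CLOCKS_PER_SEC / trials ;
}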

Then, just out of curiosity, I tested SuperFastHash:

|          Operation            |  Time (GCC4.4.5/Linux)   |  Time (MSVC2010/Win7)  |
|-------------------------------+--------------------------+------------------------|
|  SFH on decompressed result   |        1.1 seconds       |      1.33 seconds      |
|   SFH on compressed result    |        0.7 seconds       |      0.75 seconds      |

And here is my CRC32 implementation (I followed the description in the following document: http://www.ross.net/crc/download/crc_v3.txt):

#include <stddef.h>   /* size_t */
#include <stdint.h>

// CRC32 lookup table (reflected form of the polynomial 0x04C11DB7)
static const uint32_t  crc32_lookup_table[256] =
{
    0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
    0x076DC419, 0x706AF48F, 0xE963A535, 0x9E6495A3,
    0x0EDB8832, 0x79DCB8A4, 0xE0D5E91E, 0x97D2D988,
    // many lines skipped
    // ...
    0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D
} ;

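// Byte-at-a-time, table-driven CRC-32 (reflected); initial value and final XOR are both 0xFFFFFFFF.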
uint32_t crc32_hash(const uint8_t * data, size_t len)
{
    uint32_t crc32_register = 0xFFFFFFFF ;
    while( len-- )
    {
        crc32_register = (crc32_register >> 8)
                       ^ crc32_lookup_table[(crc32_register & 0x000000FF) ^ *data++] ;
    }
    return crc32_register ^ 0xFFFFFFFF ;
}
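
For completeness, the 256-entry table above can also be generated at run time from the reflected polynomial 0xEDB88320 (0x04C11DB7 with its bits reversed); a short sketch:

// Builds the same reflected CRC-32 lookup table as the literal above.
static void crc32_build_table(uint32_t table[256])
{
    for (uint32_t i = 0 ; i < 256 ; i++)
    {
        uint32_t c = i ;
        for (int k = 0 ; k < 8 ; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320 : (c >> 1) ;
        table[i] = c ;
    }
}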

My question is:

Can I use a hash instead of a cyclic redundancy check value to perform error detection in compressed data blocks? As far as I know (and as I remember from my electronics course), CRC algorithms are designed to be very efficient when errors occur in bursts as data is transmitted over a noisy channel, which is not the case for data read from hard drives. Please correct me if I'm wrong.

Thanks for any advice!


SuperFastHash has been found to have some problems, along with its fellow fast hashing function MurmurHash2. If you're looking for something tuned for bigger data blocks with low collisions, you can try the 128-bit variants of Google's CityHash (http://code.google.com/p/cityhash/) or MurmurHash3. There are also some more off-the-wall ones like Crap8 and CrapWow, which claim to provide almost perfect bit avalanching and funneling and thus almost zero collisions; you can read up on them and other non-cryptographic hash functions here: http://www.team5150.com/~andrew/noncryptohashzoo/


Since your problem is not about security, you can use "broken" cryptographic hash functions, which are not secure against a sentient attacker but are still very good at detecting transmission errors. I am thinking of MD4, which has been measured to be faster than CRC32 on some platforms. You may also want to check RadioGatún and Panama; see this library for open-source implementations in C and Java of various cryptographic hash functions.
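
As a rough illustration of the MD4 suggestion, here is a sketch that truncates the 128-bit digest to a 32-bit checksum, assuming OpenSSL's one-shot MD4() is available (link with -lcrypto; note that MD4 is deprecated in OpenSSL 3.0 and may require the legacy provider):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <openssl/md4.h>

/* MD4 digest truncated to 32 bits: fine for error detection, not for security. */
static uint32_t md4_checksum32(const uint8_t * data, size_t len)
{
    unsigned char digest[MD4_DIGEST_LENGTH] ;   /* 16 bytes */
    MD4(data, len, digest) ;
    uint32_t out ;
    memcpy(&out, digest, sizeof out) ;          /* keep the first 4 bytes */
    return out ;
}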

If your target architecture is a recent/big enough x86 CPU with the AES-NI instructions, then you could make a devilishly fast and very good checksum by simply computing a CBC-MAC with the AES block cipher and a conventional key (e.g. an all-zero key); since this is not for security, you could even use fewer rounds than standard AES (e.g. 5 rounds instead of the standard 10).
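
A sketch of that idea, assuming an x86 compiler with AES-NI support (compile with -maes); the round keys below are arbitrary constants rather than a real AES key schedule, only 5 rounds are applied, and any tail shorter than 16 bytes is ignored for brevity, since the goal is error detection rather than security:

#include <stddef.h>
#include <stdint.h>
#include <wmmintrin.h>   /* AES-NI intrinsics */

/* Reduced-round AES CBC-MAC used as a fast checksum (NOT cryptographically secure). */
static __m128i aesni_cbcmac(const uint8_t * data, size_t len)
{
    __m128i rk[5] ;
    for (int r = 0 ; r < 5 ; r++)                 /* arbitrary "round keys" */
        rk[r] = _mm_set_epi32(r + 1, (r + 1) * 3, (r + 1) * 5, (r + 1) * 7) ;

    __m128i mac = _mm_setzero_si128() ;
    for (size_t i = 0 ; i + 16 <= len ; i += 16)
    {
        __m128i block = _mm_loadu_si128((const __m128i *)(data + i)) ;
        mac = _mm_xor_si128(mac, block) ;         /* CBC chaining */
        for (int r = 0 ; r < 4 ; r++)
            mac = _mm_aesenc_si128(mac, rk[r]) ;  /* 4 middle rounds */
        mac = _mm_aesenclast_si128(mac, rk[4]) ;  /* final round */
    }
    return mac ;   /* 128-bit tag; truncate to 32 or 64 bits for a checksum */
}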


Hash functions are designed to produce a large change in the output even for very small alterations of the input.

I think SuperFastHash has this property. It might be a bit more vulnerable to collisions (since it seems to have received less analysis from the community), but that shouldn't prevent the use you have in mind.

Good luck :)
