CRC-32(c) okay for large files? (up to 100 MB)

Developer https://www.devze.com 2023-03-05 01:28 Source: Internet
I want to keep checksums for a collection of files in order to notice silent corruption / bit rot, because my filesystem (ext4) doesn't care and btrfs isn't quite trustworthy yet, I think.

The files are up to about 100 MB in size each, but usually around 2 to 10 MB. Is CRC-32(C) alright for this use? Which one is safer? (Maybe scrap the CRCs altogether and use MD4 instead?) The paper "32-Bit Cyclic Redundancy Codes for Internet Applications", which introduces CRC-32C, only considers messages up to 128 KiB:

http://www.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koopman.pdf

I'd like to avoid breaking the files up in little blocks and hashing those.


CRC-32 or CRC-32C should be fine. For better protection at little extra computational cost (especially on a 64-bit platform), I would use a 64-bit CRC (CRC-64). These can be found on the Wikipedia page or by googling. If you are concerned about corruption rather than malice, then MD5 and SHA-512 are not any better than CRCs, and are much slower to compute.
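
For illustration, here is a minimal sketch of checksumming a whole file without keeping per-block hashes. It assumes Python and uses `zlib.crc32`, which is the plain CRC-32 (not CRC-32C or CRC-64, which the standard library does not provide); the function name `file_crc32` and the 1 MiB chunk size are my own choices.

```python
import zlib

def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC-32 of a whole file, read in chunks to bound memory use."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Passing the running value continues the same CRC across chunks,
            # so the result equals the CRC-32 of the file as one message.
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF
```

Reading in chunks is only a memory optimisation: one checksum per file is stored, regardless of the chunk size.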


Depends on what you mean by "safer" and how paranoid you are.

If I wanted to do something similar, I'd pick MD5 or SHA-512.

Happily, there are already applications that do this, like Tripwire.
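
If you would rather roll your own, the idea can be sketched in a few lines. This is not how Tripwire works internally, just a rough illustration assuming Python's `hashlib`; the helper names `file_digest`, `build_manifest` and `changed_files` are hypothetical.

```python
import hashlib
from pathlib import Path

def file_digest(path, algo="sha512", chunk_size=1 << 20):
    """Hash one file in chunks and return the hex digest."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root, algo="sha512"):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p, algo)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def changed_files(root, manifest, algo="sha512"):
    """Return manifest entries that are missing or whose digest no longer matches."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or file_digest(p, algo) != expected:
            bad.append(rel)
    return bad
```

Note that a mismatch only says the content changed. Telling bit rot apart from an intentional edit needs extra information, such as the modification time recorded alongside the digest.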


If you are fine with only single-bit errors being detectable (just one bit flip), then CRC-32 is acceptable, and in fact better than most 32-bit hashes. A CRC provides better error detection only on small packets of data (usually 7000 bytes maximum!).

However, single-bit errors can be detected equally well by a single parity bit or a CRC-8, so CRC-32 is sub-optimal for this use case.

If you wish to protect against other kinds of corruption, you may want to use MD4 or a similarly sized (128-bit) hash such as MurmurHash3. But understand that these give no guarantees of error detection; rather, you are relying on the chance of a collision, which becomes lower the more bits the hash contains. For example, if you can afford a 256-bit hash, there is a very low chance that "bit rot" will go undetected, because 256 bits provide a very low chance of collision.
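
To put rough numbers on that: assuming the hash behaves like a uniformly random function and the corruption is random (both idealisations), a corrupted file keeps the same n-bit checksum with probability about 2^-n. A small sketch, with a function name of my own choosing:

```python
def undetected_probability(bits):
    """Chance that a random corruption leaves an ideal `bits`-bit checksum unchanged."""
    return 2.0 ** -bits

for bits in (32, 64, 128, 256):
    print(f"{bits:3d}-bit checksum: {undetected_probability(bits):.3e}")
```

This is a per-corruption figure. It says nothing about how often corruption happens in the first place, or about structured error patterns where a CRC can give hard guarantees.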

A potentially useful solution could be to combine a CRC-8 or CRC-16 with any 64-bit, 96-bit or 128-bit hash (you can also just truncate any 128-bit hash to the desired size). You are then guaranteed to detect single-bit errors (something pseudo-random hashes usually do not guarantee), and you also get the collision resistance afforded by the hash size you've chosen.
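
A rough sketch of such a combined tag, assuming Python: `binascii.crc_hqx` supplies a 16-bit CRC (CRC-CCITT polynomial), and a truncated SHA-256 stands in for the MD4/MurmurHash3-style hash, since neither of those is reliably available in the standard library. The function name `combined_tag` and the 96-bit truncation are illustrative choices only.

```python
import binascii
import hashlib

def combined_tag(data: bytes, hash_bytes: int = 12) -> bytes:
    """Return a 2-byte CRC-16 followed by the first `hash_bytes` bytes of SHA-256."""
    # 16-bit CRC with the CCITT polynomial, initial value 0.
    crc16 = binascii.crc_hqx(data, 0)
    # Truncated hash: 12 bytes = 96 bits by default.
    digest = hashlib.sha256(data).digest()[:hash_bytes]
    return crc16.to_bytes(2, "big") + digest
```

For files in the 100 MB range you would feed the data in chunks instead of passing one `bytes` object (`crc_hqx` accepts a running value as its second argument, and hash objects have `update`); the in-memory version is kept only for brevity.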
