I need to compare two different files of the instance "File" in Java and wan开发者_如何学Ct to do this with a fast hash function.
Idea: - Hashing the 20 first lines in File 1 - Hashing the 20 first lines in File 2 - Compare the two hashes and return true if those are equal.
I want to use the "fastest" hash function ever been implemented in Java. Which one would you choose?
If you want speed, do not hash! Especially not a cryptographic hash like MD5. These hashes are designed to be impossible to reverse, not fast to calculate. What you should use is a Checksum - see java.util.zip.Checksum
and its two concrete implementations. Adler32 is extremely fast to compute.
Any method based on checksums or hashes is vulnerable to collisions, but you can minimise the risk by using two different methods in the way RSYNC does.
The algorithm is basically:
- Check file sizes are equal
- Break the files into chunks of size N bytes
- Compute checksum on each pair of matching blocks and compare. Any differences prove files are not the same.
This allows for early detection of a difference. You can improve it by computing two checksums at once with different algorithms, or different block sizes.
More bits in the result mean less chance of a collision, but as soon as you go over 64 bits you are outside what Java (and the computer's CPU) can handle natively and hence get slow, so FNV-1024 is less likely to give you a false negative but is much slower.
If it is all about speed, just use Adler32 and accept that very rarely a difference will not be detected. It really is rare. Checksums like these are used to ensure the internet can spot transmission errors, and how often do you get the wrong data turning up?
It it is all about accuracy really, you will have to compare every byte. Nothing else will work.
If you can compromise between speed and accuracy, there is a wealth of options out there.
If you're comparing two files at the same time on the same system there's no need to hash both of them. Just compare the bytes in both files are equal as you read both. If you're looking to compare them at different times or they're in different places then MD5 would be both fast and adequate. There's not much reason to need a faster one unless you're dealing with really large files. Even my laptop can hash hundreds of megabytes per second.
You also need to hash the whole file if you want to verify they're identical. Otherwise you might as well just check the size and last modified time if you want a really quick check. You could also check the beginning and end of the file if they're just really large and you trust that the middle won't be changing. If you're not dealing with hundreds of megabytes though, you may as well check every byte of each file.
精彩评论