
Is SHA sufficient for checking file duplication? (sha1_file in PHP)


Suppose you wanted to make a file hosting site where people upload their files and send a link to their friends to retrieve them later, and you want to ensure files are not duplicated where you store them. Is PHP's sha1_file good enough for the task? Is there any reason not to use md5_file instead?

For the frontend, it'll be obscured: the original file name is stored in a database, but an additional concern is whether this would reveal anything about the original poster. Does a file carry any meta information with it, like last modified or who posted it, or is that stuff kept in the file system?

Also, is using a salt frivolous, since security with regard to rainbow table attacks means nothing here, and the hash could later be used as a checksum?

One last thing: scalability? Initially it's only going to be used for small files, a couple of megs big, but eventually...

Edit 1: The point of the hash is primarily to avoid file duplication, not to create obscurity.


sha1_file good enough?

Using sha1_file is mostly enough; there is a very small chance of collision, but it will almost never happen in practice. To reduce the chance to almost 0, compare file sizes too:

function is_duplicate_file($file1, $file2)
{
    // Cheap check first: files of different sizes cannot be identical.
    if (filesize($file1) !== filesize($file2)) return false;

    // Same size: compare SHA1 hashes of the contents (strict comparison).
    return sha1_file($file1) === sha1_file($file2);
}
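
For example, at upload time this could be used to check a freshly uploaded temp file against a file already on disk (the paths below are hypothetical):

// Hypothetical paths: the uploaded temp file and a previously stored file.
$uploaded = $_FILES['upload']['tmp_name'];
$existing = '/var/files/storage/earlier-upload';

if (is_duplicate_file($uploaded, $existing)) {
    // Reuse the existing copy instead of storing the same bytes twice.
}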

md5 is faster than sha1, but it produces a shorter digest (128 bits vs. 160 bits), so its output is less unique; the chance of a collision when using md5 is still very small, though.

Scalability?

There are several methods to compare files; which method to use depends on what your performance concerns are. I made a small test of the different methods:

1- Direct file compare:

if( file_get_contents($file1) != file_get_contents($file2) )

2- sha1_file

if( sha1_file($file1) != sha1_file($file2) )

3- md5_file

if( md5_file($file1) != md5_file($file2) )

The results: two files of 1.2 MB each were compared 100 times, giving the following:

---------------------------------------------------------------
 method                  time (s)        peak memory (bytes)
---------------------------------------------------------------
file_get_contents          0.5              2,721,576
sha1_file                  1.86               142,960
md5_file                   1.6                142,848

file_get_contents was the fastest, about 3.7 times faster than sha1_file, but it is not memory efficient.

sha1_file and md5_file are memory efficient; they used about 5% of the memory used by file_get_contents.

md5_file might be a better option because it is a little faster than sha1.

So the conclusion is that it depends on whether you want a faster comparison or lower memory usage.
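
A minimal sketch of how a timing test like this could be run, using microtime() and memory_get_peak_usage(); the two 1.2 MB test file paths are assumptions:

// Hypothetical test files of roughly 1.2 MB each.
$file1 = 'test1.bin';
$file2 = 'test2.bin';

$start = microtime(true);
for ($i = 0; $i < 100; $i++) {
    // Swap in md5_file() or file_get_contents() to benchmark the other methods.
    $same = (sha1_file($file1) === sha1_file($file2));
}
printf("time: %.2fs, peak memory: %d bytes\n",
    microtime(true) - $start, memory_get_peak_usage());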


As per my comment on @ykaganovich's answer, SHA1 is (surprisingly) slightly faster than MD5.

From your description of the problem, you are not trying to create a secure hash - merely to hide the file in a large namespace - in which case the use of a salt / rainbow tables is irrelevant; the only consideration is the likelihood of a false collision (where 2 different files give the same hash). The probability of this happening with md5 is very, very remote. It's even more remote with sha1. However, you do need to think about what happens when 2 independent users upload the same warez to your site. Who owns the file?

In fact, there doesn't seem to be any reason at all to use a hash - just generate a sufficiently long random value.
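
As a sketch of that approach, a random identifier for the download link could be generated with random_bytes() (the 16-byte length here is just an assumption):

// 128-bit random identifier for the download URL, unrelated to the file contents.
$token = bin2hex(random_bytes(16));   // a 32-character hex string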


SHA should do just fine in any "normal" environment. Although this is what Ben Lynn, the author of "Git Magic", has to say:

A.1. SHA1 Weaknesses As time passes, cryptographers discover more and more SHA1 weaknesses. Already, finding hash collisions is feasible for well-funded organizations. Within years, perhaps even a typical PC will have enough computing power to silently corrupt a Git repository. Hopefully Git will migrate to a better hash function before further research destroys SHA1.

You can always check out SHA256, or other hashes that are even longer. Finding an MD5 collision is easier than finding a SHA1 collision.
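
If you do want a longer digest, PHP's hash_file() supports sha256 (the path here is a placeholder):

// 64-character hex digest instead of sha1_file()'s 40 characters.
$digest = hash_file('sha256', '/path/to/uploaded/file');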


Both should be fine. sha1 is a safer hash function than md5, which also means it's slower, which probably means you should use md5 :). You still want to use a salt to prevent plaintext/rainbow attacks in the case of very small files (don't make assumptions about what people decide to upload to your site). The performance difference will be negligible. You can still use it as a checksum as long as you know the salt.
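
A sketch of what that might look like with a fixed, site-wide salt (the salt value is a placeholder; note that a per-file salt would stop the hash from detecting duplicates):

// A fixed salt known to the application; identical files still hash identically.
$salt = 'some-long-secret-value';                    // hypothetical value
$checksum = sha1($salt . file_get_contents($file));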

With respect to scalability, I'd guess that you're likely going to be IO-bound, not CPU-bound, so I don't think calculating the checksum would add much overhead, especially if you do it on the stream as it's being uploaded.
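
A minimal sketch of hashing an upload incrementally as it streams in, using PHP's hash_init() / hash_update() / hash_final(); the input stream and chunk size are assumptions:

// Hash the upload in chunks instead of loading the whole file into memory.
$ctx = hash_init('sha1');
$in  = fopen('php://input', 'rb');        // hypothetical: raw upload stream

while (!feof($in)) {
    $chunk = fread($in, 8192);            // 8 KB at a time
    if ($chunk !== false) {
        hash_update($ctx, $chunk);
    }
}
fclose($in);

$digest = hash_final($ctx);               // 40-character hex checksum of the upload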

