
How safe is it to rely on hashes for file identification?

I am designing storage cloud software on top of a LAMP stack.

Files could have an internal ID, but there would be many advantages to storing them in the servers' filesystems not under an incrementing ID as the filename, but under a hash as the filename.

Hashes as identifiers in the database would also have a lot of advantages if the currently centralized database ever had to be sharded or decentralized, or if some sort of master-master high-availability environment had to be set up. But I am not sure about that yet.

Clients can store files under any string (usually some sort of path and filename).

This string is guaranteed to be unique, because the first level is something like "buckets" that users have to register, as in Amazon S3 and Google Storage.

My plan is to store files under a hash of the client-side defined path.

This way the storage server can serve the file directly, without asking the database which ID it is, because it can calculate the hash, and thus the filename, on the fly.
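For illustration, a minimal Python sketch of this scheme might look like the following; the bucket name, example path, and the two-level directory fan-out are assumptions for the example, not requirements of the scheme:

```python
import hashlib
import os

def storage_path(bucket: str, client_path: str, root: str = "/srv/storage") -> str:
    """Map a client-defined key (bucket + path) to a hash-derived filename."""
    key = f"{bucket}/{client_path}".encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()  # 40 hex characters for SHA-1
    # Fan out into subdirectories by the first hex characters so no single
    # directory grows too large (an assumption, not part of the question).
    return os.path.join(root, digest[:2], digest[2:4], digest)

# The server can recompute this on every request, with no database lookup.
print(storage_path("my-bucket", "photos/2023/beach.jpg"))
```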

But I am afraid of collisions. I am currently thinking about using SHA-1 hashes.

I have heard that Git also uses hashes as revision identifiers.

I know that the chance of a collision is really, really low, but it is possible.

I just cannot judge this. Would you or would you not rely on hashes for this purpose?

I could also use some normalized encoding of the path, maybe Base64, as the filename, but I really do not want that, because it could get messy, the filenames could get too long, and there could possibly be other complications.
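For comparison, a small Python snippet (with a made-up key) shows why the Base64 alternative gets unwieldy: the encoded name grows with the path, while a hash-based name has a fixed length:

```python
import base64
import hashlib

# Hypothetical client-defined key: bucket plus path, as described above.
key = "my-bucket/photos/2023/vacation/beach-sunset-panorama-final-v2.jpg"

# Base64 of the raw key grows with its length (roughly 4/3 of the original)...
encoded = base64.urlsafe_b64encode(key.encode("utf-8")).decode("ascii")
print(len(encoded), encoded)

# ...while a hash-derived filename always has the same, fixed length.
digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
print(len(digest), digest)  # always 40 hex characters for SHA-1
```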


Assuming you have a hash function with "perfect" properties (and assuming cryptographic hash functions approach that), the theory that applies is the same as for birthday attacks. What this says is that, given a maximum number of files, you can make the collision probability as small as you want by using a larger hash digest size. SHA-1 has 160 bits, so for any practical number of files the probability of a collision is going to be just about zero. If you look at the table in the link, you'll see that a 128-bit hash with 10^10 files has a collision probability of about 10^-18.
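To make the numbers concrete, here is a small Python check of the standard birthday-bound approximation p ≈ 1 − e^(−n(n−1)/2^(b+1)); it is only an illustrative calculation, not part of the original answer:

```python
import math

def collision_probability(n_files: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n*(n-1) / 2^(bits+1))."""
    exponent = n_files * (n_files - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(-exponent)  # expm1 keeps precision for tiny probabilities

print(collision_probability(10**10, 128))  # ~1.5e-19 for a 128-bit hash
print(collision_probability(10**10, 160))  # ~3.4e-29 for SHA-1's 160 bits
```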

As long as the probability is low enough, I think the solution is good. Compare it with the probability of the planet being hit by an asteroid, of undetectable errors in the disk drive, of bits flipping in your memory, etc. - as long as those probabilities are low enough, we don't worry about them because they'll "never" happen. Just take enough margin and make sure this isn't the weakest link.

One thing to be concerned about is the choice of the hash function and its possible vulnerabilities. Is there any other authentication in place, or does the user simply present a path and retrieve a file?

If you think about an attacker trying to brute force the scenario above, they would need to request on the order of 10^18 files before they could get at some other random file stored in the system (again assuming a 128-bit hash and 10^10 files; you'll have a lot fewer files and a longer hash). 10^18 is a pretty big number, and the speed at which you can brute force this is limited by the network and the server. A simple lock-the-user-out-after-x-attempts policy can completely close this hole (which is why many systems implement this sort of policy). Building a secure system is complicated and there will be many points to consider, but this sort of scheme can be perfectly secure.

Hope this is useful...

EDIT: Another way to think about this is that practically every encryption or authentication system relies on certain events having a very low probability for its security. E.g., I could be lucky and guess a prime factor of a 512-bit RSA key, but it is so unlikely that the system is considered very secure.


Whilst the probability of a collision might be vanishingly small, imagine serving a highly confidential file from one customer to their competitor just because there happens to be a hash collision.

= end of business

I'd rather use hashing for things that were less critical when collisions DO occur ;-)

If you have a database, store the files under GUIDs - so not an incrementing index, but a proper globally unique identifier. They work nicely when it comes to distributed shards / high availability etc.
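For illustration only (the answer does not include code), a minimal Python sketch of that GUID approach, using random version-4 UUIDs, could look like this:

```python
import uuid

# A random version-4 UUID: 122 random bits, generated independently on each
# shard or server, so no central incrementing counter is needed.
file_id = uuid.uuid4()

# Hypothetical usage: the database maps the client-defined path to this ID,
# and the file is stored on disk under the ID instead of under a hash.
print(str(file_id))         # e.g. '3f2c9e4a-...' (different on every run)
storage_name = file_id.hex  # 32 hex characters, safe to use as a filename
print(storage_name)
```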

Imagine the worst-case scenario and assume it will happen the week after you are featured in Wired magazine as an amazing startup... that's a good stress test for the algorithm.
