Do any common OS file systems use hashes to avoid storing the same content data more than once?

Many file storage systems use hashes to avoid duplication of the same file content data (among other reasons); e.g., Git uses SHA-1 and Dropbox uses SHA-256. The file names and dates can be different, but as long as the content produces the same hash, it is never stored more than once.
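To make the idea concrete, here is a minimal sketch of a content-addressed store in Python. The `ContentStore` class and its method names are made up for illustration and are not how Git or Dropbox are actually implemented; it only shows that two names with identical content end up sharing one stored blob.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, illustration only)."""

    def __init__(self):
        self.blobs = {}  # hex digest -> file content (bytes)
        self.names = {}  # file name -> hex digest

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # Store the content only if this digest has not been seen before.
        self.blobs.setdefault(digest, data)
        self.names[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.names[name]]


store = ContentStore()
store.put("report.txt", b"same bytes")
store.put("copy of report.txt", b"same bytes")
store.put("notes.txt", b"different bytes")
print(len(store.names), "names,", len(store.blobs), "stored blobs")
```

Three names go in, but only two blobs are kept, because the first two names hash to the same digest.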

It seems this would be a sensible thing to do in an OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?

This would, for the most part, eliminate the need for duplicate file finder utilities, because at that point the only space you would be saving would be for the file entry in the file system, which for most users is not enough to matter.

Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.

ZFS has supported deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup

Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by Solaris and *BSD), it is definitely worth looking at.

It would save space, but the time cost is prohibitive. The products you mention are already I/O-bound, so the computational cost of hashing is not a bottleneck for them. If you hashed at the filesystem level, every I/O operation, which is already slow, would get slower.

NTFS has single instance storage.

NetApp has supported deduplication (that's what it's called in the storage industry) in the WAFL filesystem (yeah, not your common filesystem) for a few years now. This is one of the most important features found in enterprise filesystems today. NetApp stands out because it supports deduplication on primary storage as well, whereas similar products support it only on backup or secondary storage because they are too slow for primary storage.

The amount of duplicate data in a large enterprise with thousands of users is staggering. A lot of those users store the same documents, source code, etc. across their home directories. Reports of 50-70% of data being deduplicated are common, which saves large enterprises lots of space and tons of money.

All of this means that if you create any common filesystem on a LUN exported by a NetApp filer, you get deduplication for free, no matter which filesystem is created in that LUN. Cheers. Find out how it works here and here.

btrfs supports online de-duplication of data at the block level, but an external tool is needed to do it. I'd recommend duperemove.
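As a rough illustration of what block-level deduplication means, the sketch below hashes fixed-size blocks and reports blocks that appear in more than one place. The function name, the 4096-byte block size, and the in-memory `files` dictionary are assumptions made for the example; this is not how btrfs or duperemove is implemented.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed block size, chosen only for this illustration


def find_duplicate_blocks(files, block_size=BLOCK_SIZE):
    """Map each block digest to the (file, offset) locations that hold it."""
    locations = defaultdict(list)
    for name, data in files.items():
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).hexdigest()
            locations[digest].append((name, offset))
    # Keep only blocks that occur in more than one place.
    return {d: locs for d, locs in locations.items() if len(locs) > 1}


files = {
    "a.bin": b"A" * 4096 + b"B" * 4096,
    "b.bin": b"C" * 4096 + b"B" * 4096,
}
duplicates = find_duplicate_blocks(files)
for locs in duplicates.values():
    print(locs)
```

Here the two files differ as a whole, so file-level hashing would find nothing to share, but their second blocks are identical and could be stored once.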

It would require a fair amount of work to make this work in a file system. First of all, a user might be creating a copy of a file, planning to edit one copy, while the other remains intact -- so when you eliminate the duplication, the hard link you created that way would have to give COW semantics.
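A minimal sketch of the copy-on-write behaviour described above, assuming a toy in-memory store with reference counts. `DedupStore` and its methods are hypothetical names, not any real filesystem's API.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store with copy-on-write names (hypothetical)."""

    def __init__(self):
        self.blobs = {}  # digest -> content
        self.refs = {}   # digest -> number of names pointing at it
        self.names = {}  # name -> digest

    def write(self, name, data):
        new = hashlib.sha256(data).hexdigest()
        old = self.names.get(name)
        if old == new:
            return  # content unchanged, nothing to do
        # Point this name at the new content, sharing it if it already exists.
        self.blobs.setdefault(new, data)
        self.refs[new] = self.refs.get(new, 0) + 1
        self.names[name] = new
        # Release the old content; drop it once no name refers to it.
        if old is not None:
            self.refs[old] -= 1
            if self.refs[old] == 0:
                del self.refs[old], self.blobs[old]

    def read(self, name):
        return self.blobs[self.names[name]]


fs = DedupStore()
fs.write("original.txt", b"draft v1")
fs.write("copy.txt", b"draft v1")  # shares one stored blob with the original
fs.write("copy.txt", b"draft v2")  # editing detaches the copy
print(fs.read("original.txt"), fs.read("copy.txt"), len(fs.blobs))
```

Editing the copy never touches the shared blob in place, so the original keeps its content; that is the guarantee a deduplicating filesystem would have to give.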

Second, the permissions on a file are often based on the directory into which that file's name is placed. You'd have to ensure that when you create your hidden hard link, the permissions are applied correctly based on the link, not just the location of the actual content.

Third, users are likely to be upset if they make (say) three copies of a file on physically separate media to guard against data loss from hardware failure, then find out that there was really only one copy of the file, so when that hardware failed, all three copies disappeared.

This strikes me as a bit like a second-system effect -- a solution to a problem long after the problem ceased to exist (or at least matter). With hard drives currently running less than US$100/terabyte, I find it hard to believe that this would save most people a whole dollar's worth of hard drive space. At that point, it's hard to imagine most people caring much.