Deduplicating identical files using hard links [closed]

开发者 https://www.devze.com 2022-12-11 00:09 Source: web
Closed. This question is off-topic. It is not currently accepting answers. Want to improve this question? Update the question so it's on-topic for Stack Overflow.

Closed 9 years ago.

I have a couple of identical files stored in more than one place on my hard disk. I figure I can save a lot of disk space by hard-linking them to point to the same file. I am a little worried about possibly disastrous side effects.

I guess it does not affect permissions, as those are stored in the respective directories, just like the file name, right? (Update: Apparently I guessed wrong. Permissions are shared, as Carl demonstrates in his answer.)

The biggest concern is changes to one file inadvertently also changing the other files. Read-only files should be safe then. And files that can be changed are also okay, if rather than updating within the existing file, a new file gets written. I believe most applications work that way, but probably not all.
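To make the distinction in the paragraph above concrete, here is a minimal Python sketch of the two save strategies. The function names (`overwrite_in_place`, `replace_via_rename`, `demo`) are made up for illustration, and it assumes a filesystem that supports hard links. It only models the two behaviours; it says nothing about which one a particular application uses.

```python
import os
import tempfile


def overwrite_in_place(path, data):
    """Rewrite the existing inode; every hard-linked name sees the change."""
    with open(path, "w") as f:  # "w" truncates and reuses the same inode
        f.write(data)


def replace_via_rename(path, data):
    """Write a new file, then rename it over `path`; other links keep the old data."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp, path)  # `path` now names a brand-new inode


def demo():
    """Return what the second name contains after each kind of save."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        with open(a, "w") as f:
            f.write("original")
        os.link(a, b)  # b is a second name for the same file

        overwrite_in_place(a, "edited in place")
        with open(b) as f:
            after_in_place = f.read()  # b changed along with a

        replace_via_rename(a, "replaced")
        with open(b) as f:
            after_replace = f.read()  # b still holds the previous content

        still_linked = os.path.samefile(a, b)  # the link is now broken
        return after_in_place, after_replace, still_linked
```

Calling `demo()` shows that the in-place edit leaks through to the other name, while the write-then-rename save silently un-shares the two files.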

Is there anything else to consider?

I am on OS X / HFS+.


Don't use hard links if you want changes to one file not to be reflected in other files. That's the whole point of hard links: multiple directory entries for the same file (same blocks on disk). Permissions are stored with the file itself (in its inode), not in the directory entry, so changing permissions through one name changes them for every name:

$ touch file
$ ln file link
$ ls -l
total 0
-rw-r--r--  2 owner group  0 Nov 11 16:44 file
-rw-r--r--  2 owner group  0 Nov 11 16:44 link
$ chmod 444 file
$ ls -l
total 0
-r--r--r--  2 owner group  0 Nov 11 16:44 file
-r--r--r--  2 owner group  0 Nov 11 16:44 link

From the ln man page:

A hard link to a file is indistinguishable from the original directory entry; any changes to a file are effectively independent of the name used to reference the file.
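The same check can be done programmatically. The following is a small Python sketch, not part of the original answer; `link_info` and `demo_shared_mode` are hypothetical helper names. It reads the inode number, link count, and permission bits through both names after a `chmod` on only one of them.

```python
import os
import stat
import tempfile


def link_info(path):
    """Return (inode number, hard-link count, permission bits) for `path`."""
    st = os.stat(path)
    return st.st_ino, st.st_nlink, stat.S_IMODE(st.st_mode)


def demo_shared_mode():
    """Mirror the shell session above: touch, ln, chmod, then inspect both names."""
    with tempfile.TemporaryDirectory() as d:
        file_path = os.path.join(d, "file")
        link_path = os.path.join(d, "link")
        open(file_path, "w").close()   # touch file
        os.link(file_path, link_path)  # ln file link
        os.chmod(file_path, 0o444)     # chmod 444 file
        return link_info(file_path), link_info(link_path)
```

Both tuples come back identical: one inode, a link count of 2, and the mode set through `file` visible through `link` as well.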


I wrote a little script to do just this. I'd only be concerned about permissions if your backup spans multiple users or includes system files.

I had a bunch of old backups on CDs and DVDs, many of which had a lot of redundant data on them. Rather than sift through all that info and delete the duplicates, I took the Time Machine route and made hard links between all the matching files (truly matching content: I took a SHA1 checksum of them all).

Now all my backup volumes look just like they would otherwise and most of the redundant files are history. The one hiccup is that a lot of media files store metadata in the file contents, so each version is slightly different. See this article for the Python code. No Warranties!!!
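The linked script is not reproduced here. What follows is only a rough sketch of the approach described above (group files by SHA1 of their content, then hard-link the matches), written under a few assumptions: the function names, the `dry_run` flag, and the `.dedupe-tmp` suffix are invented for illustration, empty files are skipped, and files are pre-grouped by size so that only plausible duplicates get hashed. Note that a replaced duplicate takes on the permissions, owner, and timestamps of the copy that is kept.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, chunk_size=1 << 20):
    """SHA1 of a file's content, read in chunks to keep memory use flat."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_and_link_duplicates(root, dry_run=True):
    """Hard-link files under `root` that have identical content.

    Returns a list of (kept, duplicate) pairs. With dry_run=True nothing
    on disk is modified.
    """
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # leave symlinks and special files alone
            size = os.path.getsize(path)
            if size:  # assumption: empty files are not worth linking
                by_size[size].append(path)

    linked = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in sorted(paths):
            by_hash[sha1_of(path)].append(path)
        for group in by_hash.values():
            keep = group[0]
            for dup in group[1:]:
                if os.path.samefile(keep, dup):
                    continue  # already names for the same inode
                if os.stat(keep).st_dev != os.stat(dup).st_dev:
                    continue  # hard links cannot cross filesystems
                linked.append((keep, dup))
                if not dry_run:
                    tmp = dup + ".dedupe-tmp"  # hypothetical temp name
                    os.link(keep, tmp)         # new name for the kept inode
                    os.replace(tmp, dup)       # swap it over the duplicate
    return linked
```

Running it once with `dry_run=True` and reading the returned pairs before letting it touch anything seems the prudent order, especially on a backup volume.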

Make sure you run mdimport your_backup_dir/ afterwards: Spotlight and Finder get a bit flustered when you do massive data manipulations. I've de-duplicated my 240 GB backup folder in this manner and it took about 45 minutes.

Also note that most OS X apps will break your hard links by saving to a new inode, while most UNIX-y apps will probably preserve them (except Emacs, I hear).


Hard links are not generally a best practice. Plain old soft/symbolic links (ln -s) should serve just as well.
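One practical difference to weigh when choosing between the two: a symbolic link stores a path, so it dangles once the original name is removed, whereas a hard link keeps the data alive. A short Python sketch of that behaviour (the function name `compare_links` is made up for illustration, and it assumes a POSIX-style filesystem where both link types can be created):

```python
import os
import tempfile


def compare_links():
    """Delete the original name and report what each kind of link still sees."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target.txt")
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")
        with open(target, "w") as f:
            f.write("data")
        os.link(target, hard)     # ln target.txt hard.txt
        os.symlink(target, soft)  # ln -s target.txt soft.txt
        os.remove(target)

        hard_readable = os.path.exists(hard)      # data is still reachable
        soft_readable = os.path.exists(soft)      # follows the link: now dangling
        soft_entry_left = os.path.lexists(soft)   # the link entry itself remains
        return hard_readable, soft_readable, soft_entry_left
```

So with symlinks one copy has to be designated the "real" file, and deleting or moving it breaks every other reference.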


If your primary goal is to "dedupe Time Machine backups" as you mention in one of the comments, then another option that avoids some of your concerns would be to eliminate the dupes from Time Machine using the Time Machine preferences. You can exclude at the directory or file level.
