Given two file trees A and B, is it possible to determine the shortest sequence of operations or a short sequence of operations that is necessary in order to transform A to B?
An operation can be:
- Create a new, empty folder
- Create a new file with any contents
- Delete a file
- Delete an empty folder
- Rename a file
- Rename a folder
- Move a file inside another existing folder
- Move a folder inside another existing folder
A and B are identical when they will have the same files with t开发者_运维知识库he same contents (or same size same CRC) and same name, in the same folder structure.
This question has been puzzling me for some time. For the moment I have the following, basic idea:
- Compute a database:
- Store file names and their CRCs
- Then, find all folders with no subfolders, and compute a CRC from the CRCs of the files they contain, and a size from the total size of the files they contain
- Ascend the tree to make a CRC for each parent folder
- Use the following loop having database A and database B:
- Compute A ∩ B and remove this intersection from both databases.
- Use an inner join to find matching CRCs in A and B, folders first, order by size desc
- while there is a result, use the first result to make a folder or file move (possibly creating new folders if necessary), remove from both database the source rows of the result. If there was a move then update CRCs of new location's parent folders in db A.
- Then remove all files and folders referenced in database A and create those referenced in database B.
However I think that this is really a suboptimal way to do that. What could you give me as advice?
Thank you!
This problem is a special case of the tree edit distance problem, for which finding an optimal solution is (unfortunately) known to be NP-hard. This means that there probably aren't any good, fast, and accurate algorithms for the general case.
That said, the paper I linked does contain several nice discussions of approximation algorithms and algorithms that work in restricted cases of the problem. You may find the discussion interesting, as it illuminates many of the issues that actually arise in solving this problem.
Hope this helps! And thanks for posting an awesome question!
You might want to check out tree-edit distance algorithms. I don't know if this will map neatly to your file system, but it might give you some ideas.
https://github.com/irskep/sleepytree (code and paper)
The first step to do is figure out which files need to be created/renamed/deleted.
- A) Create a hash map of the files of Tree A
- B) Go through the files of Tree B
- B.1) If there is an identical (name and contents) file in the hash map, then leave it alone
- B.2) If the contents are the identical but the name is different, rename the file to that in the hash map
- B.3) If the file contents doesn't exist in the hash map, remove it
- B.4) (if one of 1,2,3 was true) Remove the file from the hash map
The files left over in the hash map are those that must be created. This should be the last step, after the directory structure has been resolved.
After the file differences have been resolved, it get's rather tricky. I wouldn't be surprised if there isn't an efficient optimal solution to this problem (NP-complete/hard).
The difficulty lies in that the problem doesn't naturally subdivide itself. Each step you do must consider the entire file tree. I'll think about it some more.
EDIT: It seems that the most studied tree edit distance algorithms consider only creating/deleting nodes and relabeling of nodes. This isn't directly applicable to this problem because this problem allows moving entire subtrees around which makes it significantly more difficult. The current fastest run-time for the "easier" edit distance problem is O(N^3)
. I'd imagine the run-time for this will be significantly slower.
Helpful Links/References
An Optimal Decomposition Algorithm for Tree Edit Distance - Demaine, Mozes, Weimann
Enumerate all files in B and their associated sizes and checksums; sort by size/checksum.
Enumerate all files in A and their associated sizes and checksums; sort by size/checksum.
Now, doing an ordered list comparison, do the following:
a. for every file in A but not B, delete it.
b. for every file in B but not A, create it.
c. for every file in A and B, rename as many as you encounter from A to B, then make copies of the rest in B. If you are going to overwrite an existing file, save it off to the side in a separate list. If you find A in that list, use that as the source file.
Do the same for directories, deleting ones in A but not in B and adding those in B but not in A.
You iterate by checksum/size to ensure you never have to visit files twice or worry about deleting a file you will later need to resynchronize. I'm assuming you are trying to keep two directories in sync without unnecessary copying?
The overall complexity is O(N log N) plus however long it takes to read in all those files and their metadata.
This isn't the tree edit distance problem; it's more of a list synchronization problem that happens to generate a tree.
Only non trivial problem is moving folders and files. Renaming, deleting and creating is trivial and can be done in first step (or better in last when you finish).
You can then transform this problem into problem whit transforming tree both whit same leafs but different topology.
You decide which files will be moved from some folder/bucket and which files will be left in folder. Decision is based on number of same files in source and destination.
You apply same strategy to move folders in new topology.
I think that you should be near optimal or optimal if you forget about names of folders and think just about files and topology.
精彩评论