I have two tree structures that represent snapshots of a directory structure at two different points in time. Directories may have been added, removed or modified between the snapshots. I need to walk the two trees simultaneously and ma开发者_开发知识库rk the newer with the differences between the two - i.e. flag nodes as New, Modified, Deleted, Unchanged, adding any deleted nodes, so that the end result is the full superset of the two snapshots.
Typically, the trees are likely to be about 10 deep but very wide, containing hundreds of thousands, potentially millions of nodes. I want to skip large chunks of the trees by comparing hash codes at each node and only continuing to recurse where the codes don't match.
Is there an algorithm that could be my friend here? Any other advice?
Imagine unrolling each tree into a sorted list of files and directories. A method could obtain the next input from each unrolled tree from an interator for that tree. I could then compare the hash codes and skip ahead on one tree or another, note deletions, and note modifications.
The paper "Fast and Simple XML Tree Differencing by Sequence Alignment" by Lindholm, Kangasharju, and Tarkoma has some pointers:
1) rsync does the sort of thing you are interested in. Have a look at http://samba.anu.edu.au/ftp/rsync/rsync.html, and it might be worth checking to see if rsync --list-only does what it sounds like.
2) One trick is to turn the tree hierarchy into a sequence, by traversing it with a depth first search, and then compare the two sequences. Your idea about comparing hash codes can then be implemented by using a rolling hash (http://en.wikipedia.org/wiki/Rolling_hash).
I suspect that you will end up generating two entire sequences and then running some equivalent of diff or xdelta between them, rather than trying to do the job incrementally. A completely incremental approach might have problems when some sub-directory is moved a long way in the tree structure.
精彩评论