I'm trying to write a program to compare files and show the duplicates in Python. Anyone know any good functions or methods related to this? I am sorta lost...
If you're just looking for exact duplicates, hash both files (MD5 would do; the snippet here uses SHA-512) and see if the digests match:
    import hashlib

    # Open in binary mode so the bytes are hashed exactly as stored on disk
    with open('file1.avi', 'rb') as f:
        file1 = f.read()
    with open('file2.avi', 'rb') as f:
        file2 = f.read()

    if hashlib.sha512(file1).hexdigest() == hashlib.sha512(file2).hexdigest():
        print('They are the same')
    else:
        print('They are different')
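Video files can be large, so reading each one fully into memory may not be practical. Here is a minimal sketch of the same hash comparison done in chunks; the helper names `sha512_of_file` and `same_content` and the 1 MiB chunk size are my own choices for illustration, not anything standard.

```python
import hashlib

def sha512_of_file(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of a file, reading it in chunks."""
    h = hashlib.sha512()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def same_content(path_a, path_b):
    """True if both files hash to the same SHA-512 digest."""
    return sha512_of_file(path_a) == sha512_of_file(path_b)
```

Usage would be something like `same_content('file1.avi', 'file2.avi')`.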
If you need more than exact duplicates, I'd try OpenCV's Python bindings and check whether the videos match up frame by frame.
I would use os.walk to go through the file tree.
For each file, I would store the absolutepath+filename, indexed by file size and signature (first 16 bytes? Hash of first 512 bytes? Hash on full file?).
When finished, you end up with a dict of file sizes; for each size, a dict of file signatures; for each signature, a list of all files sharing that signature. If your file signature is not based on the full file, or has significant chance of collisions, you can then do a more in-depth comparison of just those colliding files.
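A rough sketch of that size-then-signature index, assuming the signature is a SHA-256 hash of the first 512 bytes (one of the options floated above). The function names `signature`, `index_tree`, and `candidate_groups` are hypothetical, and error handling for unreadable files is left out.

```python
import hashlib
import os
from collections import defaultdict

def signature(path, nbytes=512):
    """Hash of the first nbytes of the file (a cheap, partial signature)."""
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read(nbytes)).hexdigest()

def index_tree(root):
    """Build {size: {signature: [paths]}} for every regular file under root."""
    index = defaultdict(lambda: defaultdict(list))
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.abspath(os.path.join(dirpath, name))
            if not os.path.isfile(path):
                continue  # skip broken symlinks and other non-regular entries
            index[os.path.getsize(path)][signature(path)].append(path)
    return index

def candidate_groups(index):
    """Yield lists of paths that share both size and signature."""
    for by_signature in index.values():
        for paths in by_signature.values():
            if len(paths) > 1:
                yield paths
```

Because this signature only covers the first 512 bytes, each group from `candidate_groups(index_tree('.'))` is a set of candidates; a full hash or byte comparison on just those files would confirm the real duplicates.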
I would first start out comparing filenames and filesizes. If you find a match, you could then loop through the bytes of the file to compare them, although this is probably pretty intensive.
I do not know of a library that can do this in python.
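For what it's worth, the standard library's `filecmp.cmp(a, b, shallow=False)` does compare file contents. If you'd rather write the loop yourself, here is a minimal sketch of the size check followed by a byte comparison; the name `same_bytes` and the 64 KiB chunk size are just illustrative, and the filename check is left out. Reading in chunks and stopping at the first difference keeps it from being too intensive.

```python
import os

def same_bytes(path_a, path_b, chunk_size=64 * 1024):
    """Return True if the two files have identical contents."""
    # Different sizes can never match, so skip reading entirely
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # stop at the first differing chunk
            if not chunk_a:
                return True   # both files ended with no difference
```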