Comparing two files for identical lines where the order doesn't matter

Developer https://www.devze.com 2023-01-24 14:14 Source: web
I have two files (which could be up to 150,000 lines long; each line is 160 bytes), which I'd like to check to see if the lines in each are the same. diff won't work for me (directly) because a small percentage of the lines occur in a different order in the two files. Typically, a pair of lines will be transposed.

What's the best way to see if the same lines appear in both files, but where order doesn't matter? Thanks, Chris


Although it's a slightly expensive way to do it (for anything larger I'd rethink this), I'd fire up Python and do the following:

filename1 = "WHATEVER YOUR FILENAME IS"
filename2 = "WHATEVER THE OTHER ONE IS"

# Read each file into a set of lines (newline stripped).
# A set ignores both line order and duplicate lines.
with open(filename1) as f1, open(filename2) as f2:
    file1contents = set(line.rstrip("\n") for line in f1)
    file2contents = set(line.rstrip("\n") for line in f2)

if file1contents == file2contents:
    print("Yup they're the same!")
else:
    print("Nope, they differ.  In file2, not file1:\n")
    for diffLine in file2contents - file1contents:
        print("\t" + diffLine)
    print("\nIn file1, not file2:\n")
    for diffLine in file1contents - file2contents:
        print("\t" + diffLine)

That'll print the different lines if they differ.
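
One caveat with the set comparison: a set throws away duplicates, so a line that appears twice in one file and once in the other still counts as "the same". If that matters for your data, a rough sketch using collections.Counter (a multiset) might look like this. The function names line_counts and compare_as_multisets are made up for illustration, and I'm assuming plain text files with "\n" line endings.

```python
from collections import Counter


def line_counts(path):
    """Map each line (trailing newline stripped) to how often it occurs."""
    with open(path) as handle:
        return Counter(line.rstrip("\n") for line in handle)


def compare_as_multisets(path1, path2):
    """Return (surplus_in_1, surplus_in_2) as Counters.

    Counter subtraction keeps only positive counts, so each result holds
    the lines (with multiplicity) that one file has more of than the other.
    Both results are empty when the files hold the same lines, in any order.
    """
    counts1 = line_counts(path1)
    counts2 = line_counts(path2)
    return counts1 - counts2, counts2 - counts1
```

If both returned Counters are empty, the files match line for line regardless of order. Memory use is about the same as the set version, so it should be fine at 150,000 lines of 160 bytes.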


For only 150k lines, just hash each line of file one and store the hashes, ordered, in a lookup table. Then for each line in file two, hash it and perform the lookup.
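
A minimal sketch of that idea, under a few assumptions of mine: a Python set of digests stands in for the ordered lookup table (a sorted list with bisect would be the literal version), the hash is hashlib.blake2b with a 16-byte digest, and the helper names line_digest, build_lookup and lines_missing_from are invented for the example.

```python
import hashlib


def line_digest(line):
    """Hash one line (bytes, line ending stripped) to a 16-byte digest."""
    return hashlib.blake2b(line.rstrip(b"\r\n"), digest_size=16).digest()


def build_lookup(path):
    """Hash every line of the file and store the digests in a set."""
    with open(path, "rb") as handle:
        return {line_digest(line) for line in handle}


def lines_missing_from(lookup, path):
    """Return the lines of `path` whose digest is not in `lookup`."""
    missing = []
    with open(path, "rb") as handle:
        for line in handle:
            if line_digest(line) not in lookup:
                missing.append(line.rstrip(b"\r\n").decode(errors="replace"))
    return missing
```

This only checks one direction (lines of file two that are absent from file one), so you would probably run it a second time with the files swapped. Like any set-based check, it doesn't notice when a line is duplicated a different number of times in the two files.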


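Another Python script to do this:

#!/usr/bin/env python3
import sys

file1 = sys.argv[1]
file2 = sys.argv[2]

# Sort the lines of both files so that their original order no longer matters.
with open(file1) as f1, open(file2) as f2:
    lines1 = sorted(f1)
    lines2 = sorted(f2)

different = False
# zip stops at the shorter list, so this never indexes past the end.
for line1, line2 in zip(lines1, lines2):
    if line1 != line2:
        print('> %s' % line1.rstrip('\n'))
        print('< %s' % line2.rstrip('\n'))
        different = True

# Files with a different number of lines cannot be the same.
if len(lines1) != len(lines2):
    print('line counts differ: %d vs %d' % (len(lines1), len(lines2)))
    different = True

s = 'not like' if different else 'like'
print('file %s is %s file %s' % (file1, s, file2))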