what's the quickest way to take a list of files and a name of an output file and merge them into a single file while removing duplicate lines? something like
cat file1 file2 file3 | sort -u > out.file
in python.
prefer not to use system calls.
AND:
what's the quickest way to split a list in python into X chunks (list of lists) as equal as possible? (given a list and X.)
First:
lines = set()
for filename in filenames:
    with open(filename) as inF:
        lines.update(inF)  # each element keeps its trailing '\n'

with open(outfile, 'w') as outF:
    outF.write(''.join(lines))
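A quick end-to-end sketch of the approach above, using throwaway files in a temp directory (the file names and contents are made up for the demo):

```python
import os
import tempfile

# Build two small sample files (hypothetical data for illustration).
tmpdir = tempfile.mkdtemp()
filenames = [os.path.join(tmpdir, name) for name in ('a.txt', 'b.txt')]
with open(filenames[0], 'w') as f:
    f.write('apple\nbanana\n')
with open(filenames[1], 'w') as f:
    f.write('banana\ncherry\n')

# Merge the files, letting the set drop duplicate lines.
lines = set()
for filename in filenames:
    with open(filename) as inF:
        lines.update(inF)  # each element keeps its trailing '\n'

outfile = os.path.join(tmpdir, 'out.txt')
with open(outfile, 'w') as outF:
    outF.write(''.join(lines))
```

Note that a set does not preserve input order: the merged file contains each distinct line exactly once, but in arbitrary order.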
Second:
def chunk(bigList, x):
    # Spread the remainder over the leading chunks so that exactly x
    # chunks are produced and their lengths differ by at most one.
    q, r = divmod(len(bigList), x)
    start = 0
    for i in range(x):
        end = start + q + (1 if i < r else 0)
        yield bigList[start:end]
        start = end
listOfLists = list(chunk(bigList, x))
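As a self-contained check of the chunking idea (here using divmod to spread the remainder over the leading chunks, which is one way to keep lengths within one of each other; the function name is just for the demo):

```python
def chunk_even(big_list, x):
    # Yield exactly x chunks whose lengths differ by at most one.
    q, r = divmod(len(big_list), x)
    start = 0
    for i in range(x):
        end = start + q + (1 if i < r else 0)
        yield big_list[start:end]
        start = end

result = list(chunk_even(list(range(10)), 3))
print(result)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Naive stride slicing with `len(bigList) // x` as the step would instead yield four chunks here, since 10 is not divisible by 3.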
For the first:
lines = []
for filename in filenames:
    f = open(filename)
    lines.extend(f.read().splitlines())  # splitlines() avoids a trailing empty element
    f.close()
lines = list(set(lines))  # remove duplicates
f = open(outfile_name, 'w')
f.write('\n'.join(lines) + '\n')
f.close()
This assumes the files are of reasonable length, since all of their data is held in memory at once. If you want to preserve sort's side effect of ordering the lines, just add lines.sort() before the file is written.
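To reproduce `sort -u` end to end, dedupe and then sort before writing; a minimal sketch with made-up file names:

```python
import os
import tempfile

# Two small sample inputs (hypothetical data for illustration).
tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, n) for n in ('f1.txt', 'f2.txt')]
with open(paths[0], 'w') as f:
    f.write('pear\napple\n')
with open(paths[1], 'w') as f:
    f.write('apple\nmango\n')

lines = []
for path in paths:
    with open(path) as f:
        lines.extend(f.read().splitlines())

lines = sorted(set(lines))  # deduplicate, then order like sort -u
out_path = os.path.join(tmpdir, 'merged.txt')
with open(out_path, 'w') as f:
    f.write('\n'.join(lines) + '\n')
```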
And the second:
step_size = -(-len(orig_list) // num_chunks)  # ceiling division, so at most num_chunks slices
split_list = [orig_list[i:i+step_size] for i in range(0, len(orig_list), step_size)]
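With plain floor division, the range can produce more than num_chunks slices when the length isn't evenly divisible; ceiling division caps the count at num_chunks. A quick sketch:

```python
orig_list = list(range(10))
num_chunks = 3

# Ceiling division: step_size = ceil(len / num_chunks), so at most num_chunks slices.
step_size = -(-len(orig_list) // num_chunks)
split_list = [orig_list[i:i + step_size]
              for i in range(0, len(orig_list), step_size)]
print(split_list)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Unlike the divmod approach, this puts all the slack in the last chunk (4, 4, 2 rather than 4, 3, 3), but it is a one-liner and still yields exactly num_chunks chunks.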