I have a list of text files (file1.txt, file2.txt, file3.txt, …, filen.txt) that I need to shuffle together into one single big result file*.
Requirements:

1. The records of a given file need to be reversed before being shuffled.
2. The records of a given file should keep that reversed order in the destination file.
3. I don't know in advance how many files I need to shuffle, so the code should be as generic as possible (allowing the file names to be declared in a list, for example).
4. Files can have different sizes.

Example:
File1.txt
---------
File1Record1
File1Record2
File1Record3
File1Record4

File2.txt
---------
File2Record1
File2Record2

File3.txt
---------
File3Record1
File3Record2
File3Record3
File3Record4
File3Record5
the output should be something like this:
ResultFile.txt
--------------
File3Record5 -|
File2Record2  |
File1Record4  |
File3Record4 -|
File2Record1  |    File3 records are shuffled with the other records,
File1Record3  |--> are correctly "reversed", and they keep the correct
File3Record3 -|    ordering
File1Record2  |
File3Record2 -|
File1Record1  |
File3Record1 -|
* I'm not crazy; I have to import these files (blog posts) using the resultfile.txt as input
EDIT:
The result can have any order you want: completely or partially shuffled, uniformly interleaved, it does not matter. What does matter is that points 1. and 2. are both honoured.

What about this:
>>> import random
>>> l = [["1a","1b","1c","1d"], ["2a","2b"], ["3a","3b","3c","3d","3e"]]
>>> while l:
...     x = random.choice(l)
...     print x.pop(-1)
...     if not x:
...         l.remove(x)
1d
1c
2b
3e
2a
3d
1b
3c
3b
3a
1a
You could optimize it in various ways, but that's the general idea. This also works if you cannot read the files all at once and need to iterate over them because of memory restrictions. In that case:
- read a line from the file instead of popping from a list
- check for EOF instead of empty lists
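Applying those two changes, a minimal sketch could look like this (it assumes each input stream was already written in reversed order, e.g. by first running the file through `tac`):

```python
import io
import random

def stream_shuffle(streams, output):
    """Randomly interleave lines from streams whose records are already
    in the desired (reversed) order, reading one line at a time."""
    streams = list(streams)
    while streams:
        s = random.choice(streams)
        line = s.readline()
        if not line:            # EOF: this stream is exhausted
            streams.remove(s)
        else:
            output.write(line)

# in-memory streams standing in for the (pre-reversed) files
a = io.StringIO("1b\n1a\n")
b = io.StringIO("2c\n2b\n2a\n")
out = io.StringIO()
stream_shuffle([a, b], out)
```

Each stream's relative order is preserved because a line is only ever taken from the front of its own stream.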
You could try the following: in a first step, zip the reversed() items of the lists. Note that plain zip() truncates to the shortest list, so to pad the shorter lists with None you need itertools.zip_longest (izip_longest in Python 2):
zipped = itertools.zip_longest(reversed(lines1), reversed(lines2), reversed(lines3))
Then you can concatenate the tuples in zipped again:
lst = []
for triple in zipped:
    lst.extend(triple)
Finally you have to remove all the Nones added as padding (lst.remove(None) would only remove the first one):
lst = [line for line in lst if line is not None]
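Put together, a small self-contained run of that idea (with hypothetical record lists standing in for real file contents):

```python
from itertools import zip_longest

lines1 = ["File1Record1\n", "File1Record2\n", "File1Record3\n"]
lines2 = ["File2Record1\n"]

# zip the reversed lists; the shorter list is padded with None
zipped = zip_longest(reversed(lines1), reversed(lines2))

# flatten the tuples and drop the None padding
merged = [line for pair in zipped for line in pair if line is not None]
# merged == ['File1Record3\n', 'File2Record1\n', 'File1Record2\n', 'File1Record1\n']
```

This gives a fixed round-robin interleaving rather than a random shuffle, which the question's EDIT says is acceptable.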
import random

filenames = [ 'filename0', ... , 'filenameN' ]
files = [ open(fn, 'r') for fn in filenames ]
lines = [ f.readlines() for f in files ]
for f in files:
    f.close()
output = open('output', 'w')
while len(lines) > 0:
    l = random.choice(lines)
    if len(l) == 0:
        lines.remove(l)
    else:
        output.write(l.pop())
output.close()
One bit may seem magical here: the lines read from the files don't need explicit reversing, because when we write them to the output file we use list.pop(), which takes items from the end of the list (here, the contents of the file).
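For example, draining a list with pop() emits its items back to front:

```python
lines = ["Record1\n", "Record2\n", "Record3\n"]  # file order
drained = []
while lines:
    drained.append(lines.pop())   # pop() takes from the end
# drained == ['Record3\n', 'Record2\n', 'Record1\n']
```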
A simple solution might be to create a list of lists, and then pop a line off a random list until they're all exhausted:
>>> import random
>>> filerecords = [['File{0}Record{1}'.format(i, j) for j in range(5)] for i in range(5)]
>>> concatenation = []
>>> while any(filerecords):
...     selection = random.choice(filerecords)
...     if selection:
...         concatenation.append(selection.pop())
...     else:
...         filerecords.remove(selection)
...
>>> concatenation
['File1Record4', 'File3Record4', 'File0Record4', 'File0Record3', 'File0Record2',
'File4Record4', 'File0Record1', 'File3Record3', 'File4Record3', 'File0Record0',
'File4Record2', 'File2Record4', 'File4Record1', 'File3Record2', 'File4Record0',
'File2Record3', 'File1Record3', 'File2Record2', 'File2Record1', 'File3Record1',
'File3Record0', 'File1Record2', 'File2Record0', 'File1Record1', 'File1Record0']
I strongly recommend investing some time to read Generator Tricks for Systems Programmers (PDF). It's from a presentation at PyCon 08 and it deals specifically with processing arbitrarily large log files. The reversal aspect is an interesting wrinkle, but the rest of the presentation should speak directly to your problem.
filelist = (
    'file1.txt',
    'file2.txt',
    'file3.txt',
)

all_records = []
max_records = 0
for f in filelist:
    fp = open(f, 'r')
    records = fp.readlines()
    if len(records) > max_records:
        max_records = len(records)
    records.reverse()
    all_records.append(records)
    fp.close()

all_records.reverse()

res_fp = open('result.txt', 'w')
for i in range(max_records):
    for records in all_records:
        try:
            res_fp.write(records[i])
        except IndexError:
            pass
res_fp.close()
I'm not a python zen master, but here's my take.
import random

# You have to read everything into a list from at least one of the files.
with open("filename1", "r") as f:
    fin = f.readlines()

# names of all of the other files.
fls = ("filename2", "filename3")

for name in fls:  # iterate through the tuple
    curr = 0
    clen = len(fin)
    with open(name, "r") as fl:
        for line in fl:  # iterate through a file
            # advance to a randomly chosen insertion point, then insert;
            # inserting every line keeps this file's own ordering intact
            while curr < clen and not round(random.random()):
                curr += 1
            fin.insert(curr, line)
            clen = len(fin)
            curr += 1  # move past the line just inserted

# when you're *done*, reverse. It's easier.
fin.reverse()
Unfortunately with this it becomes obvious that this is a weighted distribution. This can be fixed by calculating the length of each of the files and weighting the probability of insertion accordingly. I'll see if I can't provide that at some later point.
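For the record, one way to do that weighting (a sketch, not the promised follow-up): pick each source with probability proportional to how many records it has left. That makes every interleaving equally likely while still emitting each file's records in reverse:

```python
import random

def weighted_merge(record_lists):
    """Merge lists of records into one list, emitting each source list
    back-to-front, picking each source with probability proportional
    to its remaining length (a uniformly random interleaving)."""
    stacks = [list(l) for l in record_lists]  # copies; pop() emits reversed
    out = []
    total = sum(len(s) for s in stacks)
    while total:
        r = random.randrange(total)           # a uniformly random remaining record
        for s in stacks:
            if r < len(s):
                out.append(s.pop())
                break
            r -= len(s)
        total -= 1
    return out
```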
A possible merging function is available in the standard library. It's intended to merge sorted lists to make sorted combined lists; garbage in, garbage out, but it does have the desired property of maintaining sublist order no matter what.
def merge_files(output, *inputs):
    # all parameters are opened files with appropriate modes
    from heapq import merge
    for line in merge(*(reversed(tuple(f)) for f in inputs)):
        output.write(line)