python: what is the quickest way to split a file into two files, each file having half of the number of lines in the original file, such that the lines in each of the two files are开发者_开发技巧 random?
for example: if the file is 1 2 3 4 5 6 7 8 9 10
it could be split into:
3 2 10 9 1
4 6 8 5 7
This sort of operation is often called "partition". Although there isn't a built-in partition function, I found this article: Partition in Python.
Given that definition, you can do this:
import random
def partition(l, pred):
yes, no = [], []
for e in l:
if pred(e):
yes.append(e)
else:
no.append(e)
return yes, no
lines = open("file.txt").readlines()
lines1, lines2 = partition(lines, lambda x: random.random() < 0.5)
Note that this won't necessarily exactly split the file in two, but it will on average.
You can just load the file, call random.shuffle
on the resulting list, and then split it into two files (untested code):
def shuffle_split(infilename, outfilename1, outfilename2):
from random import shuffle
with open(infilename, 'r') as f:
lines = f.readlines()
# append a newline in case the last line didn't end with one
lines[-1] = lines[-1].rstrip('\n') + '\n'
shuffle(lines)
with open(outfilename1, 'w') as f:
f.writelines(lines[:len(lines) // 2])
with open(outfilename2, 'w') as f:
f.writelines(lines[len(lines) // 2:])
random.shuffle
shuffles lines
in-place, and pretty much does all the work here. Python's sequence indexing system (e.g. lines[len(lines) // 2:]
) makes things really convenient.
I'm assuming that the file isn't huge, i.e. that it will fit comfortably in memory. If that's not the case, you'll need to do something a bit more fancy, probably using the linecache
module to read random line numbers from your input file. I think probably you would want to generate two lists of line numbers, using a similar technique to what's shown above.
update: changed /
to //
to evade issues when __future__.division
is enabled.
import random
data=open("file").readlines()
random.shuffle(data)
c=1
f=open("test."+str(c),"w")
for n,i in enumerate(data):
if n==len(data)/2:
c+=1
f.close()
f=open("test."+str(c),"w")
f.write(i)
Other version:
from random import shuffle
def shuffle_split(infilename, outfilename1, outfilename2):
with open(infilename, 'r') as f:
lines = f.read().splitlines()
shuffle(lines)
half_lines = len(lines) // 2
with open(outfilename1, 'w') as f:
f.write('\n'.join(lines.pop() for count in range(half_lines)))
with open(outfilename2, 'w') as f:
f.writelines('\n'.join(lines))
精彩评论