
python: quickest way to split a file into two files randomly

开发者 https://www.devze.com 2023-01-19 03:11 Source: Internet
Python: what is the quickest way to split a file into two files, each containing half of the lines of the original file, such that the lines in each of the two files are random?

for example: if the file is 1 2 3 4 5 6 7 8 9 10

it could be split into:

3 2 10 9 1

4 6 8 5 7


This sort of operation is often called "partition". Although there isn't a built-in partition function, I found this article: Partition in Python.

Given that definition, you can do this:

import random

def partition(l, pred):
    yes, no = [], []
    for e in l:
        if pred(e):
            yes.append(e)
        else:
            no.append(e)
    return yes, no

# read all lines, closing the file as soon as we are done with it
with open("file.txt") as f:
    lines = f.readlines()
lines1, lines2 = partition(lines, lambda x: random.random() < 0.5)

Note that this won't necessarily exactly split the file in two, but it will on average.
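
If an exact half/half split is wanted while still keeping the original line order inside each part, one option is to pick the positions at random instead of flipping a coin per line. The following is only a minimal sketch: partition_exact and its rng parameter are hypothetical names, not part of the linked article.

```python
import random

def partition_exact(lines, rng=random):
    """Split lines into two lists; the first gets exactly len(lines) // 2 items.

    Hypothetical helper: positions are chosen at random, and the relative
    order of the lines is preserved within each part.
    """
    # choose which line positions go into the first part
    chosen = set(rng.sample(range(len(lines)), len(lines) // 2))
    first = [line for i, line in enumerate(lines) if i in chosen]
    second = [line for i, line in enumerate(lines) if i not in chosen]
    return first, second
```

With an odd number of lines, the second list ends up one line longer than the first.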


You can just load the file, call random.shuffle on the resulting list, and then split it into two files (untested code):

def shuffle_split(infilename, outfilename1, outfilename2):
    from random import shuffle

    with open(infilename, 'r') as f:
        lines = f.readlines()

    # make sure the last line ends with a newline (skipped for an empty file)
    if lines:
        lines[-1] = lines[-1].rstrip('\n') + '\n'

    shuffle(lines)

    with open(outfilename1, 'w') as f:
        f.writelines(lines[:len(lines) // 2])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[len(lines) // 2:])

random.shuffle shuffles lines in place and does pretty much all the work here. Python's slice notation (e.g. lines[len(lines) // 2:]) makes the split itself very convenient.

I'm assuming that the file isn't huge, i.e. that it will fit comfortably in memory. If that's not the case, you'll need to do something a bit fancier, probably using the linecache module to read random line numbers from your input file. You would probably want to generate two lists of line numbers, using a technique similar to the one shown above.
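
For the large-file case, here is one possible sketch of the "two lists of line positions" idea. Note that it swaps in a different technique from the one suggested above: instead of linecache (which, as far as I know, reads and caches all of a file's lines itself), it records the byte offset of each line in a first pass and then seeks to those offsets in random order. The function name shuffle_split_by_offset and the seed parameter are made up for illustration, and the code assumes the input file does not change between the two passes.

```python
import random

def shuffle_split_by_offset(infilename, outfilename1, outfilename2, seed=None):
    """Randomly split a file in two, keeping only line offsets in memory."""
    rng = random.Random(seed)

    # first pass: remember where each line starts (binary mode, so the
    # offsets are plain byte counts)
    offsets = []
    with open(infilename, 'rb') as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)

    rng.shuffle(offsets)
    half = len(offsets) // 2

    # second pass: jump to each chosen offset and copy that one line
    with open(infilename, 'rb') as src:
        for outname, chunk in ((outfilename1, offsets[:half]),
                               (outfilename2, offsets[half:])):
            with open(outname, 'wb') as out:
                for off in chunk:
                    src.seek(off)
                    line = src.readline()
                    # the last line of the input may lack a newline
                    if not line.endswith(b'\n'):
                        line += b'\n'
                    out.write(line)
```

The trade-off is many small seeks, so this is likely slower than the in-memory version; it only makes sense when the lines themselves do not fit in memory but a list of integers (one per line) does.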

Update: changed / to // to avoid issues when __future__.division is enabled.
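
To illustrate why // matters for the slices: under true division (Python 3, or Python 2 with from __future__ import division), / returns a float, and a float cannot be used as a slice index. A small sketch with made-up sample data:

```python
lines = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n']

# floor division always yields an int, so it is safe as a slice index
half = len(lines) // 2
first, second = lines[:half], lines[half:]

try:
    lines[:len(lines) / 2]      # true division yields a float (2.5 here)
    float_slice_ok = True
except TypeError:
    # Python 3 rejects non-integer slice indices
    float_slice_ok = False
```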


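import random

with open("file") as f:
    data = f.readlines()
random.shuffle(data)

# integer division: len(data) / 2 is a float in Python 3 and would
# never equal n when the number of lines is odd
half = len(data) // 2
c = 1
f = open("test." + str(c), "w")
for n, line in enumerate(data):
    if n == half:
        # halfway through: switch to the second output file
        c += 1
        f.close()
        f = open("test." + str(c), "w")
    f.write(line)
f.close()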


Another version, which strips the line endings on read and rejoins the lines on write:

from random import shuffle

def shuffle_split(infilename, outfilename1, outfilename2):
    with open(infilename, 'r') as f:
        lines = f.read().splitlines()

    shuffle(lines)
    half_lines = len(lines) // 2

    with open(outfilename1, 'w') as f:
        f.write('\n'.join(lines.pop() for count in range(half_lines)))
    with open(outfilename2, 'w') as f:
        # the joined result is a single string, so write() is the right call
        f.write('\n'.join(lines))