I'm working with a large CSV file. How can I take a random sample of rows, say 200 in total, and recombine them into a CSV with the same structure as the original?
The procedure I would use is as follows (a sketch follows the list):
- Generate 200 unique numbers between 0 and the number of lines in the CSV file.
- Read each line of the CSV file and keep track of which line number you are reading. If its line number matches one of the numbers above, then output it.
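A minimal sketch of that two-pass procedure in Python; the filenames 'big.csv' and 'sample.csv' are placeholders:

import random

filename = 'big.csv'  # placeholder name for your large CSV

# First pass: count the lines in the file.
with open(filename) as f:
    num_lines = sum(1 for _ in f)

# Generate 200 unique line numbers.
chosen = set(random.sample(range(num_lines), 200))

# Second pass: output each line whose number was chosen.
with open(filename) as f, open('sample.csv', 'w') as out:
    for line_number, line in enumerate(f):
        if line_number in chosen:
            out.write(line)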
Use the reservoir sampling technique, which requires neither that all records fit in memory nor that the actual number of records be known in advance. With it, you stream in your records one by one and probabilistically select them into the sample. Once the stream is exhausted, output the final sample records. The technique guarantees that each record in the stream has the same probability of being in the final sample; that is, it produces a simple random sample.
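A minimal sketch of reservoir sampling (Algorithm R) in Python; the filenames 'big.csv' and 'sample.csv' are placeholders:

import random

def reservoir_sample(iterable, k):
    """Return a simple random sample of k items from a stream."""
    sample = []
    for i, item in enumerate(iterable):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            # Replace a reservoir element with decreasing probability k/(i+1).
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

with open('big.csv') as f:  # placeholder filename
    lines = reservoir_sample(f, 200)
with open('sample.csv', 'w') as out:
    out.writelines(lines)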
You can use the random module's random.sample method to pick a random subset of the line offsets, as shown below.
import random

# Read the file once and build a list of line offsets.
# Courtesy: Adam Rosenfield's tip about how to read a HUGE text file:
# http://stackoverflow.com/questions/620367/
line_offset = []
offset = 0
with open('your_file', 'rb') as f:  # binary mode keeps the offsets byte-accurate
    for line in f:
        line_offset.append(offset)
        offset += len(line)

# Part where you pick the random lines and copy them to your new file.
# My 2 cents.
rand_offsets = random.sample(line_offset, 200)
with open('your_file', 'rb') as f, open('new_file', 'wb') as out:
    for k in rand_offsets:
        f.seek(k)
        out.write(f.readline())
You could also try linecache if it works for you, but since linecache reads the entire file into memory, I'm not sure how well it would work for a 6GB file.
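If you do want to try linecache, a minimal sketch might look like this; note that linecache uses 1-based line numbers, and 'big.csv' and 'sample.csv' are placeholders:

import linecache
import random

filename = 'big.csv'
with open(filename) as f:
    num_lines = sum(1 for _ in f)

# linecache.getline takes a 1-based line number and caches the whole file.
with open('sample.csv', 'w') as out:
    for n in random.sample(range(1, num_lines + 1), 200):
        out.write(linecache.getline(filename, n))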