[Python 3.1]
My program takes a long time to run just because of the pickle.load call on a huge data structure. This makes debugging very annoying and time-consuming: every time I make a small change, I need to wait a few minutes to see whether the regression tests pass.
I would like to replace pickle with an in-memory data structure.
I thought of starting a Python program in one process and connecting to it from another, but I am afraid the inter-process communication overhead will be huge.
Perhaps I could run a Python function from the interpreter to load the structure into memory. Then, as I modify the rest of the program, I can run it many times (without exiting the interpreter in between). This seems like it would work, but I'm not sure whether I will suffer any overhead or other problems.
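For illustration, here is a minimal sketch of that "load once, reload the code" idea. The names slow_load, analysis_demo and run are hypothetical stand-ins (slow_load plays the role of the expensive pickle.load), and the module is written to a temporary directory only so that the example is self-contained. It assumes the code you are editing lives in a module separate from the one holding the data.

```python
import importlib
import os
import sys
import tempfile


def slow_load():
    # Hypothetical stand-in for the expensive pickle.load on the huge file.
    return list(range(5))


def demo_reload():
    data = slow_load()  # done once per interpreter session
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "analysis_demo.py")
        sys.path.insert(0, tmp)
        try:
            # First version of the code under development.
            with open(path, "w") as f:
                f.write("def run(data):\n    return sum(data)\n")
            importlib.invalidate_caches()
            mod = importlib.import_module("analysis_demo")
            first = mod.run(data)

            # Edit the code, then reload it; the loaded data is untouched.
            with open(path, "w") as f:
                f.write("def run(data):\n    return len(data) * 100\n")
            mod = importlib.reload(mod)
            second = mod.run(data)
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("analysis_demo", None)
    return first, second
```

In an interactive session the equivalent would be to load the data once, then call importlib.reload(your_module) after each edit and rerun the tests against the same in-memory object. One caveat: objects that were created from classes defined in a reloaded module still refer to the old class definitions.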
You can use mmap to open a view on the same file in multiple processes, with access at almost the speed of memory once the file is loaded.
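A minimal sketch of what this could look like follows. Note that a memory map exposes raw bytes, not Python objects, so this only helps if the data can be stored in a flat binary layout. The layout here (one little-endian float64 per record), the file name values.bin and the helper names are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: one little-endian float64 per record.
RECORD = struct.Struct("<d")


def write_values(path, values):
    """Store the values once as fixed-size binary records."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_value(mm, index):
    """Read one record directly from the mapping, without unpickling."""
    return RECORD.unpack_from(mm, index * RECORD.size)[0]


def demo_mmap():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "values.bin")
        write_values(path, [0.5, 1.5, 2.5, 3.5])
        with open(path, "rb") as f:
            # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                count = len(mm) // RECORD.size
                return count, read_value(mm, 2)
```

Several processes can map the same file this way; the operating system shares the cached pages between them, so only the first access to each page pays the disk cost.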
First, you can pickle the different parts of the whole object separately, like this:
# gen_objects.py
import random
import pickle


class BigBadObject(object):
    def __init__(self):
        # A dictionary of random size with random keys and values.
        self.a_dictionary = {}
        for x in range(random.randint(1, 1000)):
            self.a_dictionary[random.randint(1, 98675676)] = random.random()
        # A list of random floats.
        self.a_list = []
        for x in range(random.randint(1000, 10000)):
            self.a_list.append(random.random())
        # A random string of uppercase letters.
        self.a_string = ''.join([chr(random.randint(65, 90))
                                 for x in range(random.randint(100, 10000))])


if __name__ == "__main__":
    # Dump each object separately so it can be loaded back one at a time.
    with open('lotsa_objects.pickled', 'wb') as output:
        for i in range(10000):
            pickle.dump(BigBadObject(), output, pickle.HIGHEST_PROTOCOL)
Once you have generated the big file as a sequence of separately pickled objects, you can read it back with a Python program in which a background thread loads the objects one at a time while the main thread processes them:
# reader.py
from threading import Thread
from queue import Queue, Empty
import pickle
# Needed so that pickle can find the class when unpickling.
from gen_objects import BigBadObject


class Reader(Thread):
    """Background thread that unpickles objects one by one into a queue."""

    def __init__(self, filename, q):
        Thread.__init__(self)
        self._file = open(filename, 'rb')
        self._queue = q

    def run(self):
        with self._file:  # close the file when the thread finishes
            while True:
                try:
                    one_object = pickle.load(self._file)
                except EOFError:
                    break
                self._queue.put(one_object)


class uncached(object):
    def __init__(self, filename, queue_size=100):
        self._my_queue = Queue(maxsize=queue_size)
        self._my_reader = Reader(filename, self._my_queue)
        self._my_reader.start()

    def __iter__(self):
        # Loop until the reader thread is done AND the queue has been drained.
        while self._my_reader.is_alive() or not self._my_queue.empty():
            try:
                print("Getting from the queue. Queue size=", self._my_queue.qsize())
                o = self._my_queue.get(True, timeout=0.1)  # Block for up to 0.1 seconds
                yield o
            except Empty:
                pass


# Compute an average of all the numbers in a_lists, just for show.
list_avg = 0.0
list_count = 0
for x in uncached('lotsa_objects.pickled'):
    list_avg += sum(x.a_list)
    list_count += len(x.a_list)
print("Average: ", list_avg / list_count)
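Reading the pickle file this way lets processing start as soon as the first object arrives, instead of after the whole file has been loaded. Note that this is a single reader thread with a queue of up to 100 buffered objects, not 100 parallel threads, so the 1% figure should be read as the time until the first results appear rather than the total load time. The gain comes from overlapping loading with processing and from never holding all the objects in memory at once.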