Is I/O more efficient, thanks to the Linux disk buffer cache, when frequently accessed Python objects are stored as separate cPickle files rather than all together in one large shelf?
Does the disk buffer cache operate differently in these two scenarios with respect to efficiency?
There may be thousands of large files (generally around 100 MB, but sometimes 1 GB), but plenty of RAM (e.g. 64 GB).
I don't know of any theoretical way to decide which method is faster, and even if I did, I'm not sure I would trust it. So let's write some code and test it.
If we package our pickle/shelve managers in classes with a common interface, then it will be easy to swap them in and out of your code. So if at some future point you discover one is better than the other (or discover some even better way) all you have to do is write a class with the same interface and you'll be able to plug the new class into your code with very little modification to anything else.
test.py:
import cPickle
import shelve
import os

class PickleManager(object):
    # One pickle file per object; `name` is the file path.
    def store(self,name,value):
        # Pickle data should be written and read in binary mode.
        with open(name,'wb') as f:
            cPickle.dump(value,f)
    def load(self,name):
        with open(name,'rb') as f:
            return cPickle.load(f)

class ShelveManager(object):
    # All objects in one shelf; `name` is the shelf key.
    def __enter__(self):
        if os.path.exists(self.fname):
            self.shelf=shelve.open(self.fname)
        else:
            self.shelf=shelve.open(self.fname,'n')
        return self
    def __exit__(self,exc_type,exc_value,traceback):
        self.shelf.close()
    def __init__(self,fname):
        self.fname=fname
    def store(self,name,value):
        self.shelf[name]=value
    def load(self,name):
        return self.shelf[name]

def write(manager):
    for i in range(100):
        fname='/tmp/{i}.dat'.format(i=i)
        data='The sky is so blue'*100
        manager.store(fname,data)

def read(manager):
    for i in range(100):
        fname='/tmp/{i}.dat'.format(i=i)
        manager.load(fname)
Normally, you'd use PickleManager like this:
manager=PickleManager()
manager.load(...)
manager.store(...)
while you'd use the ShelveManager like this:
with ShelveManager('/tmp/shelve.dat') as manager:
    manager.load(...)
    manager.store(...)
But to test performance, you could do something like this:
python -mtimeit -s'import test' 'with test.ShelveManager("/tmp/shelve.dat") as s: test.read(s)'
python -mtimeit -s'import test' 'test.read(test.PickleManager())'
python -mtimeit -s'import test' 'with test.ShelveManager("/tmp/shelve.dat") as s: test.write(s)'
python -mtimeit -s'import test' 'test.write(test.PickleManager())'
At least on my machine, the results came out like this:
                 read (ms)   write (ms)
PickleManager    9.26        7.92
ShelveManager    5.32        30.9
So it looks like ShelveManager may be faster at reading, but PickleManager may be faster at writing.
Be sure to run these tests yourself. Timeit results can vary due to version of Python, OS, filesystem type, hardware, etc.
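If you would rather run the comparison from a script than from the shell, here is a minimal sketch of the same test ported to Python 3, where the pickle module takes the place of cPickle. It assumes a temporary directory instead of /tmp, and the helper names best_ms, shelve_write, shelve_read and run_benchmark are hypothetical ones introduced only for this illustration. The numbers it prints will not necessarily match the table above.

```python
import os
import pickle
import shelve
import tempfile
import timeit


class PickleManager:
    """One pickle file per object; `name` is used as the file path."""

    def store(self, name, value):
        with open(name, 'wb') as f:
            pickle.dump(value, f, protocol=pickle.HIGHEST_PROTOCOL)

    def load(self, name):
        with open(name, 'rb') as f:
            return pickle.load(f)


class ShelveManager:
    """All objects in one shelf; `name` is used as the shelf key."""

    def __init__(self, fname):
        self.fname = fname

    def __enter__(self):
        # The default flag 'c' creates the shelf if it does not exist yet.
        self.shelf = shelve.open(self.fname)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.shelf.close()

    def store(self, name, value):
        self.shelf[name] = value

    def load(self, name):
        return self.shelf[name]


def write(manager, names, data):
    for name in names:
        manager.store(name, data)


def read(manager, names):
    for name in names:
        manager.load(name)


def shelve_write(path, names, data):
    with ShelveManager(path) as s:
        write(s, names, data)


def shelve_read(path, names):
    with ShelveManager(path) as s:
        read(s, names)


def best_ms(func, repeat=3, number=5):
    """Best time per call, in milliseconds, over `repeat` runs of `number` calls."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number * 1000


def run_benchmark(n_objects=100):
    """Return {manager name: (read_ms, write_ms)} for the small test payload."""
    data = 'The sky is so blue' * 100
    with tempfile.TemporaryDirectory() as tmp:
        names = [os.path.join(tmp, '{}.dat'.format(i)) for i in range(n_objects)]
        shelf_path = os.path.join(tmp, 'shelve.dat')
        pm = PickleManager()
        # Writes are timed first so that the reads have something to load.
        pickle_write = best_ms(lambda: write(pm, names, data))
        pickle_read = best_ms(lambda: read(pm, names))
        shelf_write = best_ms(lambda: shelve_write(shelf_path, names, data))
        shelf_read = best_ms(lambda: shelve_read(shelf_path, names))
    return {
        'PickleManager': (pickle_read, pickle_write),
        'ShelveManager': (shelf_read, shelf_write),
    }


if __name__ == '__main__':
    for name, (read_ms, write_ms) in run_benchmark().items():
        print('{:<15} read {:8.2f} ms   write {:8.2f} ms'.format(name, read_ms, write_ms))
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes, but it also means the reads are almost certainly served from the page cache, not from disk.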
Also, note my write and read functions generate very small files. You'll want to test this on data more similar to your use case.
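One way to move toward more realistic sizes is to parameterize the payload. The sketch below is written for Python 3 and uses random bytes as a stand-in for your objects, which is an assumption: real objects may pickle quite differently. The names make_payload, time_roundtrip and sweep are hypothetical helpers for this illustration. The default sizes are kept small so it runs quickly; raise them toward 100 MB or 1 GB to approximate the files described in the question.

```python
import os
import pickle
import tempfile
import time


def make_payload(size_bytes):
    """Return `size_bytes` of random bytes as a stand-in for a real object."""
    return os.urandom(size_bytes)


def time_roundtrip(path, payload):
    """Pickle `payload` to `path`, load it back, and return (write_s, read_s)."""
    t0 = time.perf_counter()
    with open(path, 'wb') as f:
        pickle.dump(payload, f, protocol=pickle.HIGHEST_PROTOCOL)
    t1 = time.perf_counter()
    with open(path, 'rb') as f:
        loaded = pickle.load(f)
    t2 = time.perf_counter()
    if loaded != payload:
        raise ValueError('round trip changed the payload')
    return t1 - t0, t2 - t1


def sweep(sizes):
    """Time one pickle round trip per size; returns (size, write_s, read_s) rows."""
    rows = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'payload.pkl')
        for size in sizes:
            write_s, read_s = time_roundtrip(path, make_payload(size))
            rows.append((size, write_s, read_s))
    return rows


if __name__ == '__main__':
    # 1 KiB, 1 MiB and 16 MiB; increase these to match your own data.
    for size, write_s, read_s in sweep([1 << 10, 1 << 20, 16 << 20]):
        print('{:>10} bytes  write {:.4f} s  read {:.4f} s'.format(size, write_s, read_s))
```

Because each read immediately follows the write of the same file, it will most likely be served from the page cache, so treat the read column as a warm-cache figure.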