I have some data stored in a DB that I want to process. DB access is painfully slow, so I decided to load all data in a dictionary before any processing. However, due to the huge size of the data stored, I get an out of memory error (I see more than 2 gigs being used). So I decided to use a disk data structure, and found out that using shelve is an option. Here's what I do (pseudo python code)
def loadData():
if (#dict exists on disk):
d = shelve.open(name)
return d
else:
d = shelve.open(name, writeback=True)
#access DB and write data to开发者_开发百科 dict
# d[key] = value
# or for mutable values
# oldValue = d[key]
# newValue = f(oldValue)
# d[key] = newValue
d.close()
d = shelve.open(name, writeback=True)
return d
I have a couple of questions,
1) Do I really need the writeBack=True? What does it do?
2) I still get an OutofMemory exception, since I do not exercise any control over when the data is being written to disk. How do I do that? I tried doing a sync() every few iterations but that didn't help either.
Thanks!
writeback=True
forces the shelf to keep in-memory any item ever fetched, and write them back when the shelf is closed. So, it consumes much more memory, and slows down closing.
The advantage of the parameter is that, with it, you don't need the contorted code you show in your comment for mutable items whose mutator is a method -- just
shelf['foobar'].append(23)
works (if shelf
was opened with writeback enabled), assuming the item at key 'foobar'
is a list of course, while it would silently be a no-operation (leaving the item on disk unchanged) if shelf
was opened without writeback -- in the latter case you actually do need to code
thelist = shelf['foobar']
thelist.append(23)
shekf['foobar'] = thelist
in your comment's spirit -- which is stylistically somewhat of a bummer.
However, since you are having memory problems, I definitely recommend not using this dubious writeback option. I think I can call it "dubious" since I was the one proposing and first implementing it, but that was many years ago, and I've mostly repented of doing it -- it generales more confusion (as your Q evidences) than it allows elegance and handiness in moving code originally written to work with dicts (which would use the first idiom, not the second, and thus need rewriting in order to be usable with shelves without traceback). Ah well, sorry, it did seem a good idea at the time.
Using the sqlite3
module is probably your best choice here. You might be able to use sqlite entirely in memory anyway since its memory footprint might be a bit smaller than using python objects anyway. It's generally a better choice than using shelve
anyway; shelve
uses pickle
underneath, which is rarely what you want.
Hell, you could just convert your entire existing database to a sqlite database. sqlite is nice and fast.
精彩评论