BACKGROUND
The issue I'm working with is as follows:
Within the context of an experiment I am designing for my research, I produce a large number of large (length 4M) arrays which are somewhat sparse, and thereby could be stored as scipy.sparse.lil_matrix instances, or simply as scipy.array instances (the space gain/loss isn't the issue here).
Each of these arrays must be paired with a string (namely a word) for the data to make sense, as they are semantic vectors representing the meaning of that string. I need to preserve this pairing.
The vectors for each word in a list are built one-by-one, and stored to disk before moving on to the next word.
They must be stored to disk in a manner which allows them to be retrieved with dictionary-like syntax. For example, if all the words are stored in a DB-like file, I need to be able to open this file and do things like vector = wordDB[word].
CURRENT APPROACH
What I'm currently doing:
Using shelve to open a shelf named wordDB.
Each time the vector (currently using lil_matrix from scipy.sparse) for a word is built, storing the vector in the shelf: wordDB[word] = vector
When I need to use the vectors during the evaluation, I'll do the reverse: open the shelf, and then recall vectors by doing vector = wordDB[word] for each word, as they are needed, so that not all the vectors need be held in RAM (which would be impossible).
The above 'solution' fits my needs in terms of solving the problem as specified. The issue is simply that when I use this method to build and store vectors for a large number of words, I run out of disk space.
This is, as far as I can tell, because shelve pickles the data being stored, which is not an efficient way of storing large arrays, thus rendering this storage problem intractable with shelve for the number of words I need to deal with.
PROBLEM
The question is thus: is there a way of serializing my set of arrays which will:
Save the arrays themselves in compressed binary format akin to the .npy files generated by scipy.save?
Meet my requirement that the data be readable from disk as a dictionary, maintaining the association between words and arrays?
As JoshAdel already suggested, I would go for HDF5. The simplest way is to use h5py:
http://h5py.alfven.org/
You can attach several attributes to an array with a dictionary-like syntax:
dset.attrs["Name"] = "My Dataset"
where dset is your dataset, which can be sliced exactly like a NumPy array but, behind the scenes, does not load the whole array into memory.
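A minimal sketch of how this could look, assuming h5py and NumPy are installed. The file name, the helper names store_vector and load_vector, and the short vector length are made up for illustration. Each word is used as the name of a gzip-compressed dataset, so lookup is by word:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical path and size, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "wordDB.h5")
n = 1000  # stand-in for the real vector length (about 4M)


def store_vector(db_path, word, vector):
    """Append one word's vector to the HDF5 file as a gzip-compressed dataset."""
    with h5py.File(db_path, "a") as f:
        dset = f.create_dataset(word, data=vector, compression="gzip")
        dset.attrs["word"] = word  # attributes use dictionary-like syntax


def load_vector(db_path, word):
    """Read back a single word's vector without loading the others."""
    with h5py.File(db_path, "r") as f:
        return f[word][...]


vec = np.zeros(n)
vec[[3, 17, 256]] = 1.0  # mostly zeros, so gzip has a lot to compress
store_vector(path, "cat", vec)
restored = load_vector(path, "cat")
```

Note that HDF5 treats "/" in a name as a group separator, so words containing a slash would need to be escaped first.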
I would suggest using scipy.save and keeping a dictionary that maps each word to the name of its file.
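One way this could be sketched, assuming scipy.save refers to the NumPy save function (called here as np.save). The class name NpyWordStore, the JSON index file and the file-naming scheme are all hypothetical choices, not part of any library:

```python
import json
import os
import tempfile

import numpy as np


class NpyWordStore:
    """Hypothetical dict-like wrapper: one .npy file per word plus a JSON index."""

    def __init__(self, directory):
        self.directory = directory
        self.index_path = os.path.join(directory, "index.json")
        os.makedirs(directory, exist_ok=True)
        if os.path.exists(self.index_path):
            with open(self.index_path) as f:
                self.index = json.load(f)
        else:
            self.index = {}  # word -> file name

    def __setitem__(self, word, vector):
        # Files are numbered, so any word is safe to use as a key.
        fname = self.index.get(word, "vec_%d.npy" % len(self.index))
        np.save(os.path.join(self.directory, fname), vector)
        self.index[word] = fname
        with open(self.index_path, "w") as f:
            json.dump(self.index, f)

    def __getitem__(self, word):
        return np.load(os.path.join(self.directory, self.index[word]))

    def __contains__(self, word):
        return word in self.index


store_dir = tempfile.mkdtemp()
wordDB = NpyWordStore(store_dir)
wordDB["cat"] = np.arange(5.0)
vector = NpyWordStore(store_dir)["cat"]  # reopen the store and look up by word
```

The index is rewritten on every assignment, which keeps the sketch simple but would get slow for a very large vocabulary. np.save does not compress; np.savez_compressed could be swapped in if disk space is the bottleneck.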
Have you tried just using cPickle to pickle the dictionary directly:
import cPickle
DD = dict()  # the word -> vector dictionary
with open('testfile.pkl', 'wb') as f:
    cPickle.dump(DD, f, -1)  # -1 selects the highest (binary) protocol
Alternatively, I would just save the vectors in a large multidimensional array using hdf5 or netcdf if necessary since this allows you to open a large array without bringing it all into memory at once and then get slices as needed. You can then associate the words as an additional group in the netcdf4/hdf5 file and use the common indices to quickly associate the appropriate slice from each group, or just name the group the word and then have the data be the vector. You'd have to play around with which is more efficient.
http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4-module.html
PyTables might also be a useful storage layer on top of HDF5:
http://www.pytables.org
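The single-large-array layout described above could look roughly like this with h5py (assumed to be installed). The file name, the dataset names "vectors" and "words", and the tiny sizes are illustrative assumptions:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "vectors.h5")  # hypothetical file name
words = ["cat", "dog", "fish"]
n = 1000  # stand-in for the real vector length

with h5py.File(path, "w") as f:
    # One row per word; chunking by row lets a single row be read on its own.
    vectors = f.create_dataset("vectors", shape=(len(words), n), dtype="f8",
                               chunks=(1, n), compression="gzip")
    # The words are stored alongside, sharing the row index with the vectors.
    f.create_dataset("words", data=[w.encode("utf-8") for w in words])
    for i, _ in enumerate(words):
        row = np.zeros(n)
        row[i] = 1.0  # dummy sparse vector
        vectors[i, :] = row

with h5py.File(path, "r") as f:
    # Build the word -> row mapping once, then slice rows on demand.
    row_of = {w.decode("utf-8"): i for i, w in enumerate(f["words"][...])}
    dog = f["vectors"][row_of["dog"], :]
```

Only the requested row is read from disk; the rest of the array stays in the file.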
Avoid using shelve; it's bug-ridden and has cross-platform issues.
The memory issue, however, has nothing to do with shelve. NumPy arrays provide an efficient implementation of the pickle protocol, and there is little memory overhead to cPickle.dumps(protocol=-1) compared to binary .npy (only the extra headers in pickle, basically).
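A quick way to check this claim on your own data is to compare the serialized sizes directly. This is a rough sketch using Python 3's pickle module (cPickle on Python 2 offers the same dumps call); the vector here is a small made-up stand-in:

```python
import io
import pickle

import numpy as np

vec = np.zeros(100000)  # small stand-in for a length-4M vector
vec[::1000] = 1.0

text_pickle = pickle.dumps(vec, protocol=0)     # ASCII protocol
binary_pickle = pickle.dumps(vec, protocol=-1)  # highest binary protocol

# Write the .npy representation to memory instead of a file.
buf = io.BytesIO()
np.save(buf, vec)
npy_bytes = buf.getvalue()

raw = vec.nbytes  # 800000 bytes of actual array data
sizes = {
    "protocol 0": len(text_pickle),
    "protocol -1": len(binary_pickle),
    ".npy": len(npy_bytes),
}
```

Printing sizes next to raw shows how much each format adds on top of the array data itself.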
So if binary/pickle isn't enough, you'll have to go for compression. Have a look at pytables or h5py (difference between the two).
If specifying the binary protocol in pickle is enough, you can consider something more lightweight than HDF5: check out sqlitedict as a replacement for shelve. It has no additional dependencies.
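To illustrate the idea behind a SQLite-backed shelf, here is a hand-rolled sketch using only the standard-library sqlite3 and pickle modules plus NumPy. It is not sqlitedict's own API; the class name SqliteWordStore and the table layout are hypothetical:

```python
import os
import pickle
import sqlite3
import tempfile

import numpy as np


class SqliteWordStore:
    """Hypothetical dict-like store: word -> binary-pickled vector in SQLite."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS vectors (word TEXT PRIMARY KEY, data BLOB)")

    def __setitem__(self, word, vector):
        blob = pickle.dumps(vector, protocol=-1)  # binary pickle protocol
        self.conn.execute("REPLACE INTO vectors (word, data) VALUES (?, ?)",
                          (word, sqlite3.Binary(blob)))
        self.conn.commit()  # the vector is on disk before the next one is built

    def __getitem__(self, word):
        row = self.conn.execute(
            "SELECT data FROM vectors WHERE word = ?", (word,)).fetchone()
        if row is None:
            raise KeyError(word)
        return pickle.loads(bytes(row[0]))

    def close(self):
        self.conn.close()


path = os.path.join(tempfile.mkdtemp(), "wordDB.sqlite")
wordDB = SqliteWordStore(path)
wordDB["cat"] = np.arange(4.0)
wordDB.close()

wordDB = SqliteWordStore(path)  # reopen later, e.g. during evaluation
vector = wordDB["cat"]
wordDB.close()
```

Only the requested row is unpickled on lookup, so the full set of vectors never has to fit in RAM.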