I have popen'd a process that is producing a list of dictionaries, something like:
[{'foo': '1'},{'bar':2},...]
The list takes a long time to create and could be many gigabytes, so I don't want to reconstitute it in memory and then iterate over it.
How can I parse the partially completed list such that I can process each dictionary as it is received?
The Python tokenizer is available as part of the Python standard library, in module tokenize. It takes its input from a readline function supplied to it at the start (each call to which must return a "line" of input), so it can operate incrementally. If there are no newlines in your input you can simulate them, as long as you can identify spots where adding a newline is innocuous (i.e. does not break a token in half); thanks to the starting [, everything will be one "logical" line anyway. The only tokens that require care to avoid being broken are quoted strings. I'm not pursuing this in depth at this time, since if you actually have newlines in your input you won't need to worry.
From the stream of tokens you can reconstruct the string representing each dict in the list (from an opening brace token to the balancing closing brace), and use ast.literal_eval to get the corresponding Python dict.
So, do you have newlines in your input? If so, the whole task should be very easy.
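A minimal sketch of that approach (assuming your input does have newlines and that the stream's readline returns text lines; iter_dicts and the producer command below are just illustrative names):

import ast
import tokenize

def iter_dicts(stream):
    # Walk the token stream; collect the tokens from each opening brace
    # to its balancing closing brace, then literal_eval that chunk.
    tokens = tokenize.generate_tokens(stream.readline)
    depth = 0
    parts = []
    for toktype, tokstring, start, end, line in tokens:
        if tokstring == '{':
            depth += 1
        if depth:
            parts.append(tokstring)
        if tokstring == '}':
            depth -= 1
            if depth == 0:
                # one complete dict literal has been collected
                yield ast.literal_eval(' '.join(parts))
                parts = []

Usage with the popened process (the command is hypothetical):

import subprocess
proc = subprocess.Popen(['producer'], stdout=subprocess.PIPE)
for d in iter_dicts(proc.stdout):
    ... process d ...

Each dict is yielded as soon as its closing brace has been read, so the whole list never has to sit in memory.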
Pickle each dictionary separately. Shelve can help you do this.
Writer
import shelve
db = shelve.open(filename)
count = 0
for ...whatever...:
    # build the object
    db[str(count)] = object   # shelve keys must be strings
    count += 1
db['size'] = count
db.close()
Reader
import shelve
db = shelve.open(filename)
size = db['size']
for i in xrange(size):
    object = db[str(i)]   # keys were stored as strings
    # process the object
db.close()