I'm doing some indexing where memory is plentiful but CPU is the bottleneck. I have one huge dictionary, and I'm merging a smaller dictionary into it:
big_dict = {"the" : {"1" : 1, "2" : 1, "3" : 1, "4" : 1, "5" : 1}}
smaller_dict = {"the" : {"6" : 1, "7" : 1}}
#after merging
resulting_dict = {"the" : {"1" : 1, "2" : 1, "3" : 1, "4" : 1, "5" : 1, "6" : 1, "7" : 1}}
My question is: for the values in both dicts, should I use a dict (as displayed above) or a list (as displayed below), given that my priority is to spend memory in order to save CPU?
For clarification, using a list would look like:
big_dict = {"the" : [1, 2, 3, 4, 5]}
smaller_dict = {"the" : [6,7]}
#after merging
resulting_dict = {"the" : [1, 2, 3, 4, 5, 6, 7]}
Side note: the reason I'm nesting a dict inside a dict rather than a set inside a dict is that json.dumps fails on a set, since a set isn't key/value pairs; as far as the JSON library is concerned it's just {"a", "series", "of", "keys"}.
Also, after choosing between a dict and a list, how would I implement the most CPU-efficient way of merging them?
I appreciate the help.
Hmmm. I would first go for a dict-of-dicts approach, as Python has one of the most fine-tuned dict implementations around, so I highly doubt you can do any better with a dict-of-lists.
As for merging the dicts, this should be enough:
for key, value in smaller_dict.items():
    try:
        big_dict[key].update(value)   # key already present: merge the inner dicts
    except KeyError:
        big_dict[key] = dict(value)   # new key: store a copy of the inner dict
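For instance, running that loop on the dictionaries from the question (with a hypothetical extra key "a" added to smaller_dict just to exercise the KeyError branch):

big_dict = {"the": {"1": 1, "2": 1, "3": 1, "4": 1, "5": 1}}
smaller_dict = {"the": {"6": 1, "7": 1}, "a": {"1": 1}}  # "a" is a made-up extra key

for key, value in smaller_dict.items():
    try:
        big_dict[key].update(value)
    except KeyError:
        big_dict[key] = dict(value)

print(big_dict)
# {'the': {'1': 1, '2': 1, '3': 1, '4': 1, '5': 1, '6': 1, '7': 1}, 'a': {'1': 1}}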
I would probably also experiment with subclassing json.JSONEncoder to be able to serialize set types:
import json

class SetEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, set):
            # represent a set as a dict whose keys are the set's members
            return dict.fromkeys(obj)
        return json.JSONEncoder.default(self, obj)
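Usage would then look something like this (a minimal sketch; key order in the output may vary, and note that json.dumps coerces the integer keys to strings):

data = {"the": {1, 2, 3, 4, 5}}
print(json.dumps(data, cls=SetEncoder))
# e.g. {"the": {"1": null, "2": null, "3": null, "4": null, "5": null}}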
This latter method might add some overhead on the serialization side, however, and you will also need to convert these dicts back to sets upon deserialization, either by subclassing json.JSONDecoder or by doing it yourself in an extra step.
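A lighter-weight alternative to subclassing json.JSONDecoder is the object_hook parameter of json.loads. The sketch below assumes the all-None convention produced by dict.fromkeys above, and the helper name as_set is made up:

def as_set(obj):
    # assumption: any non-empty object whose values are all null was
    # produced by SetEncoder above, so turn it back into a set of keys
    if obj and all(v is None for v in obj.values()):
        return set(obj)
    return obj

restored = json.loads('{"the": {"1": null, "2": null}}', object_hook=as_set)
print(restored)  # {'the': {'1', '2'}}  (set order may vary)

object_hook runs on every JSON object from the inside out, so the inner mapping becomes a set before the outer dict is inspected, and the outer dict (whose values are now sets, not None) is left alone.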
It really depends on what you want to do with the values in your inner lists/dictionaries. If, when you add a new entry, you want the inner collection to contain only unique values, then the list implementation will be much slower for large lists: checking membership scales roughly as O(n), versus O(1) average case for dictionaries.
If you don't care about duplicates in those inner lists, the comparison is much closer.
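A rough way to see the difference yourself (a sketch using timeit; absolute numbers will vary by machine):

import timeit

# membership test against 100,000 entries: the dict lookup stays flat
# as the container grows; the list scan does not
print(timeit.timeit("99999 in c", setup="c = dict.fromkeys(range(100000))", number=1000))
print(timeit.timeit("99999 in c", setup="c = list(range(100000))", number=1000))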
I would use dictionaries, as you have. Python's dictionaries are very efficient (speaking as someone who's tried to implement dictionary data structures in C for real-time applications).
As for not using sets: since memory isn't an issue, you say, it would be better to tweak the serialization and keep the speed-critical part of your code as simple as possible. After deserialization, just go through and convert the lists to sets:
big_dict = {"the" : [1, 2, 3, 4, 5]} # imagine we got this from JSON
for key, value in big_dict:
big_dict[key] = set(value)
That should do it. Unless you are serializing/deserializing the whole index all the time, this added pre-processing cost should be amortized over enough requests not to matter.
Alternatively you can register encoders and decoders with JSON so you can do this conversion automatically. I usually don't bother when the issue is something so small and contained, however.
So in your dictionary based approach you could do:
for key, value in smaller_dict.items():
    if key in big_dict:
        big_dict[key].update(value)
    else:
        big_dict[key] = value
If you want big_dict to keep its own copy of the inner dictionary, use dict(value) instead of value on the last line. You could also use try/except KeyError instead of the membership test in the last loop, but if...else is a fraction faster (on my machine, YMMV).
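To see why the dict(value) copy matters: without it, big_dict and smaller_dict share the same inner object, so later mutations show up in both (a small illustration):

big_dict = {}
smaller_dict = {"the": {"6": 1}}

big_dict["the"] = smaller_dict["the"]        # aliased, not copied
smaller_dict["the"]["7"] = 1
print(big_dict["the"])                       # {'6': 1, '7': 1} -- changed too

big_dict["the"] = dict(smaller_dict["the"])  # independent shallow copy
smaller_dict["the"]["8"] = 1
print(big_dict["the"])                       # {'6': 1, '7': 1} -- unaffected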
Any hash-based container will be better than a list for this kind of thing.
I'd still use a set instead of a dict; if you're having trouble with json.dumps, you can get around it by converting the set to a dict when you go to serialize: dict.fromkeys(the_set, 1). And for pulling them out again: set(the_dict.keys()). It's easier than mucking about with registering JSON providers.
And as for merging: merged_set = bigger_set.union(smaller_set)
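Putting the whole set-based approach together (a sketch combining the conversions above; the variable names are illustrative):

import json

big_dict = {"the": {1, 2, 3, 4, 5}}
smaller_dict = {"the": {6, 7}}

# merge: union the inner sets key by key
for key, value in smaller_dict.items():
    if key in big_dict:
        big_dict[key] |= value
    else:
        big_dict[key] = set(value)

# serialize: dict.fromkeys(the_set, 1) makes each set JSON-friendly
payload = json.dumps({k: dict.fromkeys(v, 1) for k, v in big_dict.items()})

# deserialize: set(the_dict.keys()) restores the sets
# (note: JSON turns the integer keys into strings on the way through)
restored = {k: set(v.keys()) for k, v in json.loads(payload).items()}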