nested dictionaries or tuples for key?

https://www.devze.com 2023-03-29 15:44 Source: web

Suppose there is a structure like this:

{'key1' : { 'key2' : { .... { 'keyn' : 'value' } ... } } }

Using python, I'm trying to determine advantages/disadvantages of two different approaches:

{'key1' : { 'key2' : { .... { 'keyn' : 'value' } ... } } } # A. nested dictionary
{('key1', 'key2', ...., 'keyn') : 'value'} # B. a dictionary with a tuple used as key

Then I'm interested to know, what is the best (A or B) in terms of:

  • Memory occupation
  • Complexity of insertion (considering collision-avoiding algorithms, etc.)
  • Complexity of lookup


Without going into details (which are highly implementation-dependent anyway and may be invalidated by the next genius to come along and tweak the dictionary implementation):

  • For memory overhead: Each object has some overhead (e.g. refcount and type; an empty object is 8 bytes and an empty tuple is 28 bytes), but hash tables additionally need to store hash, key and value for each entry, and usually use more buckets than currently needed to avoid collisions. Tuples, on the other hand, can't be resized and don't have collisions, i.e. an N-tuple can simply allocate N pointers to the contained objects and be done. This leads to noticeable differences in memory consumption.
  • For lookup and insertion complexity (the two are identical in this regard): Be it a string or a tuple, collisions are rather unlikely in CPython's dict implementation, and are resolved very efficiently. More keys (because you flatten the key space by combining the keys into tuples) may seem to increase the likelihood of collisions, but more keys also lead to more buckets (AFAIK the current implementation tries to keep the load factor below 2/3), which in turn makes collisions less likely. You also hardly need more hashing to get to a value (well, one more function call and some C-level XOR-ing for the tuple hash, but that's negligible).

You see, there shouldn't be any noticeable difference in performance, though there will be some difference in memory. The latter won't be notable though, I think: a one-element dict is 140 bytes, and a ten-element tuple is 140 bytes as well (according to sys.getsizeof on Python 3.2). So even with the (already unrealistic, says my gut feeling) ten-level nesting, you'd have slightly more than one kB of difference - possibly less if the nested dicts have multiple items (that depends on the exact load factor). That may be too much for a data-crunching application that keeps hundreds of such data structures in memory, but most objects aren't created that frequently.

You should simply ask yourself which model is more appropriate for your problem. Consider that using a tuple of keys requires you to have all keys for a value available at once, while using nested dictionaries allows getting there incrementally.
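A minimal sketch of that difference (key names are illustrative): the tuple key needs the full path up front, while the nested dict can be walked one level at a time, and a whole subtree is a single lookup:

```python
# A. nested dictionary: the path can be traversed incrementally
nested = {'key1': {'key2': {'key3': 'value'}}}
level = nested['key1']        # partial lookup is possible...
level = level['key2']         # ...e.g. to enumerate subkeys first
assert level['key3'] == 'value'

# B. tuple key: all keys must be available at once
flat = {('key1', 'key2', 'key3'): 'value'}
assert flat['key1', 'key2', 'key3'] == 'value'

# Everything under 'key1' is one dict access for A,
# but a full scan over all keys for B:
subtree_a = nested['key1']
subtree_b = {k: v for k, v in flat.items() if k[0] == 'key1'}
```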


If you need to use the whole combination of key1 to keyn to get value, you can flip the dict like I suggested below for O(nk*nv) (number of keys * number of values) or use the tuple method above.

Assuming you need to build the tuple on insertion and again when you need to get a value, both will be O(nk) where nk is the number of keys.

The nested dict version might be more space efficient if you're nested fairly deeply (i.e. lots of values share a common prefix of keys), and getting a value will still be O(nk), though probably slower than the tuple version.

However, insertion will be slower, though I can't quantify its speed. You will have to construct at least one layer of dict for each insertion, and test for the existence of dicts at the previous levels.

There are many recipes for recursive defaultdicts that would simplify insertion from a coding perspective, but they wouldn't actually speed things up.
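One common recipe (the same one used in the memory script further down): a defaultdict whose factory is itself, so intermediate levels spring into existence on first access:

```python
from collections import defaultdict

def tree():
    """A dict that auto-creates nested dicts on missing keys."""
    return defaultdict(tree)

d = tree()
d['key1']['key2']['key3'] = 'value'   # no manual layer construction
assert d['key1']['key2']['key3'] == 'value'

# Caveat: merely looking up a missing key also creates an entry.
_ = d['typo']
assert 'typo' in d
```

Convenient to write, but under the hood it still builds one dict per level, so insertion cost is unchanged.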

The tuple method is both simpler and faster for insertion, but may take more memory depending on your nesting.


Original answer from before I knew the requirements

Why not

{'key1' : 'value', 'key2' : 'value', .... 'keyn' : 'value' } 

It's just storing a reference to value at each location, not value itself, so the memory usage would be less than the nested dict version, and not much larger than the tuple version, unless you have an extremely large number of values.

For time complexity of Python standard type operations, see the Python wiki.

Basically, insertion or getting one item on average is O(1).

Getting all the keys for a value would on average be O(n):

[key for key in mydict if mydict[key] == value]

However, if adding keys or searching for all keys is your usual operation, you should flip the dict. You want:

{'value': [key1, key2, ... keyn]}

This way you can add keys with just mydict[value].append(newkey) and get all the keys with just mydict[value] and both will be on average O(1).
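With collections.defaultdict(list) the flipped mapping doesn't even need explicit initialization (a sketch; mydict and the key names are illustrative):

```python
from collections import defaultdict

mydict = defaultdict(list)          # value -> list of keys

def add_key(value, newkey):
    mydict[value].append(newkey)    # average O(1)

add_key('value', 'key1')
add_key('value', 'key2')

assert mydict['value'] == ['key1', 'key2']   # all keys: average O(1)
```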


Memory Consumption Testing

I've written a small script to test this. It has some limitations, though: the keys are integers distributed linearly (i.e. range(N)). My findings are the following.

With a 3-level nesting, i.e. dict[a][b][c] vs dict[a,b,c] where each sub index goes from 0 to 99, I find the following:

With large values (list(x for x in range(100))):

> memory.py nested 
Memory usage: 1049.0 MB
> memory.py flat  
Memory usage: 1149.7 MB

and with small values ([]):

> memory.py nested
Memory usage: 134.1 MB
> memory.py flat
Memory usage: 234.8 MB

Open questions

  • Why is this happening?
  • Would this change with different indices, e.g. non-consecutive ones?

Script

#!/usr/bin/env python3
import resource
import random
import itertools
import sys
import copy
from os.path import basename
from collections import defaultdict
 
# constants
index_levels = [100, 100, 100]
value_size   = 100 # try values like 0

def memory_usage():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

_object_mold = list(x for x in range(value_size)) # performance hack
def create_object():
    return copy.copy(_object_mold)

# automatically create nested dict
# http://code.activestate.com/recipes/578057-recursive-defaultdict/
f = lambda: defaultdict(f)
my_dict = defaultdict(f)

# options & usage
try:
    dict_mode = sys.argv[1]
    if dict_mode not in ('flat', 'nested'):
        raise ValueError(dict_mode)
except (IndexError, ValueError):
    print("Usage: {} [nested | flat]".format(basename(sys.argv[0])))
    sys.exit(1)
 
index_generator = [range(level) for level in index_levels]

if dict_mode == "flat":
    for index in itertools.product(*index_generator):
        my_dict[index] = create_object()
elif dict_mode == "nested":
    for index in itertools.product(*index_generator):
        sub_dict = my_dict
        for sub_index in index[:-1]:          # iterate into lowest dict
            sub_dict = sub_dict[sub_index]
        sub_dict[index[-1]] = create_object() # finally assign value

print("Memory usage: {:.1f} MB".format(memory_usage() / 1024**2))


Performance Testing

I performed tests for looping, retrieval, and insertion for a nested dictionary and for a dictionary with tuple keys. The nesting is one level deep (d[i][j]), with 2,000,000 values in total. I also did retrieval and insertion for the tuple dict with the tuples created in advance.

These are the results. I don't think you can draw firm conclusions from the standard deviations.

-

keydictinsertion: Mean +- std dev: 615 ms +- 42 ms  
keydictretrieval: Mean +- std dev: 475 ms +- 77 ms  
keydictlooping: Mean +- std dev: 76.2 ms +- 7.4 ms  

nesteddictinsertion: Mean +- std dev: 200 ms +- 7 ms  
nesteddictretrieval: Mean +- std dev: 256 ms +- 32 ms  
nesteddictlooping: Mean +- std dev: 88.5 ms +- 14.4 ms  

Tests where the tuple was already created for the keydict  
keydictinsertionprepared: Mean +- std dev: 432 ms +- 26 ms  
keydictretrievalprepared: Mean +- std dev: 267 ms +- 14 ms

-

As you can see, the nested dict is often faster than the dict with a single tuple key. Even when the keydict was given a tuple directly, without the tuple-creation step, insertion was still much slower. It seems that creating the extra inner dicts does not cost much; defaultdict probably has a fast implementation. Retrieval was actually almost equal when no tuple had to be created, and the same goes for looping.

Testing is done with perf from the command line. Scripts are below.

>>>>>>> nesteddictinsertion
python -m perf timeit -v -s "
from collections import defaultdict
" " 
d = defaultdict(dict)
for i in range(2000):
    for j in range(1000):
        d[i][j] = 1
"
>>>>>>> nesteddictlooping
python -m perf timeit -v -s "
from collections import defaultdict
d = defaultdict(dict)
for i in range(2000):
    for j in range(1000):
        d[i][j] = 1
" "
for i, inner_dict in d.items():
    for j, val in inner_dict.items():
        i
        j
        val
"
>>>>>>> nesteddictretrieval
python -m perf timeit -v -s "
from collections import defaultdict
d = defaultdict(dict)
for i in range(2000):
    for j in range(1000):
        d[i][j] = 1
" "
for i in range(2000):
    for j in range(1000):
        d[i][j]
"
>>>>>>> keydictinsertion
python -m perf timeit -v -s "
from collections import defaultdict
" " 
d = {}
for i in range(2000):
    for j in range(1000):
        d[i, j] = 1
"
>>>>>>> keydictinsertionprepared
python -m perf timeit -v -s "
from collections import defaultdict
keys = [(i, j) for i in range(2000) for j in range(1000)]
" " 
d = {}
for key in keys:
    d[key] = 1
"
>>>>>>> keydictlooping
python -m perf timeit -v -s "
from collections import defaultdict
d = {}
for i in range(2000):
    for j in range(1000):
        d[i, j] = 1
" "
for key, val in d.items():
    key
    val
"
>>>>>>> keydictretrieval
python -m perf timeit -v -s "
from collections import defaultdict
d = {}
for i in range(2000):
    for j in range(1000):
        d[i, j] = 1
" "
for i in range(2000):
    for j in range(1000):
        d[i, j]
"
>>>>>>> keydictretrievalprepared
python -m perf timeit -v -s "
from collections import defaultdict
d = {}
keys = [(i, j) for i in range(2000) for j in range(1000)]
for key in keys:
    d[key] = 1
" "
for key in keys:
    d[key]
"
