Creating a dictionary of 700k items in Python?

devze.com (https://www.devze.com) 2023-02-15 14:42, source: web
I have a class called Point which has only x_coordinate and y_coordinate floats as its attributes. I have about 700000 of those Point objects which I would like to store in a dictionary.

The dictionary's keys are Point objects and the corresponding values are Region objects, each of which contains about a thousand points. Basically, key p maps to value r, meaning that the Point object belongs to that specific Region object.

Long story short, this is the primitive loop I'm trying to execute:

look_up_table = {}
for region in all_regions_list:
    for point in region.points_list:
        look_up_table[point] = region

Across all the regions there are about 700000 Point objects, already loaded into memory. So line 4 of the snippet will be executed about 700k times, and the dictionary will end up with about 700k (key, value) pairs.

Say each Point object occupies 1KB of memory (which isn't really possible); then 700000 Points would occupy about 680MB. And I have more than 3GB of free RAM. Furthermore, those Point and Region objects are already loaded into memory...

However, those four simple lines take forever to complete. I've waited for about 2 hours with no luck... It takes about an hour to hash only 10k objects...

So, my question is, is there anything I'm doing wrong? Is there another way to store 700k objects in memory and be able to look-up a certain key in O(1) time?

By the way, below are the overridden methods of the Point class:

def __hash__(self, *args, **kwargs):
    """ overridden hash method """
    hashcode = 133
    hashcode = hashcode * 23 + self.x_coordinate
    hashcode = hashcode * 23 + self.y_coordinate
    return hashcode

def __eq__(self, other):
    """ overridden equals method """
    if not other:
        return False
    else:
        return self.__cmp__(other) == 0

def __cmp__(self, other):
    """ overridden compare method """
    if other is not None:
        origin = Point(0.0, 0.0)
        if self.distance_between(origin) < other.distance_between(origin):
            return -1
        elif self.distance_between(origin) > other.distance_between(origin):
            return 1
        else:
            return 0

Thanks in advance...


There is at least one error in your code -- probably not causing the performance problems, but worth pointing out. There is one requirement a __hash__() method must fulfil according to the Python documentation:

The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.

With your method definitions, it is perfectly possible that two objects that compare equal have different hashes.
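A minimal sketch of that contract violation, using the poster's order-sensitive hash together with distance-based equality (the squared-distance comparison here is an assumed stand-in for the unshown distance_between method):

```python
class Point(object):
    def __init__(self, x, y):
        self.x_coordinate = x
        self.y_coordinate = y

    def __hash__(self):
        # The poster's hash: an order-sensitive mix of the coordinates.
        hashcode = 133
        hashcode = hashcode * 23 + self.x_coordinate
        hashcode = hashcode * 23 + self.y_coordinate
        return int(hashcode)

    def __eq__(self, other):
        # The poster's equality: same distance to the origin
        # (compared here via squared distance, which is equivalent).
        return (self.x_coordinate ** 2 + self.y_coordinate ** 2
                == other.x_coordinate ** 2 + other.y_coordinate ** 2)

a = Point(3.0, 4.0)
b = Point(4.0, 3.0)
print(a == b)              # True: both are distance 5 from the origin
print(hash(a) == hash(b))  # False: equal objects, different hashes
```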


Your __hash__ function is a very poor choice of hash. Try removing that function entirely and see if the default hash function does better.

Also, your __cmp__ function is quite inefficient, because it will call distance_between four times for every comparison. Try precomputing, for each Point, the distance between that point and the origin, and storing it inside the Point object. Then __cmp__ could be something like:

def __cmp__(self, other):
    return other.distance_to_origin - self.distance_to_origin
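A minimal sketch of that precomputation (the distance_to_origin attribute name follows the suggestion above; the constructor signature is assumed):

```python
import math

class Point(object):
    def __init__(self, x_coordinate, y_coordinate):
        self.x_coordinate = x_coordinate
        self.y_coordinate = y_coordinate
        # Compute the distance to the origin once, at construction,
        # instead of on every __cmp__ call.
        self.distance_to_origin = math.hypot(x_coordinate, y_coordinate)

    def __cmp__(self, other):
        return other.distance_to_origin - self.distance_to_origin

p = Point(3.0, 4.0)
print(p.distance_to_origin)  # 5.0
```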


Why not use a simple (x, y) tuple for your points? Or a namedtuple. It will be way faster.
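A sketch of the namedtuple variant (the field names match the question; "region-a" stands in for a real Region object):

```python
from collections import namedtuple

# namedtuples hash and compare field-by-field automatically,
# so no custom __hash__/__eq__ is needed and dict inserts are fast.
Point = namedtuple("Point", ["x_coordinate", "y_coordinate"])

look_up_table = {}
p = Point(1.5, 2.5)
look_up_table[p] = "region-a"

# A freshly built, equal namedtuple finds the same entry:
print(look_up_table[Point(1.5, 2.5)])  # region-a
```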


Comments on this code:

  1. Since your methods have a fixed number of arguments (like def __eq__(self, other):), you do not need to test whether other is None. You only need to do that when arguments are optional.

  2. Your __cmp__ method is suspect. You do not show the definition of distance_between, but the name implies it is the Euclidean norm of the point; i.e., the distance from (x, y) to the origin (0, 0). Recall that the Euclidean norm of point (3, 4) is the same as that of point (4, 3). Are these really meant to be the same point in the logic of your program, even though the point x=3, y=4 is a different point than x=4, y=3?

  3. You said that each region has about 1,000 points and that there are 700,000 points in total, so there must be approximately 700 regions. For simplicity, I will use plain region indices rather than replicate the Region class.

So, consider this:

import random
import math

size=100

class Point(object):
    def __init__(self,x=0.0,y=0.0):
        self.x=x
        self.y=y

    def __hash__(self):
        t=(self.x,self.y)
        return hash(t)

    def __eq__(self, other):
        return self.__cmp__(other) == 0

    def distance_to_origin(self):
        return math.hypot(self.x,self.y)

    def __cmp__(self, other):
        return other.distance_to_origin() - self.distance_to_origin()

look_up_table = {}
for i in range(0,700):
    for j in range(0,1000):
        x=Point(random.uniform(-size,size),random.uniform(-size,size))
        look_up_table[x]=i

print(len(look_up_table), "hashed points with", size * 2, "range")

The example code implements a faster __hash__ based on the tuple of its constituent parts. On my machine, this dictionary is constructed in 3 or 4 seconds.

As stated, the __cmp__ logic (which I have just streamlined) may be faulty:

x=Point(1.0,2.0)
y=Point(1.0,2.0)
z=Point(2.0,1.0)
a=Point(2.2,1.1)

print "x=y?", x==y    # as expected...
print "x=a?", x==a    # also
print "x=z?", x==z    # is that what you expect?

Even though Point(1.0, 2.0) == Point(2.0, 1.0) because they have the same distance to Point(0, 0), hash((1.0, 2.0)) != hash((2.0, 1.0)). The result is that you will have two objects that compare equal but hash as different objects.


OK, I'm a late responder here, so:

The problems with __hash__ and __eq__ as used here are well covered. I would say that a namedtuple is a huge performance pickup here. The next fastest gain (based mostly on size) would be to use __slots__. The dead giveaway is when you note that Point has "only" two attributes.
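A minimal sketch of the __slots__ variant (attribute names from the question; the hash/eq pair follows the tuple-based fix shown in the earlier answer):

```python
class Point(object):
    # __slots__ removes the per-instance __dict__, cutting memory
    # use substantially when you have 700k instances.
    __slots__ = ("x_coordinate", "y_coordinate")

    def __init__(self, x, y):
        self.x_coordinate = x
        self.y_coordinate = y

    def __hash__(self):
        return hash((self.x_coordinate, self.y_coordinate))

    def __eq__(self, other):
        return (self.x_coordinate, self.y_coordinate) == \
               (other.x_coordinate, other.y_coordinate)

p = Point(1.0, 2.0)
# p.z = 3.0  # would raise AttributeError: __slots__ blocks new attributes
```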

Depending on the actual usage patterns, I would probably either:

  1. Add the Region attribute to Point if you are creating the points and regions at the same time. Then the Point "knows" its "parent"... (e.g. return self.region in Point)
  2. namedtuples should speed things up for a problem of this size, but if you needed to scale to bigger numbers, you could always cheat and use sqlite3 (in memory or not) or some other dbm library and let someone else worry about how to scale the size infinitely


Have you considered using NumPy arrays? NumPy stores elements uniformly in a contiguous block rather than carrying the overhead of individual Python objects. Way faster and much more memory efficient.
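A sketch of that idea, assuming the coordinates fit in a single float array, with the region index held in a parallel integer array instead of a dict (the nearest-point query is just one illustrative kind of lookup):

```python
import numpy as np

n = 700_000
# One contiguous (n, 2) array of x/y coordinates instead of 700k objects.
points = np.random.uniform(-100, 100, size=(n, 2))
# Parallel array mapping each point index to its region index
# (700 regions of 1,000 points each, as in the question).
region_of = np.repeat(np.arange(700), 1000)

# Lookups become vectorized operations, e.g. find the region of the
# point closest to a query location:
query = np.array([0.0, 0.0])
nearest = np.argmin(((points - query) ** 2).sum(axis=1))
region = region_of[nearest]
```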
