I have a class called Point, which has only x_coordinate and y_coordinate floats as its attributes. I have about 700,000 of those Point objects, which I would like to store in a dictionary. The dictionary's keys are Point objects and the corresponding values are Region objects, each of which contains about a thousand points. Basically, the key p maps to the value r, meaning that point object belongs to that specific region object.
Long story short, this is the primitive loop I'm trying to execute:
look_up_table = {}
for region in all_regions_list:
    for point in region.points_list:
        look_up_table[point] = region
Within all the regions, there are about 700,000 Point objects, already loaded into memory. So the 4th line in the code will be executed about 700k times, and the dictionary will end up with about 700k (key, value) pairs.
Say each Point object occupies 1 KB of memory (which isn't really possible); then 700,000 Points would occupy about 680 MB. And I have more than 3 GB of free RAM. Furthermore, those Point and Region objects are already loaded into memory...
However, those simple four lines take forever to complete; I waited for about two hours with no luck... It takes about an hour to hash only 10k objects.
So, my question is: is there anything I'm doing wrong? Is there another way to store 700k objects in memory and be able to look up a certain key in O(1) time?
By the way, below are the overridden methods of the Point class:
def __hash__(self, *args, **kwargs):
    """ overridden hash method """
    hashcode = 133
    hashcode = hashcode * 23 + self.x_coordinate
    hashcode = hashcode * 23 + self.y_coordinate
    return hashcode

def __eq__(self, other):
    """ overridden equals method """
    if not other:
        return False
    else:
        return self.__cmp__(other) == 0

def __cmp__(self, other):
    """ overridden compare method """
    if other is not None:
        origin = Point(0.0, 0.0)
        if self.distance_between(origin) < other.distance_between(origin):
            return -1
        elif self.distance_between(origin) > other.distance_between(origin):
            return 1
        else:
            return 0
Thanks in advance...
There is at least one error in your code -- probably not causing the performance problems, but worth pointing out. There is one requirement a __hash__() method must fulfil according to the Python documentation:
The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
With your method definitions, it is perfectly possible that two objects that compare equal have different hashes.
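For example, two points at the same distance from the origin compare equal under the posted __cmp__, yet the posted __hash__ mixes the coordinates in an order-sensitive way. A quick illustration, assuming a Point(x, y) constructor and that distance_between computes the Euclidean distance:

p = Point(3.0, 4.0)
q = Point(4.0, 3.0)

print p == q          # True: both are distance 5.0 from the origin
print p.__hash__()    # 70430.0 -- also note this is a float; __hash__ must return an integer
print q.__hash__()    # 70452.0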
Your __hash__ function is a very poor choice of hash. Try removing that function entirely and see if the default hash function does better.
Also, your __cmp__ function is quite inefficient, because it will call distance_between four times for every comparison. Try precomputing, for each Point, the distance between that point and the origin, and storing it inside the Point object. Then __cmp__ could be something like:
def __cmp__(self, other):
    # negative when self is closer to the origin, positive when farther
    return self.distance_to_origin - other.distance_to_origin
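A minimal sketch of that precomputation, assuming the coordinates are fixed at construction time:

import math

class Point(object):
    def __init__(self, x, y):
        self.x_coordinate = x
        self.y_coordinate = y
        # paid once per Point instead of four times per comparison
        self.distance_to_origin = math.hypot(x, y)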
Why not use a simple (x, y) tuple for your points? Or a namedtuple. It will be way faster.
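A short sketch of the namedtuple version, with a placeholder string standing in for a Region object; it keeps attribute access while inheriting the tuple's fast, consistent __hash__ and __eq__:

from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])

p = Point(1.5, 2.5)
print p.x                              # attribute access still works
look_up_table = {p: 'some region'}
print look_up_table[Point(1.5, 2.5)]   # equal tuples hash alike, so this lookup hits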
Comments on this code:
- Since your methods have a fixed number of arguments (like def __eq__(self, other):), you do not need to test whether other is None. You only need to do that if you have optional arguments.
- Your __cmp__ method is suspect. You do not show the definition of distance_between, but it would seem to be the Euclidean norm of the point, i.e., the distance from (x, y) to the origin (0, 0). Recall that the Euclidean norm of the point (3, 4) is the same as that of the point (4, 3). Are those indeed meant to be the same point in the logic of your program, even though the point x=3, y=4 is a different point than x=4, y=3?
- You said that each region has about 1,000 points and that there are 700,000 points in total, so there must be approximately 700 regions. For simplicity, I will use that figure rather than replicate the Region class.
So, consider this:
import random
import math

size = 100

class Point(object):
    def __init__(self, x=0.0, y=0.0):
        self.x = x
        self.y = y

    def __hash__(self):
        t = (self.x, self.y)
        return hash(t)

    def __eq__(self, other):
        return self.__cmp__(other) == 0

    def distance_to_origin(self):
        return math.hypot(self.x, self.y)

    def __cmp__(self, other):
        return self.distance_to_origin() - other.distance_to_origin()

look_up_table = {}
for i in range(0, 700):
    for j in range(0, 1000):
        x = Point(random.uniform(-size, size), random.uniform(-size, size))
        look_up_table[x] = i

print len(look_up_table), "hash points with", size * 2, "range"
The example code implements a faster __hash__ based on the tuple of its constituent parts. On my machine, this dictionary is constructed in 3 or 4 seconds.
As stated, your logic (which I have just streamlined) on __cmp__ may be faulty:
x = Point(1.0, 2.0)
y = Point(1.0, 2.0)
z = Point(2.0, 1.0)
a = Point(2.2, 1.1)

print "x=y?", x == y   # as expected...
print "x=a?", x == a   # also
print "x=z?", x == z   # is that what you expect?
Even though Point(1.0, 2.0) == Point(2.0, 1.0) because they have the same distance to Point(0, 0), hash((1.0, 2.0)) != hash((2.0, 1.0)). The result is that you will have objects that compare equal but hash as different objects.
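A quick check makes the mismatch concrete, reusing the example Point class above:

p = Point(1.0, 2.0)
q = Point(2.0, 1.0)

print p == q               # True: both lie at distance sqrt(5) from the origin
print hash(p) == hash(q)   # False: the tuples (1.0, 2.0) and (2.0, 1.0) differ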
OK, I'm a "late responder" here, so: the problems with __hash__ and __eq__ as used here are well covered. I would say that a namedtuple is a huge performance pickup here. The next fastest gain (based mostly on size) would be to use __slots__. The dead giveaway is when you identify that Point has "only" the two attributes.
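A minimal sketch of the __slots__ variant; the tuple-based __hash__ and __eq__ here are one reasonable choice, not the only one:

class Point(object):
    __slots__ = ('x', 'y')   # no per-instance __dict__: smaller and faster to allocate

    def __init__(self, x=0.0, y=0.0):
        self.x = x
        self.y = y

    def __hash__(self):
        return hash((self.x, self.y))

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)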
Depending on the actual usage patterns, I would probably either:
- Add the Region attribute to the Point if you are creating the points and regions at the same time. Then the Point "knows" its "parent"... (e.g. return self.region in Point).
- Use namedtuples, which should speed things up for a problem of this size. But if you needed to scale to bigger numbers, you could always cheat and use sqlite3 (in memory or not) or some other dbm library, and let someone else worry about how to scale the size infinitely; see the sketch after this list.
Have you considered using NumPy arrays? NumPy stores elements in uniform, packed arrays rather than as individual Python objects with per-object overhead. It is way faster and much more memory efficient.
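A rough sketch of the memory argument, with made-up sizes:

import numpy as np

n = 700000
xy = np.random.uniform(-100, 100, size=(n, 2))   # 700k (x, y) float64 pairs
region_ids = np.random.randint(0, 700, size=n)   # one region id per point

print xy.nbytes   # 11200000 bytes: about 11 MB for every coordinate, no object overhead

# Plain tuples still give the O(1) dict lookup the question asks for:
look_up_table = dict(zip(map(tuple, xy), region_ids))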