I'm trying to write some Python code that includes union/intersection of sets that potentially can be very large. Much of the time, these sets will be essentially set(xrange(1<<32))
or something of the kind, but often there will be ranges of values that do not belong in the set (say, 'bit 5 cannot be clear'), or extra values thrown in. For the most part, the set contents can be expressed algorithmically.
I can go in and do the dirty work开发者_如何学编程 to subclass set
and create something, but I feel like this must be something that's been done before, and I don't want to spend days on wheel reinvention.
Oh, and just to make it harder, once I've created the set, I need to be able to iterate over it in random order. Quickly. Even if the set has a billion entries. (And that billion-entry set had better not actually take up gigabytes, because I'm going to have a lot of them.)
Is there code out there? Anyone have neat tricks? Am I asking for the moon?
You say:
For the most part, the set contents can be expressed algorithmically.
How about writing a class which presents the entire set API, but determines set inclusion algorithmically. Then with a number of classes which wrap around other sets to perform the union and intersection algorithmically.
For example, if you had a set a
and set b
which are instances of these pseudo sets:
>>> u = Union(a, b)
And then you use u
with the full set API, which will turn around and query a
and b
using the correct logic. All the set methods could be designed to return these pseudo unions/intersections automatically so the whole process is transparent.
Edit: Quick example with a very limited API:
class Base(object):
def union(self, other):
return Union(self, other)
def intersection(self, other):
return Intersection(self, other)
class RangeSet(Base):
def __init__(self, low, high):
self.low = low
self.high = high
def __contains__(self, value):
return value >= self.low and value < self.high
class Union(Base):
def __init__(self, *sets):
self.sets = sets
def __contains__(self, value):
return any(value in x for x in self.sets)
class Intersection(Base):
def __init__(self, *sets):
self.sets = sets
def __contains__(self, value):
return all(value in x for x in self.sets)
a = RangeSet(0, 10)
b = RangeSet(5, 15)
u = a.union(b)
i = a.intersection(b)
print 3 in u
print 7 in u
print 12 in u
print 3 in i
print 7 in i
print 12 in i
Running gives you:
True
True
True
False
True
False
You are trying to make a set containing all the integer values in from 0 to 4,294,967,295. A byte is 8 bits, which gets you to 255. 99.9999940628% of your values are over one byte in size. A crude minimum size for your set, even if you are able to overcome the syntactic issues, is 4 billion bytes, or 4 GB.
You are never going to be able to hold an instance of that set in less than a GB of memory. Even with compression, it's likely to be a tough squeeze. You are going to have to get much more clever with your math. You may be able to take advantage of some properties of the set. After all, it's a very special set. What you are trying to do?
If you are using python 3.0, you can subclass collections.Set
This sounds like it might overlap with linear programming. In linear programming you are trying to find some optimal case where you add constraints to a set of values (typically integers) which initially van be very large. There are various libraries listed at http://wiki.python.org/moin/NumericAndScientific/Libraries that mention integer and linear programming, but nothing jumps out as being obviously what you want.
I would avoid subclassing set
, since clearly you can usefully reuse no part of set
's implementation. I would even avoid subclassing collections.Set
, since the latter requires you to supply a __len__
-- a functionality which you appear not to need otherwise, and just can't be done effectively in the general case (it's going to be O(N)
, with, which the kind of size you're talking about, is far too slow). You're unlikely to find some existing implementation that matches your use case well enough to be worth reusing, because your requirements are very specific and even peculiar -- the concept of "random iterating and an occasional duplicate is OK", for example, is a really unusual one.
If your specs are complete (you only need union, intersection, and random iteration, plus occasional additions and removals of single items), implementing a special purpose class that fills those specs is not a crazy undertaking. If you have more specs that you have not explicitly mentioned, it will be trickier, but it's hard to guess without hearing all the specs. So for example, something like:
import random
class AbSet(object):
def __init__(self, predicate, maxitem=1<<32):
# set of all ints, >=0 and <maxitem, satisfying the predicate
self.maxitem = maxitem
self.predicate = predicate
self.added = set()
self.removed = set()
def copy(self):
x = type(self)(self.predicate, self.maxitem)
x.added = set(self.added)
x.removed = set(self.removed)
return x
def __contains__(self, item):
if item in self.removed: return False
if item in self.added: return True
return (0 <= item < self.maxitem) and self.predicate(item)
def __iter__(self):
# random endless iteration
while True:
x = random.randrange(self.maxitem)
if x in self: yield x
def add(self, item):
if item<0 or item>=self.maxitem: raise ValueError
if item not in self:
self.removed.discard(item)
self.added.add(item)
def discard(self, item):
if item<0 or item>=self.maxitem: raise ValueError
if item in self:
self.removed.add(item)
self.added.discard(item)
def union(self, o):
pred = lambda v: self.predicate(v) or o.predicate(v),
x = type(self)(pred, max(self.maxitem, o.maxitem))
toadd = [v for v in (self.added|o.added) if not pred(v)]
torem = [v for v in (self.removed|o.removed) if pred(v)]
x.added = set(toadd)
x.removed = set(torem)
def intersection(self, o):
pred = lambda v: self.predicate(v) and o.predicate(v),
x = type(self)(pred, min(self.maxitem, o.maxitem))
toadd = [v for v in (self.added&o.added) if not pred(v)]
torem = [v for v in (self.removed&o.removed) if pred(v)]
x.added = set(toadd)
x.removed = set(torem)
I'm not entirely certain about the logic determining added and removed upon union and intersection, but I hope this is a good base for you to work from.
精彩评论