the official name of this programming approach to compute the union and the intersection_问答_开发者

I [surely re] invented this [wheel] when I wanted to compute the union and the intersection and diff of two sets (stored as lists) at the same time. Initial code (not the tightest):

dct = {}
for a in lst1:
  dct[a] = 1
for b in lst2:
  if b in dct:
    dct[b] -= 1
  else:
    dct[b] = -1

union = [k for k in dct]
inter = [k for k in dct if dct[k] == 0]
oneminustwo =开发者_如何学运维 [k for k in dct if dct[k] ==  1]
twominusone = [k for k in dct if dct[k] ==  -1]

Then I realized that I should use 00, 01, 10, and 11 instead of -1, 1, 0, ... So, a bit at position n indicates membership in set n.

This can be generalized to up to 32 sets using an 32-bit int, or to any number of sets using a bitarray, or a string. So, you pre-compute this dictionary once, and then use very fast O(n) queries to extract elements of interest. For instance, all 1s means intersection of all sets. All 0s is a special one - will not occur.

Anyhow, this is not to toot my own horn. This surely was invented before and has a name. What is it called? Is this approach used in databases somewhere?

Using an N bit integer to represent N booleans is a special case of the data structure known as a perfect hash table. Notice you're explicitly using dicts (which are general hash tables) in the idea that prompted you to think about bitsets. It's a hash table because you use hashes to find a value, and it's perfect because you never have collisions. The special case is because of how the table is packed and stored.

Formulate the hash function, which shows how it's different from an array:

int bitset_hash(int n) {
  // domain of this function is only non-negative ints
  return 1 << n;
}

Notice bitset_hash(3) is 0b1000, which corresponds to the 4th item (offset/index 3) when using a C int and bitwise operations. (Because of the storage implementation detail, bitwise operations are also used to manipulate a specific item from the hash.)

Extending the approach to use bitwise-and/-or/-xor for set operations is common, and doesn't require any special name other than "set operations" or, if you need a buzzword, "set theory".

Finally, here's another example use of it in a prime sieve (I've used this code on Project Euler solutions):

class Sieve(object):
  def __init__(self, stop):
    self.stop = stop
    self.data = [0] * (stop // 32 // 2 + 1)
    self.len = 1 if stop >= 2 else 0
    for n in xrange(3, stop, 2):
      if self[n]:
        self.len += 1
        for n2 in xrange(n * 3, stop, n * 2):
          self[n2] = False

  def __getitem__(self, idx):
    assert idx >= 2
    if idx % 2 == 0:
      return idx == 2
    int_n, bit_n = divmod(idx // 2, 32)
    return not bool(self.data[int_n] & (1 << bit_n))

  def __setitem__(self, idx, value):
    assert idx >= 2 and idx % 2 != 0
    assert value is False
    int_n, bit_n = divmod(idx // 2, 32)
    self.data[int_n] |= (1 << bit_n)

  def __len__(self):
    return self.len

  def __iter__(self):
    yield 2
    for n in xrange(3, self.stop, 2):
      if self[n]:
        yield n

Yes, it is sometimes used in databases, for example PostgreSQL. As mentions Wikipedia:

Some database systems that do not offer persistent bitmap indexes use bitmaps internally to speed up query processing. For example, PostgreSQL versions 8.1 and later implement a "bitmap index scan" optimization to speed up arbitrarily complex logical operations between available indexes on a single table.

(from http://en.wikipedia.org/wiki/Bitmap_index)

It's very common to use an integer to represent a set of small integers; it's often called a bitset or bitvector. Here you're using an integer to represent "the set of input sequences that contain this value".

The operation you're doing reminds me of reversing a multimap.

In your case, the input is a list of lists:

[["a", "b"], ["a", "c", "d"]]

But you could instead think of it as a bag of ordered pairs, like this:

0, "a"
0, "b"
1, "a"
1, "c"
1, "d"

You're simply constructing a table that contains the reversed pairs

"a", 0
"b", 0
"a", 1
"c", 1
"d", 1

which looks like this:

{"a": [0, 1],
 "b": [0],
 "c": [1],
 "d": [1]}

and you happen to be representing those arrays of integers using bitvectors.

The original data structure (the list of lists) made it easy to iterate over all values for a given list. The reversed data structure (the dictionary of lists) makes it easy to find all lists that have a given value.

Is the idea of a bit field what you're looking for? Every bit of your ... field (for lack of a better word) represents a flag. In this case, your flag is membership in set N.

Edit - I think I misunderstood which idea you were referring to. Disregard?