开发者

The fastest way to do a collection subtraction

开发者 https://www.devze.com 2022-12-22 01:33 出处:网络
I have two Sets. Set bis the subset of Set a. they\'re both very huge Sets. I want to subtract b from a, what\'s the best practice to do this common operation ?

I have two Sets. Set b is the subset of Set a. they're both very huge Sets. I want to subtract b from a , what's the best practice to do this common operation ? I've written to many codes like this , and I don't think it's efficient. what's your idea ?

pseudo code : (this is not Java API) .

for(int i = 0 ; i < a.size(); i++) {
          for (int j=0 ; j < b.size() ;j++) {
              // do comparison , if found equals ,remove from a
              break;
          }
 }

And I want to find an algorithm , not only applies to Sets, also works for Array.

EDIT: The Set here is not JAVA API , it's a data structure . so I don't care if Java API has a removeAll开发者_运维知识库() method , I want find a common solution for this problem , I've encounter a lot of problems like this when I use Javascript and Actionscript .


I don't think you will get it much faster, but your code will look simpler and will not become slower by a.removeAll(b);. removeAll() is part of the Java-API.

For efficiency analysis: Your given code example is O(n^2), that scales not very good, but also isn't the horriblest thing on earth (exponential complexity is the thing you don't want). As long as you don't know the internal organisation of the data in the Collection, you will not get a better performance. removeAll() is implemented by the class itself and knows about the internal organisation. So if the data is organized in a Hash, you may get better results, if the data is organized in an unsorted array, the complexity will be the same. A Set have to efficently lookup if a new item is already in the set, so I suspect some sort of Hash as internal representation, especially if the implementation is called HashSet. :-)

EDIT: The OP changed it's question to mention it is not only for Java. removeAll() is a Java-API, so this (or something similar) may not be available in other languages. As said before, if the collections are unsorted arrays with no other restrictions the two for-loops are already the fastest solution. But if the data is organized different you have faster options. If the two collections are sorted data (in my example comes the smallest element first), you can do the following (reducing the complexity to O(n)):

int bIndex = 0;
for(int i = 0 ; i < a.size(); i++) {
          while (a[i] < b[bIndex]) {bIndex++;}
          if (a[i] == b[bIndex]) {markForRemoval(a[i]);} // I mark this only for removal, as the actual removal would make your index incorrect
}

If the data is organized as a hash in both collections you also need only one for-loop, accessing directly the element in b. Other possible organizations of data are possible.


In the end, there's not much of a choice other than to one by one compare elements and remove the ones which are in both.

To do it another way, you'd have to do something fancy like give all set members a unique value index, and construct a huge array of booleans representing each set, and then you could do bit operations to subtract B from A. I have no idea if that would be faster, given the overhead of creating unique value indices and manipulating the very large bitmasks.

I know you don't care about a Java solution, but since other people have recommended removeAll(), I'd like to point out that it's still doing essentially the same thing under the covers. Check the source for HashSet.


If the sets are maintained such that the elements are available at any given time in sorted order, then you can perform a single linear pass over both sets and create the difference in O(n) time. Now, again, that's if you can get at the ordered lists of elements for free — which is to say that the maintenance (i.e., add-element and remove-element operations) of the sets pays the cost of keeping the elements available in sorted order.

Any sort of "removeAll" operation that relies on performing lookups is necessarily going to be worse than O(n).

(It occurs to me that the construction of the difference set — that is, the answer constructed from the linear pass over the two lists — could be O(n log n) if you're not extremely careful.)


Given that b is a subset of a I'm not sure why your pseudo-code has 2 loops. Mine would simply be:

foreach b in B
    remove b from A

In practice how the run time of this compares with the run time of yours depends on, among other things, how you have implemented the set as a data structure.


well, the correct idea was already pointed out: the set should be implemented using a hash. hashes ideally have O(1) access cost, so you can get O(min(m,n)) cost for the overall operation assuming you can determine which set is bigger (like maintaining a counter during insert/remove operations).

in actionscript 3, you'd use a Dictionary. just use elements as keys and values.

removing looks like this:

for each (var key:* in set2) {//a simple for-in loop will also do the trick, since keys and values are equal, but for-each-in loops perform faster
    delete set1[key];
}

in JavaScript, you will need to give the entries ids when inserting, so you can use those ids as keys in a map. simply map ids to original values.

removing looks like this:

for (var key in set2) {
    delete set1[key];
}


The operation as you are writing it is O(N^2), but if the sets are large, you might want to use a hash.

// A is some kind of array, O(1) iteration
// B is a hash containing elements to remove, O(1) contains(elt)
List<T> removeAll(List<T> A, Set<T> B) {
  List<T> result; // empty, could preallocate at |A|
  for (elt : A) { // for each 'elt' belonging to A, hence O(|A|)
    if (! B.contains(elt) ) { // O(1) thanks to hash
      C.add(elt) ; // ensure this is O(1) with preallocation or linked list
    }
  }
  return result;
}

This requires indexing the set B, so you need a hash function. In Java, you could use Set<T> Bh = new HashSet<T>(B); which is O(|B|) in time and memory. So overall we get O(|A|+|B|) in time and roughly O(2|A|+2|B|)) in memory. Sure beats the quadratic of removeAll, you will feel the difference (TM).

It is probably better to copy elements into a new array (as done in the pseudo code), since removing elements from A directly could lead to overhead if you keep elements in order (left shifting elements in A is costly).


You have seen the removeAll method in the Set interface?

Also check out this stack overflow question.


I believe you'll find java.util.HashSet.removeAll(Collection toRemove) to perform well. On the other hand, if you don't have sets but sorted collections, you might be able to do a lot better.

0

精彩评论

暂无评论...
验证码 换一张
取 消