std::vector differences_问答_开发者_运维开发者技术经验分享

How does one determine what the differences of 2 vectors are?

I have vector<int> v1 and vector开发者_如何学JAVA<int> v2;

What I am looking for is a vector<int> vDifferences that contains only elements that are only in v1 or v2.

Is there a standard way to do this?

Here is the complete and correct answer. Before the set_symmetric_difference algorithm can be used, the source ranges must be ordered:

  using namespace std; // For brevity, don't do this in your own code...

  vector<int> v1;
  vector<int> v2;

  // ... Populate v1 and v2

  // For the set_symmetric_difference algorithm to work, 
  // the source ranges must be ordered!    
  vector<int> sortedV1(v1);
  vector<int> sortedV2(v2);

  sort(sortedV1.begin(),sortedV1.end());
  sort(sortedV2.begin(),sortedV2.end());

  // Now that we have sorted ranges (i.e., containers), find the differences    
  vector<int> vDifferences;

  set_symmetric_difference(
    sortedV1.begin(),
    sortedV1.end(),
    sortedV2.begin(),
    sortedV2.end(),
    back_inserter(vDifferences));

  // ... do something with the differences

It should be noted that sorting is an expensive operation (i.e., O(n log n) for common STL implementations). Especially for the case of one or both of the containers being very large (i.e., millions of integers or more), a different algorithm using hash tables may be preferable based on algorithmic complexity. Here is a high level description of this algorithm:

Load each container into a hash table.

If the two containers differ in size, the hash table corresponding to the smaller one will be used for traversal in Step 3. Otherwise, the first of the two hash tables will be used.

Traverse the hash table chosen in Step 2, checking to see if each item is present in both hash tables. If it is, remove it from both of them. The reason that the smaller hash table is preferred for traversal is because hash table lookups are on the average O(1) regardless of container size. Therefore, the time to traverse is a linear function of n (i.e., O(n)), where n is the size of the hash table being traversed.

Take the union of the remaining items in the hash tables and store the result in a difference container.

C++11 offers us some capability for such a solution by standardizing the unordered_multiset container. I also employed the new usage of the auto keyword for explicit initializations to make the following hash table based solution more concise:

using namespace std; // For brevity, don't do this in your own code...

// The remove_common_items function template removes some and / or all of the
// items that appear in both of the multisets that are passed to it. It uses the
// items in the first multiset as the criteria for the multi-presence test.
template <typename tVal>
void remove_common_items(unordered_multiset<tVal> &ms1, 
                         unordered_multiset<tVal> &ms2)
{
  // Go through the first hash table
  for (auto cims1=ms1.cbegin();cims1!=ms1.cend();)
  {
    // Find the current item in the second hash table
    auto cims2=ms2.find(*cims1);

    // Is it present?
    if (cims2!=ms2.end())
    {
      // If so, remove it from both hash tables
      cims1=ms1.erase(cims1);
      ms2.erase(cims2);
    }
    else // If not
      ++cims1; // Move on to the next item
  }
}

int main()
{
  vector<int> v1;
  vector<int> v2;

  // ... Populate v1 and v2

  // Create two hash tables that contain the values
  // from their respective initial containers    
  unordered_multiset<int> ms1(v1.begin(),v1.end());
  unordered_multiset<int> ms2(v2.begin(),v2.end());

  // Remove common items from both containers based on the smallest
  if (v1.size()<=v2.size)
    remove_common_items(ms1,ms2);
  else
    remove_common_items(ms2,ms1);

  // Create a vector of the union of the remaining items
  vector<int> vDifferences(ms1.begin(),ms1.end());

  vDifferences.insert(vDifferences.end(),ms2.begin(),ms2.end());

  // ... do something with the differences
}

In order to determine which solution is better for a particular situation, profiling both algorithms would be a smart course of action. Although the hash table based solution is in O(n), it requires more code and does more work per duplicate found (i.e., hash table removals). It also (sadly) uses a custom differencing function rather than a standard STL algorithm.

It should be noted that both solutions present the differences in an order that is most likely quite different from the order that the elements appeared in the original containers. There is a way around this by using a variant of the hash table solution. A high level description follows (which only differs in Step 4 from the preceding solution):

Load each container into a hash table.

If the two containers differ in size, the smaller hash table will be used for traversal in Step 3. Otherwise, the first of the two will be used.

Traverse the hash table chosen in Step 2, checking to see if each item is present in both hash tables. If it is, remove it from both of them.

To form the difference container, traverse the original containers in order (i.e., the first container before the second). Look up each item from each container in its respective hash table. If it is found, the item is to be added to the difference container and removed from its hash table. Items not present in the respective hash tables will be skipped. Thus, only the items that are present in the hash tables will wind up in the difference container and their order of appearance will remain the same as it was in the original containers, because those containers dictate the order of the final traversal.

In order to maintain original order, Step 4 has become more expensive than in the previous solution, especially if the number of items removed is high. This is because:

All items will be tested a second time for eligibility to appear in the difference container, via a presence test in their respective hash tables.
The hash tables will have the remainder of their items removed one at a time when the difference container is formed, as part of the present in the difference testing of Item 1.

Do you want elements from both v1 and v2 that are unique and not in the other sequence? That sounds like std::set_symmetric_difference to me.

Copies the elements of the range [first1,last1) that are not present in the range [first2, last2), and the elements of the range [first2,last2) that are not present in the range [first1, last1) to the range beginning at result. The elements in the constructed range are sorted.