How to keep only duplicates efficiently?_问答_开发者

Given an STL vector, output only the duplicates in sorted order, e.g.,

INPUT : { 4, 4, 1, 2, 3, 2, 3 }
OUTPUT: { 2, 3, 4 }

The algorithm is trivial, but the goal is to make it as efficient as std::unique(). My naive implementation modifies the container in-place:

My naive implementation:

void not_unique(vector<int>* pv)
{
    if (!pv)
        return;

 // Sort (in-place) so we can find duplicates in linear time
 sort(pv->begin(), pv->end());

 vector<int>::iterator it_start = pv->begin();
 while (it_start != pv->end())
 {
  size_t nKeep = 0;

  // Find the next different element
  vector<int>::iterator it_stop = it_start + 1;
  while (it开发者_开发技巧_stop != pv->end() && *it_start == *it_stop)
  {
   nKeep = 1; // This gets set redundantly
   ++it_stop;
  }

  // If the element is a duplicate, keep only the first one (nKeep=1).
  // Otherwise, the element is not duplicated so erase it (nKeep=0).
  it_start = pv->erase(it_start + nKeep, it_stop);
 }
}

If you can make this more efficient, elegant, or general, please let me know. For example, a custom sorting algorithm, or copy elements in the 2nd loop to eliminate the erase() call.

I miserably failed with my first attempt, assuming that std::unique moves all the duplicates to the end of the range (it doesn't). Oops. Here's another attempt:

Here is an implementation of not_unique. It removes any elements that appear only once in the sorted range and duplicates of any elements that appear more than once. The resulting range is therefore the unique list of elements that appear more than once.

The function has linear complexity and does a single pass over the range (std::unique has linear complexity). It must meet the requirements of a forward iterator. The end of the resulting range is returned.

template <typename It>
It not_unique(It first, It last)
{
    if (first == last) { return last; }

    It new_last = first;
    for (It current = first, next = ++first; next != last; ++current, ++next)
    {
        if (*current == *next)
        {
            if (current == new_last)
            {
                ++new_last;
            }
            else
            {
                *new_last++ = *current;
                while (next != last && *current == *next)
                {
                    ++current;
                    ++next;
                }
                if (next == last)
                    return new_last;
            }
        }
    }
    return new_last;
}

Shorter and more STL-ish than previous entries. Assumes sorted input.

#include <algorithm>
#include <functional>

template< class I, class P >
I remove_unique( I first, I last, P pred = P() ) {
    I dest = first;
    while (
        ( first = std::adjacent_find( first, last, pred ) )
            != last ) {
        * dest = * first;
        ++ first;
        ++ dest;
        if ( ( first = std::adjacent_find( first, last, std::not2( pred ) ) )
            == last ) break;
        ++ first;
    }
    return dest;
}

template< class I >
I remove_unique( I first, I last ) {
    return remove_unique( first, last,
        std::equal_to< typename std::iterator_traits<I>::value_type >() );
}

You can even use mismatch, for extra points!
Btw: nice exercise.

template<class TIter>
/** Moves duplicates to front, returning end of duplicates range.
 *  Use a sorted range as input. */
TIter Duplicates(TIter begin, TIter end) {
    TIter dup = begin;
    for (TIter it = begin; it != end; ++it) {
        TIter next = it;
        ++next;
        TIter const miss = std::mismatch(next, end, it).second;
        if (miss != it) {
            *dup++ = *miss;
            it = miss;
        }
    }
    return dup;
}

My suggestion would be a modified insertion sort, so that you can sort & filter dupes at the same time.

This is in the style of the standard library. Credit for algorithm goes to James! (If you +1 me, you better +1 him, or else). All I did was make it standard library style:

#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <vector>

// other stuff (not for you)
template <typename T>
void print(const char* pMsg, const T& pContainer)
{
    std::cout << pMsg << "\n    ";
    std::copy(pContainer.begin(), pContainer.end(),
        std::ostream_iterator<typename T::value_type>(std::cout, " "));
    std::cout << std::endl;
}

template <typename T, size_t N>
T* endof(T (&pArray)[N])
{
    return &pArray[0] + N;
}

// not_unique functions (for you)
template <typename ForwardIterator, typename BinaryPredicate>
ForwardIterator not_unique(ForwardIterator pFirst, ForwardIterator pLast,
                           BinaryPredicate pPred)
{
    // correctly handle case where an empty range was given:
    if (pFirst == pLast) 
    { 
        return pLast; 
    }

    ForwardIterator result = pFirst;
    ForwardIterator previous = pFirst;

    for (++pFirst; pFirst != pLast; ++pFirst, ++previous)
    {
        // if equal to previous
        if (pPred(*pFirst, *previous))
        {
            if (previous == result)
            {
                // if we just bumped bump again
                ++result;
            }
            else if (!pPred(*previous, *result))
            {
                // if it needs to be copied, copy it
                *result = *previous;

                // bump
                ++result;
            }
        }
    }

    return result;
}

template <typename ForwardIterator>
ForwardIterator not_unique(ForwardIterator pFirst, ForwardIterator pLast)
{
    return not_unique(pFirst, pLast,
                std::equal_to<typename ForwardIterator::value_type>());
}


//test
int main()
{
    typedef std::vector<int> vec;

    int data[] = {1, 4, 7, 7, 2, 2, 2, 3, 9, 9, 5, 4, 2, 8};
    vec v(data, endof(data));

    // precondition
    std::sort(v.begin(), v.end());
    print("before", v);

    // duplicatify (it's a word now)
    vec::iterator iter = not_unique(v.begin(), v.end());
    print("after", v);

    // remove extra
    v.erase(iter, v.end());
    print("erased", v);
}

Calling "erase(it_start + keep, it_stop);" from within the while loop is going to result in copying the remaining elements over and over again.

I'd suggest swapping all unique elements to the front of the vector, then erasing the remaining elements all at once.

int num_repeats(vector<int>::const_iterator curr, vector<int>::const_iterator end) {
  int same = *curr;
  int count = 0;
  while (curr != end && same == *curr) {
    ++curr;
    ++count;
  }
  return count;
}

void dups(vector<int> *v) {
  sort(v->begin(), v->end());
  vector<int>::iterator current = v->begin();
  vector<int>::iterator end_of_dups = v->begin();
  while (current != v->end()) {
    int n = num_repeats(current, v->end());
    if (n > 1) {
      swap(*end_of_dups, *current);
      end_of_dups++;
    }
    current += n;
  }
  v->erase(end_of_dups, v->end());
}

I think that from a big O standpoint, you have it implemented as good as it gets. The overriding cost is the sort, which is O(N log N). One possibility, though, would be to build up a new vector with the duplicate entries rather than use the existing vector with the delete operation removing the non-duplicates. However, this would only be better if the distinct number of duplicates is small relative to the total number of entries.

Consider the extreme example. If the original array consisted of 1,000 entries with only one duplicate, then the output would be a vector with just one value. It might be a bit more efficient to create the new vector with one entry rather than deleting the other 999 entries from the original vector. I suspect, however, that in real world testing, the savings of that change could be difficult to measure.

Edit I was just thinking about this in terms of "interview" question. In other words, this is not a terribly useful answer. But it would be possible to solve this in O(N) (linear time) as opposed to O(N Log N). Use storage space instead of CPU. Create two "bit" arrays with them cleared initially. Loop through your vector of integer values. Look up each value in the first bit array. If it is not set, then set the bit (set it to 1). If it is set, then set the corresponding bit in the second array (indicating a duplicate). After all vector entries are processed, scan through the second array and output the integers that are duplicates (indicated by the bits set in the second bit array). The reason for using bit arrays is just for space efficiency. If dealing with 4-byte integers, then the raw space required is (2 * 2^32 / 8 ). But this could actually be turned into a usable algorithm by making it a sparse array. The very pseudo pseudo-code would be something like this:

bitarray1[infinite_size];
bitarray2[infinite_size];

clear/zero bitarrays

// NOTE - do not need to sort the input
foreach value in original vector {
    if ( bitarray1[value] ) 
        // duplicate
        bitarray2[value] = 1
    bitarray1[value] = 1
}

// At this point, bitarray2 contains a 1 for all duplicate values.
// Scan it and create the new vector with the answer
for i = 0 to maxvalue
    if ( bitarray2[i] )
        print/save/keep i

Another one:

template <typename T>
void keep_duplicates(vector<T>& v) 
{
    set<T> 
        u(v.begin(), v.end()), // unique 
        d; // duplicates
    for (size_t i = 0; i < v.size(); i++)
        if (u.find(v[i]) != u.end())
            u.erase(v[i]);
        else
            d.insert(v[i]);

    v = vector<T>(d.begin(), d.end());
}

This fixes the bugs in James McNellis's original version. I also provide in-place and out-of-place versions.

// In-place version.  Uses less memory and works for more container
// types but is slower.
template <typename It>
It not_unique_inplace(It first, It last)
{
    if (first == last)
        return last;

    It new_last = first;
    for (It current = first, next = first + 1; next != last; ++current, ++next)
    {
        if (*current == *next && 
            (new_last == first || *current != *(new_last-1)))
            *new_last++ = *current;
    }
    return new_last;
}

// Out-of-place version. Fastest.
template <typename It, typename Container>
void not_unique(It first, It last, Container pout)
{
    if (first == last || !pout)
        return;

    for (It current = first, next = first + 1; next != last; ++current, ++next)
    {
        if (*current == *next && 
            (pout->empty() || *current != pout->back()))
            pout->push_back(*current);
    }
}

What is meant by "as efficient as std::unique"? Efficient in terms of runtime, development time, memory usage, or what?

As others pointed out, std::unique requires sorted input, which you haven't provided, so it's not a fair test to begin with.

Personally I would just have a std::map do all of my work for me. It has a lot of properties we can use for maximal elegance/brevity. It keeps its elements sorted already, and operator[] will insert a zero value if the key doesn't already exist. By leveraging those properties, we can get this done in two or three lines of code, and still achieve reasonable runtime complexity.

Basically, my algorithm is this: For each element in the vector, increment by one the map entry keyed by the value of that element. Afterwards, simply walk the map, outputting any key whose value is more than 1. Couldn't be simpler.

#include <iostream>
#include <vector>
#include <map>

void
output_sorted_duplicates(std::vector<int>* v)
{
   std::map<int, int> m;  

   // count how many of each element there are, putting results into map
   // map keys are elements in the vector, 
   // map values are the frequency of that element
   for (std::vector<int>::iterator vb = v->begin(); vb != v->end(); ++vb)
      ++m[*vb];

   // output keys whose values are 2 or more
   // the keys are already sorted by the map
   for (std::map<int, int>::iterator mb = m.begin(); mb != m.end(); ++mb)
      if ( (*mb).second >= 2 ) 
         std::cout << (*mb).first << " "; 
   std::cout << std::endl;
}

int main(void) 
{ 
   int initializer[] = { 4, 4, 1, 2, 3, 2, 3 };
   std::vector<int> data(&initializer[0], &initializer[0] + 7);
   output_sorted_duplicates(&data);
}

janks@phoenix:/tmp$ g++ test.cc && ./a.out
2 3 4

So, we visit each element in your vector once, and then each element in my map once, where the number of elements in my map is at worst no bigger than your vector. The drawbacks to my solution are a lot more storage space than the solutions that involve rearranging your vector in-place. The advantages, however, are clear. It's incredibly short and simple, it's obviously correct without the need for much testing or code review, and it has reasonable performance properties.

Making my function a template, and making it operate on STL-style ranges instead of just vectors of ints, is left as an exercise.