Algorithm - How to delete duplicate elements in a list efficiently?_问答_开发者

There is a list L. It contains elements of arbitrary type each. How to delete all duplicate elements in such list efficiently? ORDER must be preserved

Just an algorithm is required, so no import any external library is allowed.

Special Case: Hashing and Equality

Firstly, we need to determine something about the assumptions, namely the existence of an equals and has function relationship. What do I mean by this? I mean that for the set of source objects S, given any two objects x1 and x2 that are elements of S there exists a (hash) function F such that:

if (x1.equals(x2)) then F(x1) == F(x2)

Java has such a relationship. That allows you to check to duplicates as a near O(1) operation and thus reduces the algorithm to a simple O(n) problem. If order is unimportant, it's a simple one liner:

List result = new ArrayList(new HashSet(inputList));

If order is important:

List outputList = new ArrayList();
Set set = new HashSet();
for (Object item : inputList) {
  if (!set.contains(item)) {
    outputList.add(item);
    set.add(item);
  }
}

You will note that I said "near O(1)". That's because such data structures (as a Java HashMap or HashSet) rely on a method where a portion of the hash code is used to find an element (often called a bucket) in the backing storage. The number of buckets is a power-of-2. That way the index into that list is easy to calculate. hashCode() returns an int. If you have 16 buckets you can find which one to use by ANDing the hashCode with 15, giving you a number from 0 to 15.

When you try and put something in that bucket it may already be occupied. If so then a linear comparison of all entries in that bucket will occur. If the collision rate gets too high or you try to put too many elements in the structure will be grown, typically doubled (but always by a power-of-2) and all the items are placed in their new buckets (based on the new mask). Thus resizing such structures is relatively expensive.

Lookup may also be expensive. Consider this class:

public class A {
  private final int a;

  A(int a) { this.a == a; }

  public boolean equals(Object ob) {
    if (ob.getClass() != getClass()) return false;
    A other = (A)ob;
    return other.a == a;
  }

  public int hashCode() { return 7; }
}

This code is perfectly legal and it fulfills the equals-hashCode contract.

Assuming your set contains nothing but A instances, your insertion/search now turns into an O(n) operation, turning the entire insertion into O(n²).

Obviously this is an extreme example but it's useful to point out that such mechanisms also rely on a relatively good distribution of hashes within the value space the map or set uses.

Finally, it must be said that this is a special case. If you're using a language without this kind of "hashing shortcut" then it's a different story.

General Case: No Ordering

If no ordering function exists for the list then you're stuck with an O(n²) brute-force comparison of every object to every other object. So in Java:

List result = new ArrayList();
for (Object item : inputList) {
  boolean duplicate = false;
  for (Object ob : result) {
    if (ob.equals(item)) {
      duplicate = true;
      break;
    }
  }
  if (!duplicate) {
    result.add(item);
  }
}

General Case: Ordering

If an ordering function exists (as it does with, say, a list of integers or strings) then you sort the list (which is O(n log n)) and then compare each element in the list to the next (O(n)) so the total algorithm is O(n log n). In Java:

Collections.sort(inputList);
List result = new ArrayList();
Object prev = null;
for (Object item : inputList) {
  if (!item.equals(prev)) {
    result.add(item);
  }
  prev = item;
}

Note: the above examples assume no nulls are in the list.

If the order does not matter, you might want to try this algorithm written in Python:

>>> array = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6]
>>> unique = set(array)
>>> list(unique)
[1, 2, 3, 4, 5, 6]

in haskell this would be covered by the nub and nubBy functions

nub :: Eq a => [a] -> [a]
nub [] = []
nub (x:xs) = x : nub (filter (/= x) xs)

nubBy :: (a -> a -> Bool) -> [a] -> [a]
nubBy f [] = []
nubBy f (x:xs) = x : nub (filter (not.f x) xs)

nubBy relaxes the dependence on the Eq typeclass, instead allowing you to define your own equality function to filter duplicates.

These functions work over a list of consistent arbitrary types (e.g. [1,2,"three"] is not allowed in haskell), and they are both order preserving.

In order to make this more efficient, using Data.Map (or implementing a balanced tree) could be used to gather the data into a set (key being the element, and value being the index into the original list in order to be able to get the original ordering back), then gathering the results back into a list and sorting by index. I will try and implement this later.

import qualified Data.Map as Map

undup x = go x Map.empty
    where
        go [] _ = []
        go (x:xs) m case Map.lookup x m of
                         Just _  -> go xs m
                         Nothing -> go xs (Map.insert x True m)

This is a direct translation of @FogleBird's solution. Unfortunately it doesn't work without the import.

a Very basic attempt at replacing Data.Map import would be to implement a tree, something like this

data Tree a = Empty
            | Node a (Tree a) (Tree a)
            deriving (Eq, Show, Read)

insert x Empty = Node x Empty Empty
insert x (Node a left right)
    | x < a = Node a (insert x left) right
    | otherwise = Node a left (insert x right)

lookup x Empty = Nothing --returning maybe type to maintain compatibility with Data.Map
lookup x (Node a left right)
    | x == a = Just x
    | x < a = lookup x left
    | otherwise = lookup x right

an improvement would be to make it autobalancing on insert by maintaining a depth attribute (keeps the tree from degrading into a linked list). This nice thing about this over a hash table is that it only requires your type to be in the typeclass Ord, which is easily derivable for most types.

I take requests it seems. In response to @Jonno_FTWs inquiry here is a solution which completely removes duplicates from the result. It's not entirely dissimilar to the original, simply adding an extra case. However the runtime performance will be much slower since you are going through each sub-list twice, once for the elem, and the second time for the recusion. Also note that now it will not work on infinite lists.

nub [] = []
nub (x:xs) | elem x xs = nub (filter (/=x) xs)
           | otherwise = x : nub xs

Interestingly enough you don't need to filter on the second recursive case because elem has already detected that there are no duplicates.

In Python

>>> L = [2, 1, 4, 3, 5, 1, 2, 1, 1, 6, 5]
>>> a=[]
>>> for i in L:
...   if not i in a:
...     a.append(i)
...
>>> print a
[2, 1, 4, 3, 5, 6]
>>>

In java, it's a one liner.

Set set = new LinkedHashSet(list);

will give you a collection with duplicate items removed.

For Java could go with this:

private static <T> void removeDuplicates(final List<T> list)
{
    final LinkedHashSet<T> set;

    set = new LinkedHashSet<T>(list); 
    list.clear(); 
    list.addAll(set);
}

Delete duplicates in a list inplace in Python

Case: Items in the list are not hashable or comparable

That is we can't use set (dict) or sort.

from itertools import islice

def del_dups2(lst):
    """O(n**2) algorithm, O(1) in memory"""
    pos = 0
    for item in lst:
        if all(item != e for e in islice(lst, pos)):
            # we haven't seen `item` yet
            lst[pos] = item
            pos += 1
    del lst[pos:]

Case: Items are hashable

Solution is taken from here:

def del_dups(seq):
    """O(n) algorithm, O(log(n)) in memory (in theory)."""
    seen = {}
    pos = 0
    for item in seq:
        if item not in seen:
            seen[item] = True
            seq[pos] = item
            pos += 1
    del seq[pos:]

Case: Items are comparable, but not hashable

That is we can use sort. This solution doesn't preserve original order.

def del_dups3(lst):
    """O(n*log(n)) algorithm, O(1) memory"""
    lst.sort()
    it = iter(lst)
    for prev in it: # get the first element 
        break
    pos = 1 # start from the second element
    for item in it: 
        if item != prev: # we haven't seen `item` yet
            lst[pos] = prev = item
            pos += 1
    del lst[pos:]

go through the list and assign sequential index to each item
sort the list basing on some comparison function for elements
remove duplicates
sort the list basing on assigned indices

for simplicity indices for items may be stored in something like std::map

looks like O(n*log n) if I haven't missed anything

It depends on what you mean by "efficently". The naive algorithm is O(n^2), and I assume what you actually mean is that you want something of lower order than that.

As Maxim100 says, you can preserve the order by pairing the list with a series of numbers, use any algorithm you like, and then resort the remainder back into their original order. In Haskell it would look like this:

superNub :: (Ord a) => [a] -> [a]
superNub xs = map snd 
              . sortBy (comparing fst) 
              . map head . groupBy ((==) `on` snd) 
              . sortBy (comparing snd) 
              . zip [1..] $ xs

Of course you need to import Data.List (sort), Data.Function (on) and Data.Ord (comparing). I could just recite the definitions of those functions, but what would be the point?

I've written an algorithm for string. Actually it does not matter what type do you have.

static string removeDuplicates(string str)
{
    if (String.IsNullOrEmpty(str) || str.Length < 2) {
        return str;
    }

    char[] arr = str.ToCharArray();
    int len = arr.Length;
    int pos = 1;

    for (int i = 1; i < len; ++i) {

        int j;

        for (j = 0; j < pos; ++j) {
            if (arr[i] == arr[j]) {
                break;
            }
        }

        if (j == pos) {
            arr[pos] = arr[i];
            ++pos;
        }
    }

    string finalStr = String.Empty;
    foreach (char c in arr.Take(pos)) {
        finalStr += c.ToString();
    }

    return finalStr;
}

One line solution in Python.
Using lists-comprehesion:

>>> L = [2, 1, 4, 3, 5, 1, 2, 1, 1, 6, 5]
>>> M = []
>>> zip(*[(e,M.append(e)) for e in L if not e in M])[0]
(2, 1, 4, 3, 5, 6)

Maybe you should look into using associate arrays (aka dict in python) to avoid having duplicate elements in the first place.

My code in Java:

ArrayList<Integer> list = new ArrayList<Integer>();

list.addAll({1,2,1,3,4,5,2,3,4,3});

for (int i=0; i<list.size(); i++)
{
    for (int j=i+1; j<list.size(); j++)
    {
        if (list.get(i) == list.get(j))
        {
            list.remove(i);
            j--;
        }
    }
}

or simply do this:

SetList<Integer> unique = new SetList<Integer>();

unique.addAll(list);

Both ways have Time = nk ~ O(n^2)

where n is the size of input list,

k is number of unique members of the input list

Algorithm delete_duplicates (a[1....n])

//Remove duplicates from the given array

//input parameters :a[1:n], an array of n elements

{

temp[1:n]; //an array of n elements

 temp[i]=a[i];for i=1 to n

     temp[i].value=a[i]

        temp[i].key=i

*//based on 'value' sort the array temp.*

//based on 'value' delete duplicate elements from temp.

//based on 'key' sort the array temp.//construct an array p using temp.

p[i]=temp[i].value

return p

In other of elements is maintained in the output array using the 'key'. Consider the key is of length O(n), the time taken for performing sorting on the key and value is O(nlogn). So the time taken to delete all duplicates from the array is O(nlogn).

Generic solution close to the accepted answer

k = ['apple', 'orange', 'orange', 'grapes', 'apple', 'apple', 'apple']
m = []


def remove_duplicates(k):
    for i in range(len(k)):
        for j in range(i, len(k)-1):
            if k[i] == k[j+1]:
                m.append(j+1)

    l = list(dict.fromkeys(m))
    l.sort(reverse=True)

    for i in l:
        k.pop(i)

    return k


print(remove_duplicates(k))