开发者

What's the most efficient way to check for duplicates in an array of data using Perl?

开发者 https://www.devze.com 2023-01-03 10:53 出处:网络
I n开发者_JS百科eed to see if there are duplicates in an array of strings, what\'s the most time-efficient way of doing it?One of the things I love about Perl is it\'s ability to almost read like Engl

I n开发者_JS百科eed to see if there are duplicates in an array of strings, what's the most time-efficient way of doing it?


One of the things I love about Perl is it's ability to almost read like English. It just sort of makes sense.

use strict;
use warnings;

my @array = qw/yes no maybe true false false perhaps no/;

my %seen;

foreach my $string (@array) {

    next unless $seen{$string}++;
    print "'$string' is duplicated.\n";
}

Output

'false' is duplicated.

'no' is duplicated.


Turning the array into a hash is the fastest way [O(n)], though its memory inefficient. Using a for loop is a bit faster than grep, but I'm not sure why.

#!/usr/bin/perl

use strict;
use warnings;

my %count;
my %dups;
for(@array) {
    $dups{$_}++ if $count{$_}++;
}

A memory efficient way is to sort the array in place and iterate through it looking for equal and adjacent entries.

# not exactly sort in place, but Perl does a decent job optimizing it
@array = sort @array;

my $last;
my %dups;
for my $entry (@array) {
    $dups{$entry}++ if defined $last and $entry eq $last;
    $last = $entry;
}

This is nlogn speed, because of the sort, but only needs to store the duplicates rather than a second copy of the data in %count. Worst case memory usage is still O(n) (when everything is duplicated) but if your array is large and there's not a lot of duplicates you'll win.

Theory aside, benchmarking shows the latter starts to lose on large arrays (like over a million) with a high percentage of duplicates.


If you need the uniquified array anyway, it is fastest to use the heavily-optimized library List::MoreUtils, and then compare the result to the original:

use strict;
use warnings;
use List::MoreUtils 'uniq';

my @array = qw(1 1 2 3 fibonacci!);
my @array_uniq = uniq @array;
print ((scalar(@array) == scalar(@array_uniq)) ? "no dupes" : "dupes") . " found!\n";

Or if the list is large and you want to bail as soon as a duplicate entry is found, use a hash:

my %uniq_elements;
foreach my $element (@array)
{
    die "dupe found!" if $uniq_elements{$element}++;
}


Create a hash or a set or use a collections.Counter().

As you encounter each string/input check to see if there's an instance of that in the hash. If so, it's a duplicate (do whatever you want about those). Otherwise add a value (such as, oh, say, the numeral one) to the hash, using the string as the key.

Example (using Python collections.Counter):

#!python
import collections
counts = collections.Counter(mylist)
uniq = [i for i,c in counts.iteritems() if c==1]
dupes = [i for i, c in counts.iteritems() if c>1]

These Counters are built around dictionaries (Pythons name for hashed mapping collections).

This is time efficient because hash keys are indexed. In most cases the lookup and insertion time for keys is done in near constant time. (In fact Perl "hashes" are so-called because they are implemented using an algorithmic trick called "hashing" --- a sort of checksum chosen for its extremely low probability of collision when fed arbitrary inputs).

If you initialize values to integers, starting with 1, then you can increment each value as you find its key already in the hash. This is just about the most efficient general purpose means of counting strings.


Not a direct answer, but this will return an array without duplicates:

#!/usr/bin/perl

use strict;
use warnings;

my @arr = ('a','a','a','b','b','c');
my %count;
my @arr_no_dups = grep { !$count{$_}++ } @arr;

print @arr_no_dups, "\n";


Please don't ask about the most time efficient way to do something unless you have some specific requirements, such as "I have to dedupe a list of 100,000 integers in under a second." Otherwise, you're worrying about how long something takes for no reason.


similar to @Schwern's second solution, but checks for duplicates a little earlier from within the comparison function of sort:

use strict;
use warnings;

@_ = sort { print "dup = $a$/" if $a eq $b; $a cmp $b } @ARGV;

it won't be as fast as the hashing solutions, but it requires less memory and is pretty darn cute

0

精彩评论

暂无评论...
验证码 换一张
取 消