Removing files with duplicate content from single directory [Perl, or algorithm]_问答_开发者

I have a folder with large number of files, some of with have exactly the same contents. I want to remove files with duplicate contents, meaning if two or more files with duplicate content found, I'd like to leave one of these files, and delete the others.

Following is what I came up with, but I don't know if it works :) , didn't try it yet.

How would you do it? Perl or general algorithm.

use strict;
use warnings;

my @files = <"./files/*.txt">;

my $current = 0;

while( $current <= $#files ) {

    # read contents of $files[$current] into $contents1 scalar

    my $compareTo = $current + 1;
    while( $compareTo <= $#files ) {

        # read contents of $files[compareTo] into $contents2 scalar

        if( $contents1 eq $contents2 ) {
            splice(@files, $compareTo, 1);开发者_JAVA技巧
            # delete $files[compareTo] here
        }
        else {
            $compareTo++;
        }
    }

    $current++;
}

Here's a general algorithm (edited for efficiency now that I've shaken off the sleepies -- and I also fixed a bug that no one reported)... :)

It's going to take forever (not to mention a lot of memory) if I compare every single file's contents against every other. Instead, why don't we apply the same search to their sizes first, and then compare checksums for those files of identical size.

So then when we ~~md5sum every file (see Digest::MD5)~~ calculate their sizes, we can use a hash table to do our matching for us, storing the matches together in arrayrefs:

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %files_by_size;
foreach my $file (@ARGV)
{
    push @{$files_by_size{-s $file}}, $file;   # store filename in the bucket for this file size (in bytes)
}

Now we just have to pull out the potential duplicates and check if they are the same (by creating a checksum for each, using Digest::MD5), using the same hashing technique:

while (my ($size, $files) = each %files_by_size)
{
    next if @$files == 1;

    my %files_by_md5;
    foreach my $file (@$files_by_md5)
    {
        open my $filehandle, '<', $file or die "Can't open $file: $!";
        # enable slurp mode
        local $/;
        my $data = <$filehandle>;
        close $filehandle;

        my $md5 = md5_hex($data);
        push @{$files_by_md5{$md5}}, $file;       # store filename in the bucket for this MD5
    }

    while (my ($md5, $files) = each %files_by_md5)
    {
        next if @$files == 1;
        print "These files are equal: " . join(", ", @$files) . "\n";
    }
}

-fini

Perl, with Digest::MD5 module.

use Digest::MD5 ;
%seen = ();
while( <*> ){
    -d and next;
    $filename="$_"; 
    print "doing .. $filename\n";
    $md5 = getmd5($filename) ."\n";    
    if ( ! defined( $seen{$md5} ) ){
        $seen{$md5}="$filename";
    }else{
        print "Duplicate: $filename and $seen{$md5}\n";
    }
}
sub getmd5 {
    my $file = "$_";            
    open(FH,"<",$file) or die "Cannot open file: $!\n";
    binmode(FH);
    my $md5 = Digest::MD5->new;
    $md5->addfile(FH);
    close(FH);
    return $md5->hexdigest;
}

If Perl is not a must and you are working on *nix, you can use shell tools

find /path -type f -print0 | xargs -0 md5sum | \
    awk '($1 in seen){ print "duplicate: "$2" and "seen[$1] } \
         ( ! ($1 in  seen ) ) { seen[$1]=$2 }'

md5sum *.txt | perl -ne '
   chomp; 
   ($sum, $file) = split(" "); 
   push @{$files{$sum}}, $file; 
   END {
      foreach (keys %files) { 
         shift @{$files{$_}}; 
         unlink @{$files{$_}} if @{$files{$_}};
      }
   }
'

Perl is kinda overkill for this:

md5sum * | sort | uniq -w 32 -D | cut -b 35- | tr '\n' '\0' | xargs -0 rm

(If you are missing some of these utilities or they don't have these flags/functions, install GNU findutils and coreutils.)

Variations on a theme:

md5sum *.txt | perl -lne '
  my ($sum, $file) = split " ", $_, 2;
  unlink $file if $seen{$sum} ++;
'

No need to go and keep a list, just to remove one from the list and delete the rest; simply keep track of what you've seen before, and remove any file matching a sum that's already been seen. The 2-limit split is to do the right thing with filenames containing spaces.

Also, if you don't trust this, just change the word unlink to print and it will output a list of files to be removed. You can even tee that output to a file, and then rm $(cat to-delete.txt) in the end if it looks good.

a bash script is more expressive than perl in this case:

md5sum * |sort -k1|uniq -w32 -d|cut -f2 -d' '|xargs rm

I'd recommend that you do it in Perl, and use File::Find while you're at it.
Who knows what you're doing to generate your list of files, but you might want to combine it with your duplicate checking.

perl -MFile::Find -MDigest::MD5 -e '
my %m;
find(sub{
  if(-f&&-r){
   open(F,"<",$File::Find::name);
   binmode F;
   $d=Digest::MD5->new->addfile(F);
   if(exists($m{$d->hexdigest}){
     $m{$d->hexdigest}[5]++;
     push $m{$d->hexdigest}[0], $File::Find::name;
   }else{
     $m{$d->hexdigest} = [[$File::Find::name],0,0,0,0,1];
   }
   close F
 }},".");
 foreach $d (keys %m) {
   if ($m{$d}[5] > 1) {
     print "Probable duplicates: ".join(" , ",$m{$d}[0])."\n\n";
   }
 }'

Here is a way of filtering by size first and by md5 checksum second:

#!/usr/bin/perl

use strict; use warnings;

use Digest::MD5 qw( md5_hex );
use File::Slurp;
use File::Spec::Functions qw( catfile rel2abs );
use Getopt::Std;

my %opts;

getopt('de', \%opts);
$opts{d} = '.' unless defined $opts{d};
$opts{d} = rel2abs $opts{d};

warn sprintf "Checking %s\n", $opts{d};

my $files = get_same_size_files( \%opts );

$files = get_same_md5_files( $files );

for my $size ( keys %$files ) {
    for my $digest ( keys %{ $files->{$size}} ) {
        print "$digest ($size)\n";
        print "$_\n" for @{ $files->{$size}->{$digest} };
        print "\n";
    }
}

sub get_same_md5_files {
    my ($files) = @_;

    my %out;

    for my $size ( keys %$files ) {
        my %md5;
        for my $file ( @{ $files->{$size}} ) {
            my $contents = read_file $file, {binmode => ':raw'};
            push @{ $md5{ md5_hex($contents) } }, $file;
        }
        for my $k ( keys %md5 ) {
            delete $md5{$k} unless @{ $md5{$k} } > 1;
        }
        $out{$size} = \%md5 if keys %md5;
    }
    return \%out;
}

sub get_same_size_files {
    my ($opts) = @_;

    my $checker = defined($opts->{e})
                ? sub { scalar ($_[0] =~ /\.$opts->{e}\z/) }
                : sub { 1 };

    my %sizes;
    my @files = grep { $checker->($_) } read_dir $opts->{d};

    for my $file ( @files ) {
        my $path = catfile $opts->{d}, $file;
        next unless -f $path;

        my $size = (stat $path)[7];
        push @{ $sizes{$size} }, $path;
    }

    for my $k (keys %sizes) {
        delete $sizes{$k} unless @{ $sizes{$k} } > 1;
    }

    return \%sizes;
}

You might want to have a look at how I did to find duplicate files and remove them. Though you have to modify it to your needs.

http://priyank.co.in/remove-duplicate-files