开发者

What's the best way to compare arrays of strings in perl

开发者 https://www.devze.com 2023-02-28 18:55 出处:网络
I\'m trying to compare multiple arrays of strings containing file listings of directories. The objective is to determine which files exist in each directory AND which files do not exists. Consider:

I'm trying to compare multiple arrays of strings containing file listings of directories. The objective is to determine which files exist in each directory AND which files do not exists. Consider:

List1    List2    List3    List4
a        a        e        f
b        b        d        g
c        f        a        h

The outcome should be:

List1:

        List1    List2    List3    List4
 a      yes      yes      yes      no
 b      yes      yes      no       no
 c      yes      no       no       no

List2:

        List1    List2    List3    List4
 a      yes      yes      yes      no
 b      yes      yes      no       no
 f      no       yes      no       yes

...

I could go through all the arrays and go through each entry, go through all the other arrays and do a grep:

 for my $curfile (@currentdirfiles) {
   if( grep(/$curfile/, @otherarrsfiles) ) {
        // Set 'yes'
   } else {
        // set 'no'
   }
 }

My only concern is that I am ending up w开发者_StackOverflowith a 0^2n order of magnitude. I may not be able to do anything about this since I would end up looping through all the arrays anyway. One improvement may be in the grep function, but I'm not sure.

Any thoughts?


For lots of string lookups, you generally want to use hashes. Here's one way of doing it:

use strict;
use warnings;

# Define the lists:
my @lists = (
  [qw(a b c)], # List 1
  [qw(a b f)], # List 2
  [qw(e d a)], # List 3
  [qw(f g h)], # List 4
);

# For each file, determine which lists it is in:
my %included;

for my $n (0 .. $#lists) {
  for my $file (@{ $lists[$n] }) {
    $included{$file}[$n] = 1;
  } # end for each $file in this list
} # end for each list number $n

# Print out the results:
my $fileWidth = 8;

for my $n (0 .. $#lists) {

  # Print the header rows:
  printf "\nList %d:\n", $n+1;

  print ' ' x $fileWidth;
  printf "%-8s", "List $_" for 1 .. @lists;
  print "\n";

  # Print a line for each file:
  for my $file (@{ $lists[$n] }) {
    printf "%-${fileWidth}s", $file;

    printf "%-8s", ($_ ? 'yes' : 'no') for @{ $included{$file} }[0 .. $#lists];
    print "\n";
  } # end for each $file in this list
} # end for each list number $n


Why not just remember where each file is when you're reading them in.

Let's say you have a list of directories to read from in @dirlist:

use File::Slurp qw( read_dir );
my %in_dir;
my %dir_files;

foreach my $dir ( @dirlist ) {
    die "No such directory $dir" unless -d $dir;
    foreach my $file ( read_dir($dir) ) {
        $in_dir{$file}{$dir} = 1;
        push @{ $dir_files{$dir} }, $file;
    }
}

Now $in_dir{filename} will have entries defined for each directory of interest, and $dir_files{directory} will have a list of files for each directory...

foreach my $dir ( @dirlist ) {
    print "$dir\n";
    print join("\t", "", @dirlist);
    foreach my $file ( @{ $dir_files{$dir} } ) {
        my @info = ($file);
        foreach my $dir_for_file ( @dirlist ) {
            if ( defined $in_dir{$file}{$dir_for_file} ) {
                push @info, "Yes";
            } else {
                push @info, "No";
            }
        }
        print join("\t", @info), "\n";
    }
}


The clearest way is to use perl5i and autoboxing:

use perl5i;
my @list1 = qw(one two three);
my @list2 = qw(one two four);    

my $missing = @list1 -> diff(\@list2);
my $both = @list1 -> intersect(\@list2);

In a more restricted setup, use hashes for this as the filenames will be unique:

sub in_list {
   my ($one, $two) = @_;
   my (@in, @out);
   my %a = map {$_ => 1} @$one;

   foreach my $f (@$two) {
      if ($a{$f}) {
          push @in, $f;
      }
      else {
          push @out, $f;
      }
   }  
   return (\@in, \@out);
}

my @list1 = qw(one two three);
my @list2 = qw(one two four);    
my ($in, $out) = in_list(\@list1, \@list2);

print "In list 1 and 2:\n";
print "  $_\n" foreach @$in;

print "In list 2 and not in list 1\n";
print "  $_\n" foreach @$out;


Now that the question has been amended, this produces the answer you want. It does work in O(n3) time, which is optimal for the problem (there are n3 outputs).

#!/usr/bin/env perl

use strict;
use warnings;

#List1    List2    List3    List4
#a        a        e        f
#b        b        d        g
#c        f        a        h

my(@lists) = ( { a => 1, b => 1, c => 1 },
               { a => 1, b => 1, f => 1 },
               { e => 1, d => 1, a => 1 },
               { f => 1, g => 1, h => 1 },
             );

my $i = 0;
foreach my $list (@lists)
{
    analyze(++$i, $list, @lists);
}

sub analyze
{
    my($num, $ref, @lists) = @_;
    printf "List %d\n", $num;

    my $pad = "     ";
    foreach my $i (1..4)
    {
        print "$pad   List$i";
        $pad = "";
    }
    print "\n";

    foreach my $file (sort keys %{$ref})
    {
        printf "%-8s", $file;
        foreach my $list (@lists)
        {
            my %dir = %{$list};
            printf "%-8s", (defined $dir{$file}) ? "yes" : "no";
        }
        print "\n";
    }
    print "\n";
}

The output I get is:

List 1
        List1   List2   List3   List4
a       yes     yes     yes     no      
b       yes     yes     no      no      
c       yes     no      no      no      

List 2
        List1   List2   List3   List4
a       yes     yes     yes     no      
b       yes     yes     no      no      
f       no      yes     no      yes     

List 3
        List1   List2   List3   List4
a       yes     yes     yes     no      
d       no      no      yes     no      
e       no      no      yes     no      

List 4
        List1   List2   List3   List4
f       no      yes     no      yes     
g       no      no      no      yes     
h       no      no      no      yes     


My code is simpler but the output isn't quite what you want:

@lst1=('a', 'b', 'c');
@lst2=('a', 'b', 'f');
@lst3=('e', 'd', 'a');
@lst4=('f', 'g', 'h');

%hsh=();

foreach $item (@lst1) {
    $hsh{$item}="list1";
}

foreach $item (@lst2) {
    if (defined($hsh{$item})) {
        $hsh{$item}=$hsh{$item}." list2";
    }
    else {
        $hsh{$item}="list2";
    }
}

foreach $item (@lst3) {
    if (defined($hsh{$item})) {
        $hsh{$item}=$hsh{$item}." list3";
    }
    else {
        $hsh{$item}="list3";
    }
}

foreach $item (@lst4) {
    if (defined($hsh{$item})) {
        $hsh{$item}=$hsh{$item}." list4";
    }
    else {
        $hsh{$item}="list4";
    }
}

foreach $key (sort keys %hsh) {
    printf("%s %s\n", $key, $hsh{$key});
}

Gives:

a list1 list2 list3
b list1 list2
c list1
d list3
e list3
f list2 list4
g list4
h list4


Sorry for the late reply, I've been polishing this a while, because I did not want yet another negative score (bums me out).

This is an interesting efficiency problem. I don't know if my solution will work for you, but I thought I would share it anyway. It is probably efficient only if your arrays do not change too often, and if your arrays contain many duplicate values. I have not run any efficiency checks on it.

Basically, the solution is to remove one dimension of the cross checking by turning the array values into bits, and doing a bitwise comparison on the entire array in one go. Array values are deduped, sorted and given a serial number. The arrays total serial numbers are then stored in a single value by bitwise or. A single array can thereby be checked for a single serial number with only one operation, e.g.:

if ( array & serialno )

It will require one run to prepare the data, which can then be saved in cache or similar. This data can then be used until your data changes (e.g. files/folders are removed or added). I have added a fatal exit on undefined values, which means the data must be refreshed when it occurs.

Good luck!

use strict;
use warnings;

my @list1=('a', 'b', 'c');
my @list2=('a', 'b', 'f');
my @list3=('e', 'd', 'a');
my @list4=('f', 'g', 'h');

# combine arrays
my @total = (@list1, @list2, @list3, @list4);

# dedupe (Thanks Xetius for this code snippet)
my %unique = ();
foreach my $item (@total)
{
    $unique{$item} ++;
}
# Default sort(), don't think it matters
@total = sort keys %unique;

# translate to serial numbers
my %serials = ();
for (my $num = 0; $num <= $#total; $num++)
{
    $serials{$total[$num]} = $num;
}

# convert array values to serial numbers, and combine them
my @tx = ();
for my $entry (@list1) { $tx[0] |= 2**$serials{$entry}; }
for my $entry (@list2) { $tx[1] |= 2**$serials{$entry}; }
for my $entry (@list3) { $tx[2] |= 2**$serials{$entry}; }
for my $entry (@list4) { $tx[3] |= 2**$serials{$entry}; }

&print_all;

sub inList
{
    my ($value, $list) = @_;
    # Undefined serial numbers are not accepted
    if (! defined ($serials{$value}) ) {
            print "$value is not in the predefined list.\n";
            exit;
    }
    return ( 2**$serials{$value} & $tx[$list] );
}

sub yesno
{
    my ($value, $list) = @_;
    return ( &inList($value, $list) ? "yes":"no" );
}
# 
# The following code is for printing purposes only
#
sub print_all
{
    printf "%-6s %-6s %-6s %-6s %-6s\n", "", "List1", "List2", "List3", "List4";
    print "-" x 33, "\n";
    &table_print(@list1);
    &table_print(@list2);
    &table_print(@list3);
    &table_print(@list4);
}

sub table_print
{
    my @list = @_;
    for my $entry (@list) {
        printf "%-6s %-6s %-6s %-6s %-6s\n", $entry,
            &yesno($entry, 0),
            &yesno($entry, 1),
            &yesno($entry, 2),
            &yesno($entry, 3);
    }
    print "-" x 33, "\n";
}


I would build a hash using directory entries as keys containing hashes (actually sets) of each listing in which that was found. Iterate over each listing, for each new entry add it to the outer hash with a single set (or hash) containing the identifier of the listing in which it was first encountered. For any entry that's found in the hash simply add the current listing identifier to the value's set/hash.

From there you can simply post process the sorted keys of the hash, and creating rows of your resulting table.

Personally I think Perl is ugly but here's a sample in Python:

#!/usr/bin/env python
import sys
if len(sys.argv) < 2:
    print >> sys.stderr, "Must supply arguments"
    sys.exit(1)
args = sys.argv[1:]

# build hash entries by iterating over each listing
d = dict()
for each_file in args:
    name = each_file
    f = open(each_file, 'r')
    for line in f:
        line = line.strip()
        if line not in d:
            d[line] = set()
        d[line].add(name)
    f.close()

# post process the hash
report_template = "%-20s" + ("  %-10s" * len(args))
print report_template % (("Dir Entries",) + tuple(args))
for k in sorted(d.keys()):
    row = list()
    for col in args:
        row.append("yes") if col in d[k] else row.append("no")
    print report_template % ((k,)+tuple(row))

That should mostly be legible as if it were psuedo-code. The (k,) and ("Dir Entries",) expressions might look a little odd; but that's to force them to be tuples which are are necessary to unpack into the format string using the % operator for strings. Those could also have been written as tuple([k]+row) for example (wrapping the first item in [] makes it a list which can be added to the other list and all converted to a tuple).

Other than that a translation to Perl should be pretty straightforward, just using hashes instead of dictionaries and sets.

(Incidentally, this example will work with an arbitrary number of listings, supplied as arguments and output as columns. Obviously after a dozen columns the output would get to be rather cumbersome to print or display; but it was an easily generalization to make).

0

精彩评论

暂无评论...
验证码 换一张
取 消