Majority Voting in perl?_问答_开发者_运维开发者技术经验分享

I have 5 files containing the same words. I want to read each word in all the files and decide the winning word by detecting the following characters in a word (*, #, $, &) separated by tabs. Then, I want to generate an output file. Ii can only have 2 winners. For example:

file1

    we$
    are*
    ...

file2

    we$
    are#
    ...

file3

    we&
    are*
    ...

file4

    we$
    are#
    ...

file5

    we$
    are&
    ...

output file:

we$                       
are*#

Here is how I started:

#!/usr/local/bin/perl -w

sub read_file_line {
  my $fh = shift;    
  if ($fh and my $line = <$fh>) {    
    chomp($line);    
    return $line;
  }    
  return;    
}

open(my $f1, "words1.txt") or die "Can't";
open(my $f2, "words2.txt") or die "Can't";
open(my $f3, "words3.txt") or die "Can't";
open(my $f4, "words4.txt") or die "Can't";
open(my $f5, "words5.txt") or die "Can't";

my $r1 = read_file_line($f1);
my $r2 = read_file_line($f2);
my $r3 = read_file_line($f3);
my $r4 = read_file_line($f4);
my $r5 = read_file_line($f5);

while ($f5) {

    #What can I do here to decide and write the winning word in the output file?

$r1 = read_file_li开发者_Go百科ne($f1);
$r2 = read_file_line($f2);
$r3 = read_file_line($f3);
$r4 = read_file_line($f4);
$r5 = read_file_line($f5);
}

Test Data Generator

#!/usr/bin/env perl

use strict;
use warnings;

foreach my $i (1..5)
{
    my $file = "words$i.txt";
    open my $fh, '>', $file or die "Failed to open $file for writing ($!)";
    foreach my $w (qw (we are the people in charge and what we say goes))
    {
        my $suffix = substr('*#$&', rand(4), 1);
        print $fh "$w$suffix\n";
    }
}

Majority Voting Code

#!/usr/bin/env perl

use strict;
use warnings;

my @files = ( "words1.txt", "words2.txt", "words3.txt",
              "words4.txt", "words5.txt"
            );

my @fh;
{
    my $n = 0;
    foreach my $file (@files)
    {
        open my $f, '<', $file or die "Can't open $file for reading ($!)";
        $fh[$n++] = $f;
    }
}

while (my $r = process_line(@fh))
{
    print "$r\n";
}

sub process_line
{
    my(@fhlist) = @_;
    my %words = ();
    foreach my $fh (@fhlist)
    {
        my $line = <$fh>;
        return unless defined $line;
        chomp $line;
        $words{$line}++;
    }

    my $combo = '';
    foreach my $word (keys %words)
    {
        return $word    if ($words{$word} >  2);
        $combo .= $word if ($words{$word} == 2);
    }
    $combo =~ s/(\W)\w+(\W)/$1$2/;
    return $combo;
}

Example Data and Results

$ perl datagenerator.pl
$ perl majorityvoter.pl > results.txt
$ paste words?.txt results.txt
we*     we$     we&     we#     we#     we#
are*    are#    are#    are*    are$    are*#
the*    the&    the#    the#    the&    the&#
people& people& people$ people# people# people&#
in#     in*     in$     in*     in*     in*
charge* charge# charge& charge* charge# charge#*
and$    and*    and$    and&    and$    and$
what&   what&   what$   what&   what#   what&
we#     we*     we*     we&     we*     we*
say$    say&    say$    say$    say$    say$
goes$   goes&   goes#   goes#   goes#   goes#
$

This seems to be correct for the test data in the files generated.

Revised requirements - example output

The 'revised requirements' replaced the '*#$&' markers after the words with a tab and one of the letters 'ABCD'. After some swift negotiation, the question is restored to its original form. This output is from a suitably adapted version of the answer above - 3 code lines changed, 2 in the data generator, 1 in the majority voter. Those changes are not shown - they are trivial.

we      C       we      D       we      C       we      C       we      D       we      C
are     C       are     D       are     C       are     B       are     A       are     C
the     B       the     D       the     A       the     A       the     D       the     A|D
people  D       people  B       people  A       people  B       people  D       people  B|D
in      D       in      B       in      C       in      B       in      D       in      D|B
charge  C       charge  D       charge  D       charge  D       charge  A       charge  D
and     A       and     B       and     C       and     C       and     B       and     B|C
what    B       what    B       what    B       what    C       what    C       what    B
we      D       we      B       we      D       we      B       we      A       we      B|D
say     D       say     D       say     B       say     D       say     D       say     D
goes    A       goes    C       goes    A       goes    C       goes    A       goes    A

Revised test generator - for configurable number of files

Now that the poster has worked out how to handle the revised scenario, this is the data generator code I used - with 5 tags (A-E). Clearly, it would not take a huge amount of work to configure the number of tags on the command line.

#!/usr/bin/env perl

use strict;
use warnings;

my $fmax  = scalar(@ARGV) > 0 ? $ARGV[0] : 5;
my $tags  = 'ABCDE';
my $ntags = length($tags);
my $fmt   = sprintf "words$fmax-%%0%0dd.txt", length($fmax);

foreach my $fnum (1..$fmax)
{
    my $file = sprintf $fmt, $fnum;
    open my $fh, '>', $file or die "Failed to open $file for writing ($!)";
    foreach my $w (qw(We Are The People In Charge And What We Say Goes))
    {
        my $suffix = substr($tags, rand($ntags), 1);
        print $fh "$w\t$suffix\n";
    }
}

Revised Majority Voting Code - for arbitrary number of files

This code works with basically arbitrary numbers of files. As noted in one of the (many) comments, it does not check that the word is the same in each file as required by the question; you could get quirky results if the words are not the same.

#!/usr/bin/env perl

use strict;
use warnings;

my @files = scalar @ARGV > 0 ? @ARGV :
            ( "words1.txt", "words2.txt", "words3.txt",
              "words4.txt", "words5.txt"
            );
my $voters = scalar(@files);

my @fh;
{
    my $n = 0;
    foreach my $file (@files)
    {
        open my $f, '<', $file or die "Can't open $file for reading ($!)";
        $fh[$n++] = $f;
    }
}

while (my $r = process_line(@fh))
{
    print "$r\n";
}

sub process_line
{
    my(@fhlist) = @_;
    my %words = ();
    foreach my $fh (@fhlist)
    {
        my $line = <$fh>;
        return unless defined $line;
        chomp $line;
        $words{$line}++;
    }
    return winner(%words);
}

# Get tag X from entry "word\tX".
sub get_tag_from_word
{
    my($word) = @_;
    return (split /\s/, $word)[1];
}

sub winner
{
    my(%words)   = @_;
    my $maxscore = 0;
    my $winscore = ($voters / 2) + 1;
    my $winner   = '';
    my $taglist  = '';
    foreach my $word (sort keys %words)
    {
        return "$word\t$words{$word}" if ($words{$word} >= $winscore);
        if ($words{$word} > $maxscore)
        {
            $winner = $word;
            $winner =~ s/\t.//;
            $taglist = get_tag_from_word($word);
            $maxscore = $words{$word};
        }
        elsif ($words{$word} == $maxscore)
        {
            my $newtag = get_tag_from_word($word);
            $taglist .= "|$newtag";
        }
    }
    return "$winner\t$taglist\t$maxscore";
}

One Example Run

After considerable experimentation on the data presentation, one particular set of data I generated gave the result:

We          A|B|C|D|E   2  B  C  C  E  D  A  D  A  E  B
Are         D           4  C  D  B  A  D  B  D  D  B  E
The         A           5  D  A  B  B  A  A  B  E  A  A
People      D           4  E  D  C  D  B  E  D  D  B  C
In          D           3  E  C  D  D  D  B  C  A  A  B
Charge      A|E         3  E  E  D  A  D  A  B  A  E  B
And         E           3  C  E  D  D  C  A  B  E  B  E
What        A           5  B  C  C  A  A  A  B  A  D  A
We          A           4  C  A  A  E  A  E  C  D  A  E
Say         A|D         4  A  C  A  A  D  E  D  A  D  D
Goes        A           3  D  B  A  C  C  A  A  E  E  B

The first column is the word; the second is the winning tag or tags; the third (numeric) column is the maximum score; the remaining 10 columns are the tags from the 10 data files. As you can see, there two each of 'We A', 'We B', ... 'We E' in the first row. I've also generated (but not preserved) one result set where the maximum score was 7. Given enough repetition, these sorts of variations are findable.

Sounds like the job for a hash of hashes. Untested code:

use strict;
use warnings;
use 5.010;
use autodie;
use List::Util qw( sum reduce );

my %totals;

my @files = map "words$_.txt", 1..5;

for my $file (@files) {
    open my $fh, '<', $file;
    while (<$fh>) {
        chomp;
        my ($word, $sign) = /(\w+)(\W)/;
        $totals{$word}{$sign}++;
    }
}

open my $totals_fh, '>', 'outfile.txt';

my @sorted_words = sort { sum values %{$totals{$a}} <=> sum values %{$totals{$b}} } keys %totals; #Probably something fancier here.

for my $word (@sorted_words[0, 1]) {
    #say {$totals_fh} $word, join('', keys %{$totals{$word}} ), "\t- ", function_to_decide_text($totals{$word});
    say {$totals_fh} $word, reduce {
            $totals{$word}{ substr $a, 0, 1 } == $totals{$word}{$b} ? $a . $b
          : $totals{$word}{ substr $a, 0, 1 } > $totals{$word}{$b} ? $a
          :                                                          $b;
    } keys %{ $totals{$word} };
}

EDIT: Forgot about the only two winners part. Fixed, somewhat.

EDIT2: Fixed as per comments.

#!/usr/bin/perl

use strict;
use warnings;

my @files   = qw(file1 file2 file3 file4 file5);
my $symbols = '*#$&'; # no need to escape them as they'll be in a character class
my %words;

foreach my $file (@files) {
   open(my $fh, '<', $file) or die "Cannot open $file: $!";
   while (<$fh>) {
      if (/^(\w+[$symbols])$/) {
         $words{$1} ++; # count the occurrences of each word
      }
   }
   close $fh;
}

my $counter  = 0;
my $previous = -1;

foreach my $word (sort {$words{$b} <=> $words{$a}} keys %words) {

   # make sure you don't exit if two words at the top of the list 
   # have the same number of occurrences
   if ($previous != $words{$word}) {
      last if $counter > 1;
   }
   $counter ++; # count the output
   $previous = $words{$word};

   print "$word occurred $words{$word} times.\n";
}

Worked when I tried it out...