How can I take queries from one file, search another, and output to a third, in Perl?

Edit: My original title has been sort of changed. I suspect the current title does not reveal my original purpose: let Perl automatically use the contents of one file as the source of search keywords to search another file and then output the matches to a third file.

This means without this kind of automation, I would have to manually type those query terms that are listed in FILE1 one by one and get matches from FILE2 one at a time by simply writing something like while(<FILE2>){if (/query terms/){print FILE3 $_}}.

To be more specific, FILE1 should look something like this:

azure
Byzantine
cystitis
dyspeptic
eyrie
fuzz

FILE2 might (or might not) look something like this:

azalea        n.  flowering shrub of the rhododendron family
azure         adj. bright blue, as of the sky 
byte          n. fixed number of binary digits, often representing a single character
Byzantine     adj. of Byzantium or the E Roman Empire
cystitis      n. inflammation of the bladder
Czech         adj. of the Czech Republic or Bohemia
dyslexic      adj. suffering from dyslexia
dyspeptic     adj. suffering from dyspepsia
eyelet        n. small hole in cloth, in a sail, etc for a rope, etc to go through; 
eyrie         n. eagle's nest
fuzz          n. mass of soft light particle
fuzzy         adj. like fuzz

FILE3 should look something like this if FILE2 is the way it is like above:

azure         adj. bright blue, as of the sky 
Byzantine     adj. of Byzantium or the E Roman Empire
cystitis      n. inflammation of the bladder
dyspeptic     adj. suffering from dyspepsia
eyrie         n. eagle's nest
fuzz          n. mass of soft light particle

It took me hours of trial and error to finally figure out a seemingly working solution, but my code is probably buggy, not to mention inefficient. I hope you guys can send me on the right track if I'm wrong, kindly offer me some guidance and share with me some different approaches to the problem if any (Well, there must be). As suggested by daotoad, I'm trying to comment out what each line of code does. Please correct me if I misunderstood something.

#!perl  #for Windows, simply perl suffices. I'm reading *Learning Perl*.    
use warnings; #very annoying I've always been receiving floods of error messages
use strict;   #I often have to look here and there because of my carelessness

open my $dic,'<', 'c:/FILE2.txt' or die "Cannot open dic.txt ;$!"; # 3-argument version of open statement helps avoid possible confusion; Dunno why when I replace dic.txt with $dic in the death note, I'll receive "needs explicit package name" warning. Any ideas?
open my $filter,'<','c:/FILE1.txt' or die "Cannot open new_word.txt :$!"; 
my @filter=<$filter>; #store the entire contents of FILE1 into @filter.
close $filter;        #FILE1 is useless so close the connection between FILE1 and perl
open my $learn,'>','c:/FILE3.txt'; #This file is where I output matching lines.
my $candidate="";     #initialize the candidate to empty string. It will be used to store matching lines. Learnt this from Jeff.

while(<$dic>){    #let perl read the contents of FILE2 line by line.
for (my $n=0; $n<=$#filter; $n++){ #let perl go through each line of FILE1 too
my $entry = $filter[$n];
chomp($entry);   #Figured out this line must be added after many fruitless attempts
if (/^$entry\s/){  #let perl compare each line of FILE2 with any line of FILE1.
$candidate.= $_ ; } #every time a match is found, store the line into $candidate
}
}
print $learn $candidate; #output the results to FILE3

Update 1:

Thank you very much for the guidance! I truly appreciate it :)

I believe I'm now going in a somewhat different direction than I originally intended. The concept of hashes was beyond my Perl knowledge at the time. Having finished the hashes section of Learning Perl, I'm now thinking: although hashes may efficiently solve the example problem I posted above, things might get complicated if the headwords (not the whole entries) in the definition file (FILE2) have duplicates.

But on the other hand, I see that hashes are very important in Perl programming. So this morning I tried to implement @mobrule's idea: "load the contents of FILE1 into a hash and then check whether the first word of each line in FILE2 was in your hash table." But then I decided I should load FILE2 into a hash instead of FILE1, because FILE2 contains dictionary entries and it is meaningful to treat HEADWORDS as KEYS and DEFINITIONS as VALUES. I came up with the following code. It seems close to success.

#!perl

open my $learn,'>','c:/file3.txt' or die "Cannot open Study Note;$!";
open my $dic,"<",'c:/file2.txt' or die "Cannot open Dictionary: $!";
my %hash = map {split/\t+/} <$dic>; # #I did some googling on how to load a file into a hash and found this works. But actually I don't quite understand why. I figured the pattern out by myself. /\t+/ seems to be working because the headwords and the main entries in FILE2 are separated by tabs.  

open my $filter,'<','c:/file1.txt' or die "Cannot open Glossary: $!";
while($line=<$filter>){
chomp ($line);
if (exists $hash{$line}){
print "$learn $hash{$line}"; # this line is buggy. first it won't output to FILE3. second, it only prints the values of the hash but I want to include the keys.
}
}

The code outputs the following results on the screen:

GLOB(0x285ef8) adj. bright blue, as of the sky
GLOB(0x285ef8) adj. of Byzantium or the E Roman Empire
GLOB(0x285ef8) n. inflammation of the bladder
GLOB(0x285ef8) adj. suffering from dyspepsia
GLOB(0x285ef8) n. eagle's nest
GLOB(0x285ef8) n. mass of soft light particle
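
The GLOB(0x285ef8) prefix in the output above is what a lexical filehandle looks like when it is interpolated into a string: inside double quotes, $learn is stringified like any other reference, and print then sends the whole string to STDOUT. A minimal sketch of the difference, using an in-memory filehandle (an assumption made here so the example needs no files on disk):

```perl
use strict;
use warnings;

# Open a write handle onto a scalar instead of a real file (illustration only).
my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open in-memory handle: $!";

# Inside the quotes, $out is merely interpolated into the string.
my $interpolated = "$out azure\n";      # looks like "GLOB(0x...) azure"

# Outside the quotes, with no comma after it, $out is the handle print writes to.
print $out "azure\n";
close $out or die "Cannot close in-memory handle: $!";

print "interpolated:      $interpolated";
print "written to handle: $buffer";
```

The only difference between the two forms is where the handle sits: in front of the string (no comma) it selects the destination, inside the string it is just text.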

Update 2:

One problem solved. I can print both keys and values now by doing a minor modification of the last line.

print "$learn $line: $hash{$line}";

Update 3:

Haha: I made it! I made it :) I modified the code again and now it outputs stuff to FILE3!

#!perl

open my $learn,'>','c:/file3.txt' or die $!;
open my $dic,"<",'c:/file2.txt' or die $!;
my %hash = map {split/\t+/} <$dic>; #the /\t+/ pattern works because the entries in my FILE2 are separated into the headwords and the definition by two tab spaces. 

open my $filter,'<','c:/file1.txt' or die $!;
while($line=<$filter>){
chomp ($line);
if (exists $hash{$line}){
print $learn "$line: $hash{$line}";
}
}
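
As for why the map { split /\t+/ } line builds a hash at all: in list context <$dic> returns every line, split turns each line into a (headword, definition) pair, map flattens all the pairs into one long list, and assigning a list to a hash takes its elements two at a time as key and value. A minimal sketch with hard-coded, tab-separated sample lines (assumed data standing in for FILE2):

```perl
use strict;
use warnings;

# Two sample dictionary lines, headword and definition separated by two tabs.
my @lines = (
    "azure\t\tadj. bright blue, as of the sky\n",
    "eyrie\t\tn. eagle's nest\n",
);

# Each line contributes two list elements: the headword, then the definition.
my @flat = map { split /\t+/ } @lines;

# Assigning the flat list to a hash pairs the elements up as key => value.
my %hash = @flat;

print scalar(@flat), " elements in the flat list\n";
print "azure => $hash{azure}";
```

The pairing only stays in step while every line splits into exactly two fields. A definition that itself contains a tab would yield three fields and shift every later pair; passing 2 as the third argument to split caps the number of fields and avoids that.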

Update 4:

I'm thinking if my FILE2 has totally different contents, say, sentences that contain query words in FILE1, it will be difficult, if not impossible, for us to use the hash approach, right?
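
For that sentence-style case, one possible approach (a sketch, not something proposed in this thread) is to combine the query words into a single alternation and match it anywhere in the line. The helper name build_matcher and the sample sentences are made up for illustration:

```perl
use strict;
use warnings;

# Build one compiled regex that matches any of the given words as a whole word.
# build_matcher is a hypothetical helper name, not part of any module.
sub build_matcher {
    my @words = @_;
    # quotemeta escapes regex metacharacters; longer words are tried first.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @words;
    return qr/\b(?:$alternation)\b/;
}

my $matcher = build_matcher(qw(azure fuzz eyrie));

my @sentences = (
    "The sky was a deep azure that morning.\n",
    "The picture was a little fuzzy.\n",
    "An eagle circled above its eyrie.\n",
);

# Keep only the sentences that contain at least one query word.
my @hits = grep { $_ =~ $matcher } @sentences;
print @hits;
```

The \b anchors keep "fuzz" from matching inside "fuzzy"; they assume every query word starts and ends with a word character, which holds for the sample list but not necessarily for arbitrary terms.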

Update 5:

Having carefully read the perlfunc page about the split operator, now I know how to improve my code :)

#!perl

open my $learn,'>','c:/file3.txt' or die $!;
open my $dic,"<",'c:/file2.txt' or die $!;
my %hash = map {split/\s+/,$_,2} <$dic>; # sets the limit of separate fields to 2
open my $filter,'<','c:/file1.txt' or die $!;
while($line=<$filter>){
chomp ($line);
if (exists $hash{$line}){
print $learn "$line: $hash{$line}";
}
}
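
On the duplicate-headword worry from Update 1: a plain hash keeps only the last definition seen for a repeated key. One common way around that is a hash of arrays, where each headword maps to a list of all its entries. A minimal sketch with made-up sample lines (the second "fuzz" entry is invented purely for illustration):

```perl
use strict;
use warnings;

my @dictionary = (
    "fuzz          n. mass of soft light particle\n",
    "fuzz          n. (invented second sense, for illustration)\n",
    "fuzzy         adj. like fuzz\n",
);

my %entries;
for my $line (@dictionary) {
    # Limit of 2: the headword, then the rest of the line untouched.
    my ($headword, $definition) = split /\s+/, $line, 2;
    # Autovivification creates the array the first time a headword is seen.
    push @{ $entries{$headword} }, $definition;
}

# Print every stored entry for each query word.
for my $word (qw(fuzz)) {
    next unless exists $entries{$word};
    print "$word: $_" for @{ $entries{$word} };
}
```

The lookup loop is otherwise the same as with a plain hash; only the value changes from a single string to an array reference.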

You're making the problem harder than it needs to be by thinking about all of it at once rather than breaking it down into manageable bits.

It doesn't look like you need regexes here. You just need to see if the term in the first column was in the list:

open my($patterns), '<', 'patterns.txt' or die "Could not get patterns: $!"; 

my %hash = map { my $p = $_; chomp $p; $p, 1 } <$patterns>;

open my($lines), '<', 'file.txt' or die "Could not open file.txt: $!";

while ( <$lines> ) {
    my( $term ) = split /\s+/, $_, 2;
    print if exists $hash{$term};
    }

If you really needed regular expressions to find the terms, you might be able to get away with just grep:

 grep -f patterns.txt file.txt

Have you gotten to the part of Learning Perl where you learn about hashes? You could load the contents of FILE1 into a hash and then check whether the first word of each line in FILE2 was in your hash table.

If you don't actually have to use Perl, (and you have cygwin or something else unixy installed), you can just do grep -f new_word.txt dic.txt. But let's assume you want to learn something about Perl here.. :)

use strict and use warnings are invaluable for spotting problems (and for teaching good habits). Remember that if you're unsure what a warning message means, you can look it up in perldoc perldiag.

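Regarding your comment "Dunno why when I replace dic.txt with $dic in the death note, I'll receive "needs explicit package name" warning. Any ideas?" -- $dic is not a filename, but a file handle, and is not something you generally want to print out. The message itself (a compile-time error under use strict, not a warning) appears because a variable declared with my is not visible until the statement after its declaration, so the $dic inside the die string of the same open statement is treated as an undeclared global. To avoid using the filename twice (say, to make it easier to change later), define it at the top of the file, as I have done.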

Using subroutines to advance the position in each file feels a little crude, but this algorithm only loops through each file once, and does not read either file into memory, so it will work even for huge input files. (This hinges on both files being sorted, which they appear to be in the example you provide.)

Code edited and fixed. I shouldn't have banged off a version just before bed and then not tested it (I blame the spouse) :D

use warnings;
use strict;

my $dictFile = 'dict.txt';
my $wordsFile = 'words.txt';
my $outFile = 'out.txt';

open my $dic, '<', $dictFile or die "Cannot open $dictFile: $!";
open my $filter, '<', $wordsFile or die "Cannot open $wordsFile: $!";
open my $learn, '>', $outFile or die "Cannot open $outFile: $!";

# create variables before declaring subs, which creates closures
my ($word, $key, $sep, $definition);
sub nextWord {
    $word = <$filter>;
    done() unless defined $word;    # end of the word list
    chomp $word;
}
sub nextEntry {
    my $line = <$dic>;
    done() unless defined $line;    # end of the dictionary
    # use parens around pattern to capture the separator into the list for later use
    ($key, $sep, $definition) = split(/(\s+)/, $line, 2);
}
sub done
{
    close $filter or warn "can't close $wordsFile: $!";
    close $dic or warn "can't close $dictFile: $!";
    close $learn or warn "can't close $outFile: $!";
    exit;
}

nextWord();
nextEntry();

# now let's loop until we hit the end of one of the input files
for (;;)
{
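    # the sample files are sorted case-insensitively ("byte" comes before
    # "Byzantine"), so compare lowercased copies to keep the two cursors in step
    if (lc $word lt lc $key)
    {
        nextWord();
    }
    elsif (lc $word gt lc $key)
    {
        nextEntry();
    }
    else    # same word, ignoring case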
    {
        # newline is still in definition; no need to append another
        print $learn ($key . $sep . $definition);
        nextWord();
        nextEntry();
    }
}

It seems reasonable to me to assume that the number of words to look up will be small relative to the size of the dictionary. Therefore, you can read FILE1.txt into memory, putting each word into a hash.

Then, read the dictionary, outputting only the lines where the term is in the hash. I would also output to STDOUT which can then be redirected from the command line to any file you want.

#!/usr/bin/perl

use strict; use warnings;
use autodie qw(open close);

my ($words_file, $dict_file) = @ARGV;

my %words;
read_words(\%words, $words_file);

open my $dict_fh, '<', $dict_file;

while ( my $line = <$dict_fh> ) {
    # capturing match in list context returns captured matches
    if (my ($term) = ($line =~ /^(\w+)\s+\w/)) {
        print $line if exists $words{$term};
    }
}

close $dict_fh;

sub read_words {
    my ($words, $filename) = @_;

    open my $fh, '<', $filename;
    while ( <$fh> ) {
        last unless /^(\w+)/;
        $words->{$1} = undef;
    }
    close $fh;
    return;
}

Invocation:

C:\Temp> lookup.pl FILE1.txt FILE2.txt > FILE3.txt

Output:

C:\Temp> type FILE3.txt
azure         adj. bright blue, as of the sky
Byzantine     adj. of Byzantium or the E Roman Empire
cystitis      n. inflammation of the bladder
dyspeptic     adj. suffering from dyspepsia
eyrie         n. eagle's nest
fuzz          n. mass of soft light particle

Are FILE1 and FILE2 initially sorted? If so, you only need a single loop, not a nested one:

use 5.010;
use warnings;
use strict;

my $dictFile = 'c:/FILE2.txt';
my $wordsFile = 'c:/FILE1.txt';
my $outFile = 'c:/FILE3.txt';

open my $dic, '<', $dictFile or die "Cannot open $dictFile: $!";
open my $filter, '<', $wordsFile or die "Cannot open $wordsFile: $!";
open my $learn, '>', $outFile or die "Cannot open $outFile: $!";

my $dic_line;
my $dic_word;
my $filter_word;

# loop forever (or until last'ing out of the loop, anyway)
while (1) {
    # if we don't have a word from the filter list, get one
    if ( ! defined $filter_word ) {
        # get a line from the filter file, bailing out of the loop if at the end
        $filter_word = <$filter> // last;
        # remove the newline so we can string compare
        chomp($filter_word);
    }
    # if we don't have a word from the dictionary, get one
    if ( ! defined $dic_line ) {
        # get a line from the dictionary, bailing out of the loop if at the end
        $dic_line = <$dic> // last;
        # get the first word on the line
        ($dic_word) = split ' ', $dic_line;
    }
    # if we have a match, print it
    if ( $dic_word eq $filter_word ) { print $learn $dic_line }
    # only keep considering this dictionary line if it is beyond the filter word we had
    if ( lc $dic_word le lc $filter_word ) { undef $dic_line }
    # only keep considering this filter word if it is beyond the dictionary line we had
    if ( lc $dic_word ge lc $filter_word ) { undef $filter_word }
}