开发者

Why does my Perl script keep reading from same file, even though I closed it?

开发者 https://www.devze.com 2022-12-17 09:11 出处:网络
I\'m writing this Perl script that gets two command line arguments: a directory and a year.In this directory is a ton of text files or html files(depending on the year).Lets say for instance it\'s the

I'm writing this Perl script that gets two command line arguments: a directory and a year. In this directory is a ton of text files or html files(depending on the year). Lets say for instance it's the year 2010 which contains files that look like this <number>rank.html with the number ranging from 2001 to 2212. I want it to open each file individually and take a part of the title in the html file and print it to a text file. However, when I run my code it just prints the first files title to the text file. It seems that it only ever opens the first file 2001rank.html and no others. I'll post the code below and thanks to anyone that helps.

my $directory = shift or "Must supply directory\n";
my $year = shift or "Must supply year\n";

unless (-d $directory) {
  die "Error: Directory must be a directory\n";
}

unless ($directory =~ m/\/$/) {
  $directory = "$directory/";
}

open COLUMNS, "> columns$year.txt" or die "Can't open columns file";
my $column_name;

for (my $i = 2001; $i <= 2212; $i++) {

  if ($year >= 2009) {
   开发者_开发问答 my $html_file = $directory.$i."rank.html";
    open FILE, $html_file;

    #check if opened correctly, if not, skip it
    unless (defined fileno(FILE)) {
      print "skipping $html_file\n";
      next;
    }

    $/ = "\n";
    my $line = <FILE>;

    if (defined $line) {
      $column_name = "";
      $_ = <FILE> until m{</title>};
      $_ =~ m{<title>CIA - The World Factbook -- Country Comparison :: (.+)</title>}i;
      $column_name = $1;
    }
    else {
      close FILE;
      next;
    }
    close FILE;
  }
  else {
    my $text_file = $directory.$i."rank.txt";
    open FILE, $text_file;

    unless (defined fileno(FILE)) {
      print "skipping $text_file\n";
      next;
    }

    $/ = "\r";
    my $line = <FILE>;

    if (defined $line) {
      $column_name = "";
      $_ = <FILE> until /Rank/i;
      $_ =~ /Rank(\s+)Country(\s+)(.+)(\s+)Date/i;
      $column_name = $3;
    }
    else {
      close FILE;
      next;
    }
    close FILE;
  }

  print "Adding $column_name to text file\n";
  print COLUMNS "$column_name\n";
}

close COLUMNS;

In other words $column_name gets set equal to the same thing every pass in the loop, even though I know the html files are different.


You'll probably be able to debug this a lot faster if you convert using local lexicals for your filehandles instead of globals, as well as turn on strict checking:

use strict;
use warnings;

while (...)
{
    # ...
    open my $filehandle, $html_file;

    # ...
    my $line = <$filehandle>;
}

This way, the filehandle(s) will go out of scope during each loop iteration, so you can more clearly see what exactly is being referenced and where. (Hint: you may have missed a condition where the filehandle gets closed, so it is improperly reused the next time around.)

For more on best practices with open and filehandles, see:

  • Why is three-argument open calls with autovivified filehandles a Perl best practice?
  • What's the best way to open and read a file in Perl?

Some other points:

  • Don't ever explicitly assign to $_, that's asking for trouble. Declare your own variable to hold your data: my $line = <$filehandle> (as in the example above)
  • Pull out your matches directly into variables, rather than using $1, $2 etc, and only use parentheses for the portions you actually need: my ($column_name) = ($line =~ m/Rank\s+Country\s+.+(\s+)Date/i);
  • put the error conditions first, so the bulk of your code can be outdented one (or more) level(s). This will improve readability, as when the bulk of your algorithm is visible on the screen at once, you can better visualize what it is doing and catch errors.

If you apply the points above I'm pretty sure that you'll spot your error. I spotted it while making this last edit, but I think you'll learn more if you discover it yourself. (I'm not trying to be snooty; trust me on this!)


Your processing is similar for HTML and text files, so make your life easy and factor out the common part:

sub scrape {
  my($path,$pattern,$sep) = @_;

  unless (open FILE, $path) {
    warn "$0: skipping $path: $!\n";
    return;
  }

  local $/ = $sep;

  my $column_name;
  while (<FILE>) {
    next unless /$pattern/;
    $column_name = $1;
    last;
  }

  close FILE;

  ($path,$column_name);
}

Then make it specific for the two types of input:

sub scrape_html {
  my($directory,$i) = @_;

  scrape $directory.$i."rank.html", 
         qr{<title>CIA - The World Factbook -- Country Comparison :: (.+)</title>}i,
         "\n";
}

sub scrape_txt {
  my($directory,$i) = @_;

  scrape $directory.$i."rank.txt",
         qr/Rank\s+Country\s+(.+)\s+Date/i,
         "\r";
}

Then your main program is straightforward:

my $directory = shift or die "$0: must supply directory\n";
my $year      = shift or die "$0: must supply year\n";

die "$0: $directory is not a directory\n"
  unless -d $directory;

# add trailing slash if necessary
$directory =~ s{([^/])$}{$1/};

my $columns_file = "columns$year.txt";
open COLUMNS, ">", $columns_file
  or die "$0: open $columns_file: $!";

for (my $i = 2001; $i <= 2212; $i++) {
  my $process = $year >= 2009 ? \&scrape_html : \&scrape_txt;

  my($path,$column_name) = $process->($directory,$i);

  next unless defined $path;

  if (defined $column_name) {
    print "$0: Adding $column_name to text file\n";
    print COLUMNS "$column_name\n";
  }
  else {
    warn "$0: no column name in $path\n";
  }
}

close COLUMNS or warn "$0: close $columns_file: $!\n";

Note how careful you have to be to close global filehandles. Please use lexical filehandles as in

open my $fh, $path or die "$0: open $path: $!";

Passing $fh as a parameter or stuffing it in hashes is much nicer. Also, lexical filehandles close automatically when they go out of scope. There's no chance of stomping on a handle someone else is already using.


Have you considered grep?

grep out just the line from the HTML containing the title, and then process the output of grep.

Simpler, as you would not have to write any file-handling code. You didn't say what you want with that title - if you only need a list, you might not need to write any code at all.

Try something like:

grep -ri title <directoryname>
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号