Single line regular expression needed for a pattern in Perl

Developer https://www.devze.com 2023-02-19 13:48 Source: Internet
I need to read many HTML files that share a similar structure using Perl.

The structure consists of STRRRR...E

  • S = the HTML header just before the table begins
  • T = a unique table-start structure in the HTML file (I can identify it)
  • R = a group of HTML elements (these are tr's; I can identify them too)
  • E = everything remaining, which signifies the end of the R's

I want to extract all the R's into an array using a single-line "m" (match) operator from perlop.

I'm looking for something like this:

@all_Rs = $htmlfile=~m{ST(R)*E}gs;

But it has never worked out.

Until now I've been taking a roundabout approach: deleting unwanted text, looping with for, and so on. I want to extract all rows from this page: http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx and there are many such pages.
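
A likely reason the pattern above yields only one element: in Perl, a quantified capture group such as (R)* keeps just its last repetition, so a single match can never return every R. The sketch below illustrates this and a two-step alternative (isolate the text between T and E, then match each R globally). The tags standing in for S, T, R and E are hypothetical placeholders, not the markup of the real page, and a regex like this remains fragile on real-world HTML.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: S = "<h1>...</h1>", T = "<table>",
# each R = "<tr>...</tr>", E = "</table>".
my $htmlfile = '<h1>Stations</h1><table><tr>a</tr><tr>b</tr><tr>c</tr></table>';

# A quantified capture group keeps only its final repetition,
# so this returns one element (the last row), not three.
my @last_only = $htmlfile =~ m{<table>(<tr>.*?</tr>)*</table>}s;

# Two steps instead: capture everything between T and E, then
# collect every R inside it with a global match in list context.
my @all_Rs;
if ( my ($body) = $htmlfile =~ m{<table>(.*?)</table>}s ) {
    @all_Rs = $body =~ m{(<tr>.*?</tr>)}gs;
}

print scalar(@last_only), " vs ", scalar(@all_Rs), "\n";   # prints "1 vs 3"
```

This is two match operations rather than one, which is the price of getting every repetition back.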


Regex is the wrong tool. Use an HTML parser.

use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_content(<<'END_OF_HTML');
<html>
    <table>
        <tr>1
        <tr>2
        <tr>3
        <tr>4
        <tr>5
    </table>
</html>
END_OF_HTML

print $_->as_text for $tree->findnodes('//tr');

HTML::TreeBuilder::XPath inherits from HTML::TreeBuilder.


daxim is right about using a real parser. My personal choice is XML::LibXML.

use XML::LibXML;
my $parser = XML::LibXML->new();
$parser->recover(1);                 # don't fail on parsing errors
my $doc = do { 
    local $SIG{__WARN__} = sub {};   # silence warning about parsing errors
    $parser->parse_html_file('http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx');
};

print $_->toString() for $doc->findnodes('//tr[td[1][@class="td_background"]]');

This gets me each station row from that page.

For a bit more work we can have a nice data structure to hold the text in each cell.

use Data::Dumper;
my @data = map {
    my $row = $_;
    [ map {
        $_->findvalue('normalize-space(text())');
    } $row->findnodes('td') ]
} $doc->findnodes('//tr[td[1][@class="td_background"]]');
print Dumper \@data;
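
The @data built above is an array of array references, one per row, each holding that row's cell text. As a rough sketch of consuming that shape, the snippet below joins each row into a tab-separated line. The sample rows are invented placeholders so the example runs on its own; they are not values from the real page.

```perl
use strict;
use warnings;

# Placeholder rows in the same shape as @data (array of array refs);
# the values are invented for illustration only.
my @data = (
    [ 'Station A', 'AAA', 'x' ],
    [ 'Station B', 'BBB', 'y' ],
);

# Join each row's cells with tabs to get one line per row.
my @lines = map { join "\t", @$_ } @data;
print "$_\n" for @lines;
```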


If you want to process an HTML table, consider using a module that knows how to process HTML tables!

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;


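# get() returns undef on failure, so stop early instead of parsing nothing
my $html = get('http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx')
    or die "Could not fetch the page\n";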
$html =~ s/&nbsp;/ /g;

my $te = HTML::TableExtract->new( depth => 1, count => 2 );
$te->parse($html);
foreach my $ts ($te->table_states) {
   foreach my $row ($ts->rows) {
      next if $row->[0] =~ /^\s*(Next|Station)/;
      next if $row->[4] =~ /^\s*(ARR\/DEP|RESERVATION)/;
      foreach my $cell (@$row) {
          $cell =~ s/^\s+//;
          $cell =~ s/\s+$//;
          print "$cell\n";
      }
      print "\n";
   }
}