I need to read many HTML files that share a similar structure, using Perl.
The structure is STRRRR...E, where:
- S = the HTML header, just before the table begins
- T = a unique table-start structure in the HTML file (I can identify it)
- R = a group of HTML elements (these are tr's; I can identify them too)
- E = everything remaining, which signifies the end of the R's
I want to extract all the R's into an array using a single-line "m" perlop.
I'm looking for something like this:
@all_Rs = $htmlfile=~m{ST(R)*E}gs;
But it has never worked.
Until now I've been doing it in a roundabout way: deleting unwanted text, using for loops, and so on. I want to extract all rows from this page: http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx and there are many such pages.
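A minimal sketch of one likely reason the one-liner above returns so little: in Perl, a quantified capture group such as (R)* keeps only its final iteration, so the match yields at most one R. The toy string and the `<rN>` markers below are hypothetical stand-ins for the real page, not its actual markup.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real page: S, T, three R blocks, then E.
my $htmlfile = 'S T <r1> <r2> <r3> E';

# A quantified capture group retains only the last repetition it matched.
my @last_only = $htmlfile =~ m{S T (?:(<r\d>) )*E};

# Matching the repeated unit itself with /g returns every occurrence.
my @all_Rs = $htmlfile =~ m{(<r\d>)}g;

print "quantified group: @last_only\n";
print "global match:     @all_Rs\n";
```

This only illustrates the capture behaviour on a toy string; real HTML rarely stays regular enough for the /g form to be reliable.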
Regex is the wrong tool. Use an HTML parser.
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_content(<<'END_OF_HTML');
<html>
<table>
<tr>1
<tr>2
<tr>3
<tr>4
<tr>5
</table>
</html>
END_OF_HTML
print $_->as_text for $tree->findnodes('//tr');
HTML::TreeBuilder::XPath inherits from HTML::TreeBuilder.
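Because of that inheritance, the usual HTML::Element methods are available on the same tree. A small sketch, assuming HTML::TreeBuilder::XPath is installed and using a made-up two-row table, that finds the rows with look_down instead of an XPath expression:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up sample markup, not taken from the page in the question.
my $tree = HTML::TreeBuilder::XPath->new_from_content(
    '<html><body><table>'
  . '<tr><td>1</td></tr><tr><td>2</td></tr>'
  . '</table></body></html>'
);

# look_down is inherited from HTML::Element; in list context it
# returns every element that satisfies the given criteria.
my @rows  = $tree->look_down( _tag => 'tr' );
my @texts = map { $_->as_text } @rows;

print "$_\n" for @texts;

# Release the tree explicitly once it is no longer needed.
$tree->delete;
```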
daxim is right about using a real parser. My personal choice is XML::LibXML.
use XML::LibXML;
my $parser = XML::LibXML->new();
$parser->recover(1); # don't fail on parsing errors
my $doc = do {
    local $SIG{__WARN__} = sub {}; # silence warnings about parsing errors
    $parser->parse_html_file('http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx');
};
print $_->toString() for $doc->findnodes('//tr[td[1][@class="td_background"]]');
This gets me each station row from that page.
For a bit more work we can have a nice data structure to hold the text in each cell.
use Data::Dumper;
my @data = map {
    my $row = $_;
    [ map {
        $_->findvalue('normalize-space(text())');
    } $row->findnodes('td') ]
} $doc->findnodes('//tr[td[1][@class="td_background"]]');
print Dumper \@data;
If you want to process an HTML table, consider using a module that knows how to process HTML tables!
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
my $html = get 'http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx';
$html =~ s/&nbsp;/ /g; # turn non-breaking-space entities into plain spaces
my $te = HTML::TableExtract->new( depth => 1, count => 2 );
$te->parse($html);
foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {
        # skip navigation and heading rows
        next if $row->[0] =~ /^\s*(Next|Station)/;
        next if $row->[4] =~ /^\s*(ARR\/DEP|RESERVATION)/;
        foreach my $cell (@$row) {
            # trim leading and trailing whitespace
            $cell =~ s/^\s+//;
            $cell =~ s/\s+$//;
            print "$cell\n";
        }
        print "\n";
    }
}
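HTML::TableExtract can also pick a table by its column headings rather than by depth and count, which may hold up better if the page layout shifts. A rough sketch, assuming the module is installed; the sample table, its headings and the trim helper are invented for illustration and are not taken from the page above:

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented sample markup with hypothetical column headings.
my $html = <<'END_OF_HTML';
<table>
<tr><th>Station</th><th>Code</th></tr>
<tr><td> Alpha </td><td>AAA</td></tr>
<tr><td> Beta </td><td>BBB</td></tr>
</table>
END_OF_HTML

# Hypothetical helper: strip surrounding whitespace, tolerate empty cells.
sub trim {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    return $text;
}

# Select any table whose header row contains these headings.
my $te = HTML::TableExtract->new( headers => [ 'Station', 'Code' ] );
$te->parse($html);

my @rows;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        push @rows, [ map { trim($_) } @$row ];
    }
}

print join( ', ', @$_ ), "\n" for @rows;
```

By default the header row itself is left out of the extracted rows, so only the data rows are collected here.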