I have to parse 5000 files, which all look pretty much identical.
I would like to use HTML::TokeParser::Simple and DBI to do the parsing and store the results.
I have a little experience with HTML::TokeParser::Simple, but this task goes over my head. Note: I have also looked at the XPath ideas, which seem to be another appropriate way. But at the moment I have trouble getting the corresponding XPath expressions: I tried to determine the XPath expressions that need to be filled in in the Perl program.
This is what I have right now:
use strict;
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
#use real file name here
open(my $fh, "<", "file.html") or die $!;
$tree->parse_file($fh);
my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($type) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress_two) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($telephone) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($fax) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($internet) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($officer) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($employees) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($offices) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($worker) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($country) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($the_council)= $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
print $name->as_text;
print $type->as_text;
print $adress->as_text;
print $adress_two->as_text;
print $telephone->as_text;
print $fax->as_text;
print $internet->as_text;
print $officer->as_text;
print $employees->as_text;
print $offices->as_text;
print $worker->as_text;
print $country->as_text;
print $the_council->as_text;
Is this all right? Note: I want to store this in a database.
BTW: See one of the example sites:
http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488
In the grey shaded block you can see the wanted information: 17 lines. Note: I have 5000 different HTML files that are all structured in the very same way!
That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.
Can I make use of the code above, or do I have to change it?
Love to hear from you! That would be great!!
Use some HTML::TableExtract magic:
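#!/usr/bin/perl
use strict; use warnings;
use HTML::TableExtract;
use YAML;

# Pick the table by its attributes (taken from the saved page).
my $te = HTML::TableExtract->new( attribs => {
    border     => 0,
    bgcolor    => '#EFEFEF',
    leftmargin => 15,
    topmargin  => 5,
});

$te->parse_file('kultus-bw.html');
my ($table) = $te->tables;

for my $row ( $table->rows ) {
    cleanup(@$row);
    print "@$row\n";
}

# Tidies the cells in place (the elements of @_ are aliases).
sub cleanup {
    for ( @_ ) {
        s/\s+//;          # removes the first run of whitespace (not anchored to the start)
        s/[\xa0 ]+\z//;   # strips trailing non-breaking and plain spaces
        s/\s+/ /g;        # collapses remaining whitespace runs to one space
    }
}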
Output:
Schul-/Behördenname: Abendgymnasium Ostwürttemberg
Schulart: Privatschule (04313488)
Hausadressse: Friedrichstr.70, 73430 Aalen
Postfachadresse: Keine Angabe
Telefon: 07361/680040
Fax: 07361/680040
E-Mail: Keine Angabe
Internet: www.abendgymnasium-ostwuerttemberg.de
ÜbergeordneteDienststelle: Regierungspräsidium Stuttgart Abteilung 7 Schule und Bildung
Schulleitung: Keine Angabe
Stellv.Schulleitung: Keine Angabe
AnzahlSchüler: 259
AnzahlKlassen: 8
AnzahlLehrer: Keine Angabe
Kreis: Ostalbkreis
Schulträger: <Verband/Verein> (Verband/Verein)
Of course, I saved a local copy of the page before running the script.
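Since the question asks about storing the result in a database, here is a rough sketch of one possible next step: turning the label/value rows into a hash per file. The helper name rows_to_record and the sample rows are made up for illustration; in the real script the rows would come from $table->rows.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): takes rows of
# [label, value] cells and returns a hash reference keyed by the label
# without its trailing colon.
sub rows_to_record {
    my @rows = @_;
    my %record;
    for my $row (@rows) {
        my ($label, $value) = @$row;
        for ($label, $value) {
            $_ = '' unless defined $_;   # missing cells become empty strings
            s/\A[\s\xa0]+//;             # trim leading whitespace / non-breaking spaces
            s/[\s\xa0]+\z//;             # trim trailing whitespace / non-breaking spaces
            s/[\s\xa0]+/ /g;             # collapse inner runs to a single space
        }
        $label =~ s/\s*:\z//;            # drop the trailing colon from the label
        next unless length $label;       # skip rows without a label
        $record{$label} = $value;
    }
    return \%record;
}

# Hand-made sample rows in the shape of the output above (not read from a file).
my @sample = (
    [ 'Telefon:',        ' 07361/680040 ' ],
    [ 'Kreis:',          'Ostalbkreis' ],
    [ 'Anzahl  Klassen:', '8' ],
);

my $record = rows_to_record(@sample);
print "$_ => $record->{$_}\n" for sort keys %$record;
```

With a record like this per file, the values could be handed to DBI through placeholders, for example a statement prepared once with $dbh->prepare and run per file with $sth->execute(@{$record}{@columns}). The table and column names are up to you; I have not tried this against all 5000 files.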