I have an html page that has particular text that I w开发者_开发百科ant to parse into a databse using a Perl Script.
I want to be able to strip off all the stuff I don't want, an exmple of the html is-
<div class="postbody">
<h3><a href "foo">Re: John Smith <span class="posthilit">England</span></a></h3>
<div class="content">Is C# better than Visula Basic?</div>
</div>
Therefore I would want to import into the database
- Name: John Smith.
- Lives in: England.
- Commented: Is C# better than Visula Basic?
I have started to create a Perl script but it needs to be changed to work for what I want;
use DBI;
open (FILE, "list") || die "couldn't open the file!";
open (F1, ">list.csv") || die "couldn't open the file!";
print F1 "Name\|Lives In\|Commented\n";
while ($line=<FILE>)
{
chop($line);
$text = "";
$add = 0;
open (DATA, $line) || die "couldn't open the data!";
while ($data=<DATA>)
{
if ($data =~ /ds\-div/)
{
$data =~ s/\,//g;
$data =~ s/\"//g;
$data =~ s/\'//g;
$text = $text . $data;
}
}
@p = split(/\\/, $line);
print F1 $p[2];
print F1 ",";
print F1 $p[1];
print F1 ",";
print F1 $p[1];
print F1 ",";
print F1 "\n";
$a = $a + 1;
Any input would be greatly appreciated.
Please do not use regular expressions to parse HTML as HTML is not a regular language. Regular expressions describe regular languages.
It is easy to parse HTML with HTML::TreeBuilder
(and its family of modules):
#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content(
do { local $/; <DATA> }
);
for ( $tree->look_down( 'class' => 'postbody' ) ) {
my $location = $_->look_down( 'class' => 'posthilit' )->as_trimmed_text;
my $comment = $_->look_down( 'class' => 'content' )->as_trimmed_text;
my $name = $_->look_down( '_tag' => 'h3' )->as_trimmed_text;
$name =~ s/^Re:\s*//;
$name =~ s/\s*$location\s*$//;
print "Name: $name\nLives in: $location\nCommented: $comment\n";
}
__DATA__
<div class="postbody">
<h3><a href="foo">Re: John Smith <span class="posthilit">England</span></a></h3>
<div class="content">Is C# better than Visual Basic?</div>
</div>
Output
Name: John Smith
Lives in: England
Commented: Is C# better than Visual Basic?
However, if you require much more control, have a look at HTML::Parser
as has already been answered by ADW.
Use an HTML parser, like HTML::TreeBuilder to parse the HTML--don't do it yourself.
Also, don't use two-arg open with global handles, don't use chop
--use chomp
(read the perldoc to understand why). Find yourself a newer tutorial. You are using a ton of OLD OLD OLD Perl. And damnit, USE STRICT and USE WARNINGS. I know you've been told to do this. Do it. Leaving it out will do nothing but buy you pain.
Go. Read. Modern Perl. It is free.
my $page = HTML::TreeBuilder->new_from_file( $file_name );
$page->elementify;
my @posts;
for my $post ( $page->look_down( class => 'postbody' ) ) {
my %post = (
name => get_name($post),
loc => get_loc($post),
comment => get_comment($post),
);
push @posts, \%post;
}
# Persist @posts however you want to.
sub get_name {
my $post = shift;
my $name = $post->look_down( _tag => 'h3' );
return unless defined $name;
$name->look_down->(_tag=>'a');
return unless defined $name;
$name = ($name->content_list)[0];
return unless defined $name;
$name =~ s/^Re:\s*//;
$name =~ /\s*$//;
return $name;
}
sub get_loc {
my $post = shift;
my $loc = $post->look_down( _tag => 'span', class => 'posthilit' );
return unless defined $loc;
return $loc->as_text;
}
sub get_comment {
my $post = shift;
my $com = $post->look_down( _tag => 'div', class => 'content' );
return unless defined $com;
return $com->as_text;
}
Now you have a nice data structure with all your post data. You can write it to CSV or a database or whatever it is you really want to do. You seem to be trying to do both.
You'd be much better using the HTML::Parser
module from the CPAN.
精彩评论