Parse specific text from html using Perl

Developer https://www.devze.com 2023-03-18 16:18 Source: web

I have an HTML page containing particular text that I want to parse into a database using a Perl script.

I want to be able to strip off all the stuff I don't want. An example of the HTML is:

    <div class="postbody">
        <h3><a href="foo">Re: John Smith <span class="posthilit">England</span></a></h3>
        <div class="content">Is C# better than Visual Basic?</div>
    </div>

So I would want to import the following into the database:

  1. Name: John Smith.
  2. Lives in: England.
  3. Commented: Is C# better than Visual Basic?

I have started to write a Perl script, but it needs to be changed to do what I want:

    use DBI;

    open (FILE, "list") || die "couldn't open the file!";
    open (F1, ">list.csv") || die "couldn't open the file!";

    print F1 "Name\|Lives In\|Commented\n";

    while ($line=<FILE>)
    {
        chop($line);
        $text = "";
        $add = 0;
        open (DATA, $line) || die "couldn't open the data!";
        while ($data=<DATA>)
        {
            if ($data =~ /ds\-div/)
            {
                $data =~ s/\,//g;
                $data =~ s/\"//g;
                $data =~ s/\'//g;
                $text = $text . $data;
            }
        }

        @p = split(/\\/, $line);
        print F1 $p[2];
        print F1 ",";
        print F1 $p[1];
        print F1 ",";
        print F1 $p[1];
        print F1 ",";

        print F1 "\n";
        $a = $a + 1;
    }

Any input would be greatly appreciated.


Please do not use regular expressions to parse HTML as HTML is not a regular language. Regular expressions describe regular languages.

It is easy to parse HTML with HTML::TreeBuilder (and its family of modules):

#!/usr/bin/env perl

use warnings;
use strict;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content(
    do { local $/; <DATA> }
);

for ( $tree->look_down( 'class' => 'postbody' ) ) {
    my $location = $_->look_down( 'class' => 'posthilit' )->as_trimmed_text;
    my $comment  = $_->look_down( 'class' => 'content' )->as_trimmed_text;
    my $name     = $_->look_down( '_tag'  => 'h3' )->as_trimmed_text;
    $name =~ s/^Re:\s*//;
    $name =~ s/\s*\Q$location\E\s*$//;   # \Q...\E: treat the location as literal text

    print "Name: $name\nLives in: $location\nCommented: $comment\n";
}

__DATA__
<div class="postbody">
    <h3><a href="foo">Re: John Smith <span class="posthilit">England</span></a></h3>
    <div class="content">Is C# better than Visual Basic?</div>
</div>

Output

Name: John Smith
Lives in: England
Commented: Is C# better than Visual Basic?

However, if you need much more control, have a look at HTML::Parser, as ADW has already suggested.


Use an HTML parser, such as HTML::TreeBuilder, to parse the HTML; don't do it yourself.

Also, don't use two-arg open with global handles, don't use chop--use chomp (read the perldoc to understand why). Find yourself a newer tutorial. You are using a ton of OLD OLD OLD Perl. And damnit, USE STRICT and USE WARNINGS. I know you've been told to do this. Do it. Leaving it out will do nothing but buy you pain.

Go. Read. Modern Perl. It is free.
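
As a rough illustration of that advice (three-arg open with a lexical filehandle, chomp instead of chop, strict and warnings switched on), reading a file that lists one file name per line might look something like the sketch below. The sub name read_list is made up for this example and is not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return the non-empty lines of $list_path,
# one file name per line.
sub read_list {
    my ($list_path) = @_;

    # Three-arg open with a lexical handle; $! says why open failed.
    open my $in, '<', $list_path
        or die "Couldn't open '$list_path': $!";

    my @names;
    while ( my $line = <$in> ) {
        chomp $line;               # strips a trailing newline only; chop would
                                   # eat the last character even without one
        next unless length $line;  # skip blank lines
        push @names, $line;
    }
    close $in;

    return @names;
}
```

If the last line of the list has no trailing newline, chop would silently cut the final character off the file name, which is the usual reason to prefer chomp. The HTML parsing itself is a separate step.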

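use strict;
use warnings;

use HTML::TreeBuilder;

# $file_name is not defined in the original snippet; taking it from the
# command line is an assumption.
my $file_name = shift @ARGV or die "Usage: $0 file.html\n";

my $page = HTML::TreeBuilder->new_from_file( $file_name );
$page->elementify;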

my @posts;
for my $post ( $page->look_down( class => 'postbody' ) ) {

    my %post = (
        name    => get_name($post),
        loc     => get_loc($post),
        comment => get_comment($post),
    );

    push @posts, \%post;
}

# Persist @posts however you want to.

sub get_name {
    my $post = shift;
    my $name = $post->look_down( _tag => 'h3' );
    return unless defined $name;

    $name = $name->look_down( _tag => 'a' );
    return unless defined $name;        

    $name = ($name->content_list)[0];
    return unless defined $name;        

    $name =~ s/^Re:\s*//;
    $name =~ s/\s+$//;

    return $name;
}

sub get_loc {
    my $post = shift;
    my $loc = $post->look_down( _tag => 'span', class => 'posthilit' );

    return unless defined $loc;

    return $loc->as_text;
}

sub get_comment {
    my $post = shift;
    my $com = $post->look_down( _tag => 'div', class => 'content' );

    return unless defined $com;

    return $com->as_text;
}

Now you have a nice data structure with all your post data. You can write it to CSV or a database or whatever it is you really want to do. You seem to be trying to do both.
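
For the "write it to CSV" half, a minimal core-Perl sketch could look like the following. It assumes each post is a hash reference with the name, loc and comment keys used above, and it writes the pipe-delimited layout hinted at by the header line in the question. The sub name write_posts is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: write post hashes (name, loc, comment) to $path
# as pipe-delimited text, one post per line.
sub write_posts {
    my ( $path, @posts ) = @_;

    open my $out, '>', $path
        or die "Couldn't open '$path' for writing: $!";

    print {$out} join( '|', 'Name', 'Lives In', 'Commented' ), "\n";

    for my $post (@posts) {
        my @fields = map {
            my $value = defined $_ ? $_ : '';  # missing fields become empty
            $value =~ s/[|\r\n]+/ /g;           # keep delimiter and newlines out of the data
            $value;
        } @{$post}{qw(name loc comment)};

        print {$out} join( '|', @fields ), "\n";
    }

    close $out or die "Couldn't close '$path': $!";
    return;
}
```

Calling write_posts( 'list.csv', @posts ) after the loop would produce one line per post. Replacing delimiter characters with spaces is a shortcut, not real CSV quoting; for proper CSV a module such as Text::CSV from CPAN is probably the safer choice, and for a database DBI with placeholders would do the same job.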


You'd be much better off using the HTML::Parser module from CPAN.
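
As a hedged sketch of what that might look like, the following uses HTML::Parser's version 3 event handlers to collect the text of every div with class "content". The sub name extract_comments and the depth-counting approach are choices made for this example, not something taken from the question.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the text inside each <div class="content">.
sub extract_comments {
    my ($html) = @_;
    my @comments;
    my $depth = 0;    # > 0 while inside a content div (counts nested divs)

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ( $tag, $attr ) = @_;
                return unless $tag eq 'div';
                if ($depth) {
                    $depth++;    # nested div inside a comment
                }
                elsif ( defined $attr->{class} && $attr->{class} eq 'content' ) {
                    $depth = 1;
                    push @comments, '';    # start a new comment
                }
            },
            'tagname, attr',
        ],
        # Text may arrive in several chunks, so append rather than assign.
        text_h => [ sub { $comments[-1] .= $_[0] if $depth }, 'dtext' ],
        end_h  => [ sub { $depth-- if $depth && $_[0] eq 'div' }, 'tagname' ],
    );

    $parser->parse($html);
    $parser->eof;

    return @comments;
}
```

The event-driven style needs more bookkeeping than a tree builder, but it never holds the whole document tree in memory, which may matter for very large pages.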
