Parse HTML using Perl

Source: https://www.devze.com, 2023-03-22 02:51 (reposted from the web)

I have the following HTML:

<div>
   <strong>Date: </strong>
       19 July 2011
</div>

I have been using HTML::TreeBuilder to pull out parts of the HTML that are identified by tags or classes, but with the HTML above I am having difficulty extracting only the date.

For instance, I tried:

for ( $tree->look_down( '_tag' => 'div' ) ) {
    my $date = $_->look_down( '_tag' => 'strong' )->as_trimmed_text;
}

But that seems to conflict with an earlier use of <strong>. I am looking to parse out just the '19 July 2011'. I have read the TreeBuilder documentation but cannot find a way of doing this.

How can I do this using TreeBuilder?

The "dump" method is invaluable in finding your way around an HTML::TreeBuilder object.
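As a rough illustration of that tip, the sketch below builds a tree from the question's HTML and dumps it. It assumes HTML::TreeBuilder is installed and uses new_from_content, which parses the string and finalises the tree in one call; the variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# The fragment from the question
my $html = <<'END_OF_HTML';
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML

# new_from_content parses the string and signals end of input for us
my $tree = HTML::TreeBuilder->new_from_content($html);

# Print an indented outline of the whole tree to STDOUT
$tree->dump;
```

In the dump, elements appear as start tags and text nodes as quoted strings, so you can see that the date is a direct text child of the <div>, sitting next to <strong> rather than inside it.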

The solution here is to get the parent element of the element you're interested in (which is, in this case, the <div>) and iterate across its content list. The text you're interested in will be plain text nodes, i.e. elements in the list that are not references to HTML::Element objects.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

$tree->parse(<<END_OF_HTML);
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML
$tree->eof;    # tell the parser the input is complete so the tree is finalised

my $date;

for my $div ($tree->look_down( _tag => 'div')) {
  for ($div->content_list) {
    $date = $_ unless ref;
  }
}

print "$date\n";

It looks like HTML::Element::content_list() is the function you want. Descendant nodes will be objects while text will just be text, so you can filter with ref() to just get the text part(s).

for ($tree->find('div')) {
  my @content = grep { ! ref } $_->content_list;
  # @content now contains just the bare text portion of the tag
}
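
A self-contained version of this approach might look like the sketch below. It assumes HTML::TreeBuilder is installed; the whitespace trimming and the @texts variable are additions for illustration and are not part of the answer above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Parse the fragment from the question and finalise the tree in one call
my $tree = HTML::TreeBuilder->new_from_content(<<'END_OF_HTML');
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML

my @texts;
for my $div ( $tree->find('div') ) {
    push @texts,
        grep { length }                                # drop nodes that were only whitespace
        map  { my $t = $_; $t =~ s/^\s+|\s+$//g; $t }  # trim a copy of each text node
        grep { !ref }                                  # keep plain strings, skip HTML::Element objects
        $div->content_list;
}

print "$_\n" for @texts;
```

For the HTML in the question this should leave the date as the only entry in @texts, because the "Date: " label lives inside the <strong> element and is therefore filtered out by the ref check.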

You could work around it by removing the text of <strong> from the text of the <div>:

my $div      = $tree->look_down( '_tag' => 'div' );
my $div_text = $div->as_trimmed_text;
my $date     = $div_text;    # declared here so it is still visible after the if block
if ( my $strong = $div->look_down( '_tag' => 'strong' ) ) {
    my $strong_text = $strong->as_trimmed_text;
    # \Q...\E quotes any regex metacharacters in the label; ^ anchors it to the start
    $date =~ s/^\Q$strong_text\E\s*//;
}