Perl: parse HTML with HTML::TreeBuilder, HTML::Element, or HTML::Parser?

I'm trying to extract some information from HTML using Perl. I found HTML::TreeBuilder, HTML::Element, and HTML::Parser; which one should I use? How would I extract the name and the value from the rows below? The table is embedded in a larger HTML structure, and the only way to target the field I want is by the text of its first column, e.g. "Number of directories". Or should I just run a regex over the entire HTML?

<table cellspacing="0">
    <tbody><tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;monitored&nbsp;source&nbsp;files</td><td class="black">:</td><td class="black">&nbsp;425</td></tr>
        <tr><td class="black">Number of functions</td><td class="black">:</td><td class="black">&nbsp;6245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;measurement&nbsp;points</td><td class="black">:</td><td class="black">&nbsp;2457</td></tr>
        <tr><td class="red">TER</td><td class="red">:</td><td class="red">&nbsp;<strong>12%</strong>&nbsp;(decision)</td></tr>
    </tbody></table>

If you need to extract data from an HTML table, then

use HTML::TableExtract;

would be a good choice.
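
As a rough sketch of how that might look for the table in the question: the sample HTML is trimmed to two rows, and the variable names and the clean() helper are illustrative assumptions, not part of HTML::TableExtract. With no headers, depth, or count constraints the extractor matches every table, so on a real page you would probably want to narrow it down.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Sample input: a trimmed copy of the table from the question.
my $html = <<'END_HTML';
<table cellspacing="0">
<tbody><tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
<tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
</tbody></table>
END_HTML

# No constraints given, so every table in the document is extracted.
my $te = HTML::TableExtract->new;
$te->parse($html);

my %stats;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # Each row is an array ref of cell texts: label, ":", value.
        my ( $label, undef, $value ) = map { clean($_) } @$row;
        $stats{$label} = $value if length $label;
    }
}

# Hypothetical helper: replace non-breaking spaces (decoded or not), then trim.
sub clean {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/\xA0|&nbsp;/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

print "$stats{'Number of directories'}\n";
```

The lookup by label at the end is what the question asks for: once the rows are in a hash, the value is addressed by the text of its first column.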

There are a few steps:

1. Use one of HTML::TreeBuilder's constructors to parse the HTML.
2. Convert the HTML::TreeBuilder object at the root into an HTML::Element by calling elementify.
3. Understand the structure of your HTML well enough that you can tell HTML::Element::look_down() how to find the bits you are interested in. You can specify criteria in almost any form imaginable.
4. Use HTML::Element::look_down(), content_list(), left(), right() and related methods to traverse the area of interest and extract data. DO NOT USE traverse()--it was a bad idea.
5. Pass the data you collected to whatever system asked for it in the first place.

Here's some code:

use strict;
use warnings;
use HTML::TreeBuilder;

my $blarg = <<'END_HTML';
<table cellspacing="0">
    <tbody><tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;monitored&nbsp;source&nbsp;files</td><td class="black">:</td><td class="black">&nbsp;425</td></tr>
        <tr><td class="black">Number of functions</td><td class="black">:</td><td class="black">&nbsp;6245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
        <tr><td class="black">Number&nbsp;of&nbsp;measurement&nbsp;points</td><td class="black">:</td><td class="black">&nbsp;2457</td></tr>
        <tr><td class="red">TER</td><td class="red">:</td><td class="red">&nbsp;<strong>12%</strong>&nbsp;(decision)</td></tr>
    </tbody></table>
END_HTML

# Use any of the constructors to get your base object.  See the pod.
my $tree = HTML::TreeBuilder->new_from_content($blarg);

$tree->elementify;  # Make it just a plain HTML::Element object.

# Iterate over the rows. look_down() and related methods provide powerful
# ways to find matching elements; read the pod for more details.
my %crud_from_table;
for my $row ( $tree->look_down( _tag => 'tr' ) ) {
    # Each row holds three cells: label, ":" separator, value.
    my ($key, undef, $value) = map { $_->as_text } $row->content_list;
    for ( $key, $value ) {
        s/\xA0/ /g;        # &nbsp; is decoded to \xA0
        s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    }
    $crud_from_table{$key} = $value;
}

The most important part is understanding your HTML well enough to describe to look_down() how to find the information you want. Sometimes you can zoom right to it by matching an id. Other times you have to look for the third div of class 'foo' with a table in it. This is also the hardest part, and the one I can help you with the least. You are just going to have to experiment.
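
For the case in the question, where the only handle is the label text, one possible approach is to pass look_down() a code reference and then step to the value cell with right(). This is only a sketch: the wrapping div, the variable names, and the cell_text() helper are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample input: two rows of the question's table inside an assumed wrapper.
my $html = <<'END_HTML';
<html><body><div id="report">
<table cellspacing="0">
<tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
<tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
</table>
</div></body></html>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down() returns the first element that satisfies
# every criterion: here a <td> whose cleaned text equals the label.
my $label_cell = $tree->look_down(
    _tag => 'td',
    sub { cell_text( $_[0] ) eq 'Number of directories' },
);
die "label cell not found\n" unless $label_cell;

# right() in list context returns the following siblings; keep only element
# nodes and take the last one, which is the value cell in this layout.
my @siblings = grep { ref } $label_cell->right;
die "value cell not found\n" unless @siblings;
my $value = cell_text( $siblings[-1] );

print "$value\n";

# Hypothetical helper: element text with &nbsp; (\xA0) normalized and trimmed.
sub cell_text {
    my ($element) = @_;
    my $text = $element->as_text;
    $text =~ s/\xA0/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}
```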

Good luck.

Of course, everyone is going to have their own favorite. I prefer HTML::TokeParser; I find it easy to understand and use (once you get over the hump of how the returned arrays work). And I have to point you to the classic SO post reminding you not to parse HTML with regular expressions.
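
A minimal HTML::TokeParser sketch for the same table could look like the following. The sample HTML is trimmed to two rows, and the variable names are illustrative. It relies on get_tag() returning an array reference whose first element is the tag name (with a leading "/" for end tags).

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample input: a trimmed copy of the table from the question.
my $html = <<'END_HTML';
<table cellspacing="0">
<tbody><tr><td class="black">Number of directories</td><td class="black">:</td><td class="black">&nbsp;80</td></tr>
<tr><td class="black">Number&nbsp;of&nbsp;source&nbsp;lines</td><td class="black">:</td><td class="black">&nbsp;3245</td></tr>
</tbody></table>
END_HTML

# A scalar reference tells the parser to read the document from the string.
my $parser = HTML::TokeParser->new( \$html )
    or die "cannot create parser\n";

my %stats;
while ( $parser->get_tag('tr') ) {
    my @cells;

    # Walk forward to each <td>, stopping when the row closes.
    while ( my $tag = $parser->get_tag( 'td', '/tr' ) ) {
        last if $tag->[0] eq '/tr';
        my $text = $parser->get_text('/td');
        $text =~ s/\xA0|&nbsp;/ /g;     # normalize non-breaking spaces
        $text =~ s/^\s+|\s+$//g;        # trim
        push @cells, $text;
    }

    # First cell is the label, last cell is the value.
    $stats{ $cells[0] } = $cells[-1] if @cells >= 2;
}

print "$stats{'Number of directories'}\n";
```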