I'm new to working with HTML in Perl. I'm trying to fetch both the text and the links from an HTML table.
Here is the HTML structure:
<td>Td-Text
<br>
<a href="Link-I-Want" title="title-I-Want">A-Text</a>
</td>
I've figured out that WWW::Mechanize is the easiest module for fetching what I need from the <a> part, but I'm not sure how to get the text from the <td>. I want the two tasks to happen back-to-back because I need to pair each cell's <td> text with its corresponding <a> text in a hash.
Any help will be much appreciated!
Z.Zen
WWW::Mechanize is good at extracting links, but if you need to get other text, I usually combine it with HTML::TreeBuilder. Something like this:
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content($mech->content);
foreach my $td ($tree->look_down(_tag => 'td')) {
    # If there's no <a> in this <td>, then skip it.
    # (Named $link rather than $a, since $a is special to sort.)
    my $link = $td->look_down(_tag => 'a') or next;
    my $tdText = $td->as_text;
    my $aText  = $link->as_text;
    printf("td-text: %s\na-text: %s\nhref: %s\ntitle: %s\n",
           $tdText, $aText, $link->attr('href'), $link->attr('title'));
}
The only problem with this code is that as_text on the <td> returns all of the text inside the cell, including the text of the <a>, which you don't want. How you fix that is up to you. If $aText is sufficiently unique, you might do something like:
$tdText =~ s/\Q$aText\E.*//s;
In the worst case, you'd have to write your own function to extract the text elements you want, stopping at the <br> (or however you determine the stopping point).
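A minimal sketch of such a function, assuming the cell's own text always comes before the first <br>, as in the HTML from the question. The names leading_text, @rows, and the hash keys are made up for illustration, and the HTML is inlined here instead of coming from $mech->content:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: collect a cell's text up to (not including) its first <br>.
sub leading_text {
    my ($td) = @_;
    my $text = '';
    for my $node ($td->content_list) {
        if (ref $node) {
            # Child elements are objects; stop at the first <br>.
            last if $node->tag eq 'br';
            $text .= $node->as_text;
        }
        else {
            # Plain text children are returned as ordinary strings.
            $text .= $node;
        }
    }
    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return $text;
}

# Sample input copied from the question (assumption: real input looks like this).
my $html = '<table><tr><td>Td-Text<br>'
         . '<a href="Link-I-Want" title="title-I-Want">A-Text</a></td></tr></table>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# One hash per cell, pairing the cell text with its link details.
my @rows;
for my $td ($tree->look_down(_tag => 'td')) {
    my $link = $td->look_down(_tag => 'a') or next;
    push @rows, {
        td_text => leading_text($td),
        a_text  => $link->as_text,
        href    => $link->attr('href'),
        title   => $link->attr('title'),
    };
}
$tree->delete;

print "$_->{td_text} => $_->{a_text} ($_->{href})\n" for @rows;
```

Each entry in @rows keeps the <td> text next to the matching <a> text, href, and title, which is the pairing the question asks for.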
I found that HTML::TreeBuilder is a great way of parsing HTML documents and pulling info out of them. In this case, something like:
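use LWP::Simple;    # provides get()
use HTML::TreeBuilder;

my $page = get($URL) or die "Could not fetch $URL";
my $tree = HTML::TreeBuilder->new_from_content($page);
foreach my $cell ($tree->look_down(_tag => "td")) {
    # extract_links returns a reference to an array of
    # [$url, $element, $attribute, $tag] entries, not elements.
    my $links = $cell->extract_links('a');
    foreach my $link (@$links) {
        my ($url, $element) = @$link;
        print "href: $url; text: ", $element->as_text, "\n";
    }
}
$tree = $tree->delete;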
Resources
- HTML::TreeBuilder
- HTML::Element