Extract table rows with a specific element from HTML using HTML::TableExtract in Perl

Developer (https://www.devze.com), 2023-03-23 19:41. Source: the web
I've learned the hard way that regexes cannot adequately parse html, prior to finding post after post about it.

I am trying to extract unread PMs from a webpage, where they sit in a table. It's the only table on the page being requested, so that part is nice. Each row is a set of columns describing one PM. The class of the TR indicates whether the PM is unread or read, and that is the part that is tripping me up.

I tried to use HTML::TableExtract which almost worked perfectly, except I can't figure out how to check the TR element.

Example Table Structure:

<table>
    <tr class="header">
        <td></td>
        <td>Subject</td>
        <td>Sender</td>
        <td>Date</td>
    </tr>
    <tr class="unread">
        <td>checkbox for multi-edit stuff</td>
        <td>Example of an unread PM</td>
        <td>Me</td>
        <td>Jul 30, 2011</td>
    </tr>
    <tr class="read">
        ....   
    </tr>
</table>

Using HTML::TableExtract I was able to get everything except the unread/read classes. Like so:

my $t = HTML::TableExtract->new(keep_html => 1);
$t->parse($lwp_data);
foreach my $table ($t->tables) {
    foreach my $row ($table->rows) {
        # Can't find a way to search for <tr class="unread">, as
        # attribute data is stripped at this point by HTML::TableExtract.

        # This now shows EVERY PM in the list
        print join(',', @$row), "\n";
    }
}

How else could I parse this out, and get only the TR's with class="unread"?

Searches resulted in way too complex answers or answers that don't quite solve my problem.

Here's the most recent method I'm using to get what I want. It works; I just wonder whether there is a better way:

 while ($page =~ m/(unreadpm.*?\/tr)/sg) {
      $data = $1;
      if ($data =~ m(value="(\d+)".*?<a href="(inbox.php\?action=viewconv&amp;id=\d+)">(.*?)</a>\n</strong>\s+</td>\n\s+<td>(.*?)</td>)sg) {
           my ($id,$link,$subject,$user) = ($1, $2, $3, $4);
           if ($user =~ m(user\.php\?id=\d+">(.*?)</a>)) {
                $user = $1;
           }

           if (grep $_ eq $id, @ids) {
                print "Message ID: $id already listed\n"
           } else {
                print "Emailing - Subject: $subject by $user. ID: $id Link: $link ...";
                send_email($subject,$user,$link);
                print "done.\n";
                push @ids, $id;
           }
      }
 }


I can recommend HTML::TreeBuilder in combination with XML::LibXML to do the job.

my $tree = HTML::TreeBuilder->new_from_content( $html );
my $xml  = $tree->as_XML;
my $doc = XML::LibXML->load_xml(string => $xml);

You can then use findnodes to find the <tr> nodes using XPath expressions (findvalue returns the string value of a match rather than the nodes themselves).

Using HTML::Selector::XPath you can even use CSS selectors to get to the <tr>.
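
A minimal sketch of how that could look end to end, assuming a page shaped like the sample table in the question. The inline $html, the contents of the "read" row, and the @unread_rows array are made up for illustration; in practice $html would be the fetched page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use XML::LibXML;

# Illustrative sample data modelled on the table in the question.
my $html = <<'HTML';
<table>
    <tr class="header">
        <td></td><td>Subject</td><td>Sender</td><td>Date</td>
    </tr>
    <tr class="unread">
        <td>checkbox for multi-edit stuff</td>
        <td>Example of an unread PM</td>
        <td>Me</td>
        <td>Jul 30, 2011</td>
    </tr>
    <tr class="read">
        <td>checkbox</td><td>An old PM</td><td>Someone</td><td>Jul 1, 2011</td>
    </tr>
</table>
HTML

# Let HTML::TreeBuilder clean up the HTML, then hand the XML to XML::LibXML.
my $tree = HTML::TreeBuilder->new_from_content($html);
my $doc  = XML::LibXML->load_xml(string => $tree->as_XML);
$tree->delete;    # free the HTML::TreeBuilder tree once it is no longer needed

# Collect the cell text of every <tr> whose class attribute is exactly "unread".
my @unread_rows;
for my $tr ($doc->findnodes('//tr[@class="unread"]')) {
    my @cells = map { $_->textContent } $tr->findnodes('./td');
    s/^\s+|\s+$//g for @cells;    # trim stray whitespace around each cell
    push @unread_rows, \@cells;
}

print join(' | ', @$_), "\n" for @unread_rows;
```

The XPath test @class="unread" matches only when the attribute is exactly that string; if a row can carry several classes, a contains()-based expression (or a CSS selector translated by HTML::Selector::XPath) would be the safer choice.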


If I've understood the question then I would do something like:

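my @html_lines = split /\n/, $html;   # $html: the page source, retrieved with curl or otherwise

my $get_line = 0;   # true while we are inside an unread row

foreach my $line (@html_lines)
{
  # Opening tag of an unread row: start collecting
  if ($line =~ /<tr class="unread">/)
  {
      $get_line = 1;
      next;
  }

  # Closing tag of the row we were collecting: stop
  if ( $get_line && $line =~ m|</tr>| )
  {
      $get_line = 0;
      next;
  }

  if ($get_line)
  {
     #process the <td> lines
  }
}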

NB: I'm not guaranteeing that the syntax is correct, and this assumes each tag sits on its own line as in the example, but you get the picture...
