perl HTML::TableExtract get stripped text

开发者 https://www.devze.com 2023-03-21 03:28 Source: Internet
The rows of my tables in HTML look like this:

<TR bgcolor="#FFFFFF" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#FFFFFF';">
   <TD  class="dlfont">07/01/2011 10:33 AM EDT</B>&nbsp;</TD>
   <TD  class="dlfont">DRB</B>&nbsp;</TD><TD  class="dlfont">Blah</B>&nbsp;</TD>
   <TD  class="dlfont">PPD</B>&nbsp;</TD><TD  class="dlfont"> </B>&nbsp;</TD>
   <TD  class="dlfont">07/01/2011</B>&nbsp;</TD>
   <TD width=50 align=center><A HREF="javascript:parent.nav.details('0701201110:33AMEDTDRBPPD')"><IMG border='0' src='/images/view.gif' height=10 width=19></A></TD>
</TR>


<TR bgcolor="#EEEEEE" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#EEEEEE';">
    <TD  class="dlfont">07/01/2011 10:33 AM EDT</B>&nbsp;</TD>
    <TD  class="dlfont">WHPSF</B>&nbsp;</TD>
    <TD  class="dlfont">Blah</B>&nbsp;</TD>
    <TD  class="dlfont"> </B>&nbsp;</TD>
    <TD  class="dlfont"> </B>&nbsp;</TD>
    <TD  class="dlfont">07/01/2011</B>&nbsp;</TD>  
    <TD width=50 align=center><A HREF="javascript:parent.nav.details('0701201110:33AMEDTWHPSF')"><IMG border='0' src='/images/view.gif' height=10 width=19></A></TD>
</TR>

When I extract the rows using HTML::TableExtract, the extra characters </B>&nbsp; also appear at the end and form some kind of special character. How can I get rid of this?


I would keep two things in mind when using HTML::TableExtract with the badly formed HTML in your question:

  1. Pass keep_html => 1 to the HTML::TableExtract constructor, so each cell comes back as raw HTML.
  2. Use a regex to remove the </B>&nbsp;, carefully.

Here's some Perl code I wrote to strip the </B>&nbsp; from the table cells. Note that it could turn validly formatted HTML into badly formatted HTML if you apply it blindly in all cases.

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TableExtract;

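my ($f) = @ARGV;
die "usage: $0 file.html\n" unless defined $f;

# Read the whole file into one string; fail loudly if it cannot be opened.
open my $fh, '<', $f or die "Cannot open $f: $!";
my $html = do { local $/; <$fh> };
close $fh;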

### Your HTML didn't include headers, so I added a first table row whose
### cells contain the text: time a b c d e f.
### This helps HTML::TableExtract find the table in the file $f.
my $te = HTML::TableExtract->new(
    keep_html=>1,
    headers=>[qw/ time a b c d e f/]);

$te->parse($html);

for my $ts($te->tables)
{
    print "Table(", join(',', $ts->coords), "):\n";
    for my $row ($ts->rows)
    {
        for my $cell (@$row)
        {
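            # Skip undefined or empty cells, but not a cell whose text is "0".
            next unless defined $cell && length $cell;
            ## Maybe add $ at the end of the regex, or another test here, to make
            ## sure valid cases of <B>...</B>&nbsp; are not affected.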
            $cell =~ s/<\/B>&nbsp;//i;
            print $cell."\n";
        }
    }
}
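
The comment inside the loop suggests guarding the substitution so that well-formed <B>...</B>&nbsp; markup is left alone. Below is a minimal sketch of one way to do that, not a definitive fix. The helper name clean_cell is hypothetical (it is not part of HTML::TableExtract), and the rule it applies, stripping only when the cell contains no opening <B> tag, is an assumption about what counts as "stray" markup on this page. It also accepts a literal non-breaking space (\xA0), which is what &nbsp; becomes once entities are decoded, that is, when keep_html is not set.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: tidy the raw HTML of a single table cell.
sub clean_cell {
    my ($cell) = @_;
    return '' unless defined $cell;

    # Only treat </B> as stray when the cell has no opening <B> tag,
    # so a well-formed <B>...</B>&nbsp; is left untouched.
    unless ($cell =~ /<B\b/i) {
        # Remove "</B>" followed by "&nbsp;" (or a decoded \xA0),
        # but only at the very end of the cell.
        $cell =~ s{</B>\s*(?:&nbsp;|\xA0)\s*\z}{}i;
    }

    # Trim leading and trailing whitespace left behind.
    $cell =~ s/\A\s+|\s+\z//g;
    return $cell;
}

# Sample cells taken from the rows in the question.
print clean_cell('07/01/2011 10:33 AM EDT</B>&nbsp;'), "\n";
print clean_cell('DRB</B>&nbsp;'), "\n";
```

Inside the loop, $cell = clean_cell($cell); could then take the place of the bare substitution.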