My tables' rows in HTML are as follows,
<TR bgcolor="#FFFFFF" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#FFFFFF';">
<TD class="dlfont">07/01/2011 10:33 AM EDT</B> </TD>
<TD class="dlfont">DRB</B> </TD><TD class="dlfont">Blah</B> </TD>
<TD class="dlfont">PPD</B> </TD><TD class="dlfont"> </B> </TD>
<TD class="dlfont">07/01/2011</B> </TD>
<TD width=50 align=center><A HREF="javascript:parent.nav.details('0701201110:33AMEDTDRBPPD')"><IMG border='0' src='/images/view.gif' height=10 width=19></A></TD>
</TR>
<TR bgcolor="#EEEEEE" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#EEEEEE';">
<TD class="dlfont">07/01/2011 10:33 AM EDT</B> </TD>
<TD class="dlfont">WHPSF</B> </TD>
<TD class="dlfont">Blah</B> </TD>
<TD class="dlfont"> </B> </TD>
<TD class="dlfont"> </B> </TD>
<TD class="dlfont">07/01/2011</B> </TD>
<TD width=50 align=center><A HREF="javascript:parent.nav.details('0701201110:33AMEDTWHPSF')"><IMG border='0' src='/images/view.gif' height=10 width=19></A></TD>
</TR>
When I extract the rows using HTML::TableExtract, the extra chara开发者_StackOverflowcters </B>
also appear at the end and form some kind of special character. How can I get rid of this?
I would keep in mind two things when using HTML::TableExtract with the badly formatted HTML in your question
- use
keep_html=>1
in the HTML::TableExtract constructor - use a regex to remove the
</B>
, carefully
Here's some Perl code I wrote to prune the </B>
out of the table cells, but note, this could change validly formatted HTML to badly formatted HTML if you blindly apply it in all cases.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
my($f) = @ARGV;
open F,$f;
my $html = join '',<F>;
close F;
### your html didn't include headers, so I added a first table row with td text, time a b c d e f, to help HTML::TableExtract find the table in file, $f
my $te = HTML::TableExtract->new(
keep_html=>1,
headers=>[qw/ time a b c d e f/]);
$te->parse($html);
for my $ts($te->tables)
{
print "Table(",join(',',$ts->coords),":\n";
for my $row ($ts->rows)
{
for my $cell (@$row)
{
next unless $cell;
## maybe add $ at end of regex or other test here to make sure valid cases of <B>...</B> are not affected
$cell =~ s/<\/B> //i;
print $cell."\n";
}
}
}
精彩评论