Possible Duplicate:
Which CPAN module would you recommend for turning HTML into plain text?
Question:
- Is there a module to render HTML, specifically to gather the text while adhering to font-style tags such as <tt>, <b>, <i>, etc. and the line-break tag <br>, similar to Lynx?
For example:
# cat test.html
<body>
<div id="foo" class="blah">
<tt>test<br>
<b>test</b><br>
whatever<br>
test</tt>
</div>
</body>
# lynx.exe --dump test.html
test
test
whatever
test
Note: the second line should be bold.
Lynx is a big program, and its HTML rendering is non-trivial to reproduce.
How about this:
my $lynx = '/path/to/lynx';
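my $html = '[ html here ]'; # placeholder: put the HTML source here
# Quote the heredoc delimiter so the shell does not expand $ or backticks in the HTML.
my $txt = `$lynx --dump --width 9999 -stdin <<'EOF'\n$html\nEOF\n`;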
Go to search.cpan.org and search for "HTML text", which will give you lots of options to suit your particular needs. HTML::FormatText is a good baseline; from there you can branch out into specific variations of it, for example HTML::FormatText::WithLinks if you want to preserve links as footnotes.
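As a rough illustration of the HTML::FormatText baseline, here is a minimal sketch that feeds the question's sample markup through HTML::TreeBuilder and formats the tree. The margin values are arbitrary choices, not recommendations. As far as I know, HTML::FormatText produces plain text only, so the bold styling asked for in the question would not survive this route.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Sample markup taken from the question.
my $html = <<'HTML';
<body>
<div id="foo" class="blah">
<tt>test<br>
<b>test</b><br>
whatever<br>
test</tt>
</div>
</body>
HTML

# Build a parse tree, then format it as plain text.
my $tree      = HTML::TreeBuilder->new_from_content($html);
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
my $text      = $formatter->format($tree);
$tree->delete;    # release the tree explicitly

print $text;
```

If the links matter, HTML::FormatText::WithLinks should be usable in much the same way, though its constructor options differ.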
I am on Windows, so I cannot fully test this, but you can adapt the htext example that comes with HTML::Parser:
#!/usr/bin/perl
use strict; use warnings;
use HTML::Parser 3.00 ();
use Term::ANSIColor;

my %inside; # nesting depth per tag name

# Called for start tags (+1) and end tags (-1).
sub tag {
    my($tag, $num) = @_;
    $inside{$tag} += $num;
    print " "; # not for all tags
}

# Called for text; colors it according to the enclosing tags.
sub text {
    return if $inside{script} || $inside{style};
    my $esc = 1; # true if a color escape was emitted
    if ( $inside{b} or $inside{strong} ) {
        print color 'blue';
    }
    elsif ( $inside{i} or $inside{em} ) {
        print color 'yellow';
    }
    else {
        $esc = 0;
    }
    print $_[0];
    print color 'reset' if $esc;
}

HTML::Parser->new(api_version => 3,
    handlers => [
        start => [\&tag, "tagname, '+1'"],
        end => [\&tag, "tagname, '-1'"],
        text => [\&text, "dtext"],
    ],
    marked_sections => 1,
)->parse_file(shift) || die "Can't open file: $!\n";
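One possible adaptation, closer to what the question asks for: the sketch below collects the output into a string, starts a new line on <br>, and uses the 'bold' and 'underline' attributes rather than colors. The helper name render_html is hypothetical, the whitespace handling is my own simplification rather than anything Lynx does, and underline for <i>/<em> is an arbitrary choice.

```perl
use strict;
use warnings;
use HTML::Parser 3.00 ();
use Term::ANSIColor qw(colored);

# Hypothetical helper: render an HTML string to text with ANSI styling.
sub render_html {
    my ($html) = @_;
    my %inside;      # nesting depth per tag name
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # deliver each text run in one piece
        start_h => [ sub {
            my ($tag) = @_;
            $inside{$tag}++;
            $out .= "\n" if $tag eq 'br';    # <br> starts a new line
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            $inside{$tag}-- if $inside{$tag};
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            return if $inside{script} || $inside{style};
            $text =~ s/\s+/ /g;                       # collapse whitespace runs
            $text =~ s/^ // if $out =~ /(?:\A|\s)\z/; # no space at line start
            return if $text eq '';
            if ( $inside{b} || $inside{strong} ) {
                $text = colored($text, 'bold');
            }
            elsif ( $inside{i} || $inside{em} ) {
                $text = colored($text, 'underline');
            }
            $out .= $text;
        }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return $out;
}

my $out = render_html(
    "<tt>test<br>\n<b>test</b><br>\nwhatever<br>\ntest</tt>\n"
);
print $out, "\n";
```

On a terminal that understands ANSI escapes, the second line should come out bold. Block-level tags such as <p> or <div> are not given line breaks here; they would need the same treatment as <br>.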