Is there a way to give HTML::Strip a whitelist so that it preserves certain tags?
Given the markup below
<div><b>test</b></div>
Stripped with this code
my $hs = HTML::Strip->new();
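open my $fh, '<', 'test.markup' or die "Cannot open test.markup: $!";
my $raw_html = do { local $/; <$fh> };  # slurp the whole file, not just its first line
close $fh;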
my $clean_text = $hs->parse( $raw_html );
$hs->eof;
Produces output below
test
However, with the <b> tag whitelisted, I would like to get the output below.
<b>test</b>
EDIT, ONE SOLUTION
Using HTML::StripScripts::Parser
my $hss = HTML::StripScripts::Parser->new(
{
Context => 'Inline',
EscapeFiltered => 0,
BanAllBut => [qw(i b u)],
},
strict_comment => 0,
strict_names => 0,
);
$hss->filter_html("<div><b>test</b></div>");
my $cooked = $hss->filtered_document;
$cooked =~ s/<!--filtered-->//g;
print $cooked; # <b>test</b>
Having read both the Perl wrapper and the underlying XS code, I see no whitelist capability.
Adding one is possible, though not entirely trivial: the code already checks tag names against its "strip" tags such as <script>, and it is only about 200 lines of code.
As another approach, the RegEx book from O'Reilly has a regular expression recipe that can strip HTML tags (including whitelist capability).
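To give a rough idea of what a regex with a whitelist can look like, here is a minimal sketch in plain Perl. It is not the recipe from the book, and strip_tags_except is a hypothetical helper name, not part of HTML::Strip. It assumes simple, well-formed markup: it does not handle comments, a ">" inside an attribute value, or the contents of <script> blocks, and it drops attributes from the tags it keeps.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Strip): remove every tag whose
# name is not in the whitelist, keeping the text between the tags.
sub strip_tags_except {
    my ($html, @allowed) = @_;
    my %keep = map { lc($_) => 1 } @allowed;

    # Match an opening or closing tag and capture the slash and the name.
    # Kept tags are re-emitted without attributes; all others are deleted.
    $html =~ s{<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>}
              { $keep{ lc($2) } ? "<$1$2>" : '' }ge;
    return $html;
}

print strip_tags_except('<div><b>test</b></div>', qw(b i u)), "\n";  # prints <b>test</b>
```

For untrusted input a real parser is the safer choice; this only shows the shape of the idea.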
If you'd rather not mess with regular expressions, try HTML::StripScripts::Parser, which appears to use whitelists.