How can I add entity declarations via XML::Twig programmatically?

Developer https://www.devze.com 2023-01-20 19:16 Source: web
For the life of me I cannot understand the XML::Twig documentation for entity handling.

I've got some XML I'm generating with HTML::Tidy. The call is as follows:

my $tidy = HTML::Tidy->new({
    'indent'          => 1,
    'break-before-br' => 1,
    'output-xhtml'    => 0,
    'output-xml'      => 1,
    'char-encoding'   => 'raw',
});

$str = "foo   bar";
$xml = $tidy->clean("<xml>$str</xml>");

which produces:

<html>
  <head>
    <meta content="tidyp for Linux (v1.02), see www.w3.org" name="generator" />
    <title></title>
  </head>
  <body>foo &nbsp; bar</body>
</html>

XML::Twig (understandably) barfs at the &nbsp;. I want to do some transformations, running it through XML::Twig:

my $twig = XML::Twig->new(
  twig_handlers => {... handlers ...}
);

$twig->parse($xml);

The $twig->parse line barfs on the &nbsp;, but I can't figure out how to declare the &nbsp; entity programmatically. I tried things like:

my $entity = XML::Twig::Entity->new("nbsp", "&#160;");
$twig->entity_list->add($entity);
$twig->parse($xml);

... but no joy.

Please help =)


A dirty, but efficient, trick in a case like this would be to add a fake DTD declaration.

Then XML::Parser, which does the parsing, will assume that the entity is defined in the DTD and won't barf on it.

To get rid of the fake DTD declaration, you can output the root of the twig. If you need a different declaration, create it and replace the current one:

#!/usr/bin/perl 

use strict;
use warnings;

use XML::Twig;

my $fake_dtd= '<!DOCTYPE head SYSTEM "foo"[]>'; # foo may not even exist

my $xml='<html>
  <head>
    <meta content="tidyp for Linux (v1.02), see www.w3.org" name="generator" />
    <title></title>
  </head>
  <body>foo &nbsp; bar</body>
</html>';

XML::Twig->new->parse( $fake_dtd . $xml)->root->print;


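Alternatively, declare the entity in an internal DTD subset prepended to the document, so the parser knows what &nbsp; expands to:

use strict;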
use XML::Twig;

my $doctype = '<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html [<!ENTITY nbsp "&#160;">]>';
my $xml = '<html><head><meta content="tidyp for Linux (v1.02), see www.w3.org" name="generator" /><title></title></head><body>foo &nbsp; bar</body></html>';

my $xTwig = XML::Twig->new();

$xTwig->safe_parse($doctype . $xml) or die "Failure to parse XML : $@";

print $xTwig->sprint();
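
If more than one HTML entity shows up in the Tidy output, the DOCTYPE string can be generated rather than typed by hand. The following is only a sketch: doctype_with_entities is a hypothetical helper, not part of XML::Twig, and the entity names and code points passed to it (nbsp => 160, copy => 169) are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper (not an XML::Twig API): build a DOCTYPE whose internal
# subset declares each named entity as a numeric character reference.
sub doctype_with_entities {
    my ($root, %entities) = @_;

    # One <!ENTITY ...> declaration per name, sorted for stable output.
    my $decls = join '',
        map { sprintf '<!ENTITY %s "&#%d;">', $_, $entities{$_} }
        sort keys %entities;

    return "<!DOCTYPE $root [$decls]>";
}

# Assumed entity list; extend it with whatever entities your input uses.
my $doctype = doctype_with_entities( 'html', nbsp => 160, copy => 169 );
print "$doctype\n";
```

The returned string can then be prepended to the XML before parsing, in the same way $doctype is prepended in the example above.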


There may be a better way, but the code below worked for me:

my $filter = sub {
    my $text      = shift;
    my $nbsp_char = "\x{a0}";    # non-breaking space (U+00A0, not ASCII)
    my $nbsp      = '&nbsp;';
    $text =~ s/$nbsp_char/$nbsp/g;    # /g replaces every occurrence, not just the first
    return $text;
};

XML::Twig->new( output_filter => $filter )
         ->parse_html( $xml )
         ->print;