Perl XML::Parser encoding problem

Developer https://www.devze.com 2023-02-18 07:48 Source: Internet
I am writing a Perl script that needs to extract some data from an XML file.

The XML file itself is encoded using UTF-8. For some reason, however, what I extract from the file ends up being encoded as ISO-8859-1. The documentation states that whatever is passed to my handlers should be UTF-8, but it just isn't.

The parser is basically something like this:

my $parser = XML::Parser->new( Handlers => {
    # Some unrelated handlers here
    Char => sub {
        my ( $expat, $string ) = @_;
        if ( exists $data->{$curId}{$curField} ) {
            $data->{$curId}{$curField} .= $string;
        } else {
            $data->{$curId}{$curField} = $string;
        }
    } ,
} );

I have tried the following variants for actually parsing:

  • file parsed directly through $parser->parsefile, no options;
  • file parsed directly through $parser->parsefile, with the ProtocolEncoding option;
  • file opened using open( $handle , "<file.xml" ) then parsed through $parser->parse;
  • file opened using open( $handle , '<:utf8' , "file.xml" ) then parsed through $parser->parse.

In addition, I have tried each version with and without the <?xml encoding="utf-8"?> header in the file.

In all cases, what ends up in $data->{$curId}{$curField} is encoded using ISO-8859-1.

What am I doing wrong?


I know you already found an answer from Michel in the comments, but I'll add a few things. With any encoding, you have to be strict about knowing what you're taking in and what you are sending out. If you need something, don't rely on the environment; eventually someone else will use your program and have a screwed-up environment.

When you are reading a file, don't use the ':utf8' layer, which doesn't check that the octets are actually valid UTF-8. Use ':encoding(UTF-8)' instead:

 open my $fh, '<:encoding(UTF-8)', $filename or die "Cannot open $filename: $!";
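As a minimal sketch of what a decoding layer changes, the snippet below reads the same two octets through an in-memory handle, once raw and once through ':encoding(UTF-8)'. The helper name slurp_with_layer and the sample bytes are made up for illustration; only core Perl is assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: read all of $bytes through the given PerlIO layer.
sub slurp_with_layer {
    my ($layer, $bytes) = @_;
    open my $fh, "<$layer", \$bytes
        or die "Cannot open in-memory handle: $!";
    local $/;                 # slurp mode: read everything at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

my $octets = "\xC3\xA9";      # UTF-8 encoding of U+00E9 (e with acute accent)

my $raw     = slurp_with_layer(':raw', $octets);
my $decoded = slurp_with_layer(':encoding(UTF-8)', $octets);

printf "raw: %d chars, decoded: %d chars\n", length $raw, length $decoded;
```

The raw read hands back two one-byte characters, while the decoded read hands back the single character U+00E9. That is the difference between holding octets and holding text.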

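No matter what you think your output handle is, set its encoding explicitly. There are a variety of ways to do this. One is the open pragma, which needs ':std' to cover the standard handles as well:

 use open qw(:std :encoding(UTF-8));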
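To see why the output side matters, here is a rough sketch that prints the same character string to two in-memory handles, one without an encoding layer and one with ':encoding(UTF-8)', and then dumps the octets that were written. The helper octets_written and the sample text are hypothetical; this is an illustration of the symptom described in the question, not a diagnosis of the asker's actual script.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory handle opened with
# $layer and return the octets that were actually written.
sub octets_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open my $out, ">$layer", \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$out} $text;
    close $out or die "Cannot close in-memory handle: $!";
    return $buffer;
}

my $text = "caf\x{E9}";   # character string of 4 characters, last one U+00E9

my $implicit = octets_written(':raw', $text);
my $explicit = octets_written(':encoding(UTF-8)', $text);

printf "no encoding layer: %s\n", unpack 'H*', $implicit;
printf "UTF-8 layer:       %s\n", unpack 'H*', $explicit;
```

Without a layer, U+00E9 goes out as the single octet E9, which is exactly what ISO-8859-1 output looks like. With the layer it goes out as C3 A9.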

From the command line, you can use the -C switch with the S flag to make the standard handles UTF-8 (script.pl stands in for your program):

 perl -CS script.pl input.xml

Tom Christiansen has a long list of things you need to pay attention to.


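Does $data->{$curId}{$curField} have the utf8 flag on?

If you concatenate a string that has the utf8 flag on with one that has it off, Perl upgrades the latter, treating each of its bytes as an ISO-8859-1 character. If those bytes were really undecoded UTF-8, the result is garbled. This is the usual source of problems.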
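A small sketch of that effect, using only the core Encode module. The variable names and sample data are invented for illustration. It joins a decoded character string with undecoded UTF-8 octets, first naively and then after decoding the octets.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars  = "\x{263A}";    # decoded character string (utf8 flag on)
my $octets = "\xC3\xA9";    # undecoded UTF-8 octets for U+00E9 (flag off)

# Naive concatenation: each octet is treated as one Latin-1 character.
my $mixed = $chars . $octets;

# Decode the octets first, then concatenate two character strings.
my $clean = $chars . decode('UTF-8', $octets);

printf "flag on chars: %d, flag on octets: %d\n",
    utf8::is_utf8($chars) ? 1 : 0, utf8::is_utf8($octets) ? 1 : 0;
printf "mixed: %d chars, UTF-8 octets %s\n",
    length $mixed, unpack 'H*', encode('UTF-8', $mixed);
printf "clean: %d chars, UTF-8 octets %s\n",
    length $clean, unpack 'H*', encode('UTF-8', $clean);
```

The naive result has three characters and re-encodes with the familiar double-encoded C3 83 C2 A9 sequence. The decoded result has two characters and re-encodes cleanly.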
