Why do I get an extra newline in the middle of a UTF-8 character with XML::Parser?

Posted on https://www.devze.com, 2022-12-24 05:48. Source: web.
I encountered a problem dealing with UTF-8, XML and Perl. The following is the smallest piece of code and data in order to reproduce the problem.

Here's an XML file that needs to be parsed:

<?xml version="1.0" encoding="utf-8"?>
<test>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>

  [<words> .... </words> 148 times repeated]

  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
</test>

The parsing is done with this Perl script:

use warnings;
use strict;

use XML::Parser;
use Data::Dump;

my $in_words = 0;

my $xml_parser = XML::Parser->new(Style => 'Stream');

$xml_parser->setHandlers (
   Start   => \&start_element,
   End     => \&end_element,
   Char    => \&character_data,
   Default => \&default);

open OUT, '>out.txt'; binmode (OUT, ":utf8");
open XML, 'xml_test.xml' or die;
$xml_parser->parse(*XML);
close XML;
close OUT;


sub start_element {
  my($parseinst, $element, %attributes) = @_;

  if ($element eq 'words') {
    $in_words = 1;
  }
  else {
    $in_words = 0;
  }
}

sub end_element {
  my($parseinst, $element, %attributes) = @_;

  if ($element eq 'words') {
    $in_words = 0;
  }
}

sub default {
  # nothing to see here;
}

sub character_data {
  my($parseinst, $data) = @_;

  if ($in_words) {
    print OUT "$data\n";
  }
}

When the script runs, it produces the file out.txt. The problem is on line 147 of that file: the 22nd character (which in UTF-8 consists of the bytes \xd6 \xb8) is split by a newline between the d6 and the b8. This should not happen.
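To pin down exactly where the output is damaged, a small checker can help. The following is only a sketch, assuming out.txt really contains a broken byte sequence as described above; the name broken_utf8_lines is made up for this example. It reads the bytes raw and reports every line that is not well-formed UTF-8, using the core Encode module:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the 1-based numbers of all lines in a byte
# string that are not well-formed UTF-8.
sub broken_utf8_lines {
    my ($bytes) = @_;
    my @bad;
    my $line_no = 0;
    for my $line (split /\n/, $bytes, -1) {
        $line_no++;
        my $copy = $line;    # decode() with a CHECK value may modify its input
        my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
        push @bad, $line_no unless $ok;
    }
    return @bad;
}

# Usage: perl check_utf8.pl out.txt
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "cannot open $ARGV[0]: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    print "line $_ is not valid UTF-8\n" for broken_utf8_lines($bytes);
}
```

If a two-byte character is split by a newline, both halves are flagged: the line ending in the lead byte and the line starting with the continuation byte.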

I would like to know whether anyone else has seen this problem or can reproduce it, and why it happens. I am running the script on Windows:

C:\temp>perl -v

This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)

Copyright 1987-2007, Larry Wall

Binary build 1003 [285500] provided by ActiveState http://www.ActiveState.com
Built May 13 2008 16:52:49


What happens when you open your input file with an explicit UTF-8 encoding?

 open XML, '<:utf8', 'xml_test.xml' or die;

Never trust anything to get an encoding correct by guessing. Whenever you can, explicitly add the encoding yourself.

Also, are you sure that the input is correct? Does it pass validation with another tool, such as xmllint? I know XML::Parser should catch that sort of thing, but let's verify it.

Also, can you put just the problematic input into a string and print it again without a problem? What happens when you remove just that part of the XML file? Does the same error pop up for another record?
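The "put it into a string and print it again" check could look roughly like the sketch below. It is an illustration only: the helper name round_trip is invented here, and the sample is just three code points (bet, dagesh, qamats) rather than the full line from the file. Qamats, U+05B8, is the character whose UTF-8 encoding is the \xd6 \xb8 pair from the question. The text is printed through a ":utf8" layer into an in-memory file, and the resulting bytes are decoded strictly:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: print $text plus a newline through a ":utf8" layer
# into an in-memory file, then strictly decode the bytes that were written.
# Dies if those bytes are not well-formed UTF-8.
sub round_trip {
    my ($text) = @_;
    my $buffer = '';
    open my $out, '>:utf8', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$out} "$text\n";
    close $out or die "close failed: $!";
    my $bytes = $buffer;    # decode() with a CHECK value may modify its input
    return Encode::decode('UTF-8', $bytes, Encode::FB_CROAK);
}

# Bet (U+05D1), dagesh (U+05BC), qamats (U+05B8), written as escapes so the
# script itself stays plain ASCII.
my $sample = "\x{05D1}\x{05BC}\x{05B8}";
my $status = round_trip($sample) eq "$sample\n" ? 'ok' : 'differs';
print "round trip $status\n";
```

If this round trip is clean while the parser output is not, printing alone is unlikely to be the culprit, and attention can shift to how the data reaches the Char handler.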


I do not observe this with

C:\Temp> perl -v

This is perl, v5.10.1 built for MSWin32-x86-multi-thread
(with 2 registered patches, see perl -V for more detail)

Copyright 1987-2009, Larry Wall

Binary build 1006 [291086] provided by ActiveState http://www.ActiveState.com
Built Aug 24 2009 13:48:26
C:\Temp> perl -MXML::Parser -e "print $XML::Parser::VERSION"
2.36