Perl: Problem with changing encoding in the middle of reading a file

Developer https://www.devze.com 2023-02-16 16:31 Source: Internet
I am using Perl to load some 'macro' files. These macros can be written in various encodings, so there is a directive that macro authors can use to declare the encoding (e.g.

#encoding iso-8859-2

at the beginning of the macro).

Every time this directive is encountered in a macro, a function that sets the encoding is called. It looks something like this:

sub change_encoding {
  my ($file_handle, $encoding) = @_;
  $file_handle->flush();
  binmode($file_handle);           # get rid of IO layers
  binmode($file_handle,":encoding($encoding)");
}

The problem is that when I read the macro using the standard loop

while($line = <$file_handle>){
  process_macro($line);
}

I get messages saying "utf8 "\xXY" does not map to Unicode", but only if characters with diacritics are near the #encoding directive. I tried several examples, and I was able to get a string in which one half shows \xXY codes and the other half is decoded correctly, like here:

sub macro5_fn {
  print "\xBElu\xBBou\xE8k\xFD k\xF9\xF2 úpěl ďábelské ódy\n";
}

If I put more comments before the function, all the characters are OK:

sub macro5_fn {
  print "žluťoučký kůň úpěl ďábelské ódy\n";
}

Simply put, the number of correctly decoded characters depends on how far these characters are from the #encoding directive: the ones that are close to it are not decoded correctly.

It seems to me that this is an issue of Perl and PerlIO (not) flushing the buffer. Or am I doing something wrong?

Thank you for your answers.


The problem is that <> reads more than just one line from the file into its buffer, so the next line or so has already been interpreted under the old encoding before you ever see the #encoding directive for the new one.
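The read-ahead can be made visible with a small experiment. The sketch below is only an illustration, not part of the original macro loader: it writes a hypothetical three-line temporary file, reads a single line, and then compares the position Perl reports with tell against the operating system's file offset obtained through sysseek. With the default buffered I/O layer, the physical offset should already be well past the first line.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);
use File::Temp qw(tempfile);

# Write a small three-line file to read back (hypothetical sample data).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "line one\n", "line two\n", "line three\n";
close $tmp_fh or die "close failed: $!";

open my $fh, '<:raw', $path or die "Cannot open $path: $!";
my $first = <$fh>;                          # ask for one line only

# Logical position: how much of the file the program has consumed.
my $logical = tell($fh);

# Physical position: how much the buffered layer has already pulled in
# from the operating system. sysseek bypasses Perl's buffer.
my $physical = sysseek($fh, 0, SEEK_CUR);
close $fh;

printf "logical position:  %d\n", $logical;
printf "physical position: %d\n", $physical;
```

Everything between the two positions is sitting in the buffer when the layer is switched, which is why text close to the directive is affected while text further away is not.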

Your best bet is probably to read the file in binary mode and use the Encode module to decode each line from the current encoding.
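A minimal sketch of that approach follows. The function name read_macro_lines and the UTF-8 default are assumptions made for illustration, not part of the original code. It also assumes that directive lines are plain ASCII and that every encoding in use is ASCII-compatible with "\n" line endings (so not UTF-16 or UTF-32), because the raw bytes are split into lines before they are decoded.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);
use File::Temp qw(tempfile);

# Read a macro file as raw bytes and decode each line by hand, switching
# the encoding whenever a "#encoding NAME" directive is seen.
# Hypothetical helper; the default encoding is an assumption.
sub read_macro_lines {
    my ($path, $default_encoding) = @_;
    my $encoding = $default_encoding // 'UTF-8';

    # ':raw' means no decoding layer, so buffering cannot mis-decode anything.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @lines;
    while (my $raw = <$fh>) {
        if (my ($name) = $raw =~ /^#encoding\s+(\S+)/) {
            find_encoding($name)
                or die "Unknown encoding '$name' in $path";
            $encoding = $name;       # applies to this and all later lines
        }
        push @lines, decode($encoding, $raw);
    }
    close $fh;
    return @lines;
}

# Usage: an ISO-8859-2 file whose text sits right after the directive.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "#encoding iso-8859-2\n",
                "\xBElu\xBBou\xE8k\xFD k\xF9\xF2\n";
close $tmp_fh or die "close failed: $!";

my @decoded = read_macro_lines($path);
binmode STDOUT, ':encoding(UTF-8)';
print @decoded;
```

Because the handle never carries an :encoding layer, there is nothing to switch in the middle of the file, and the distance between the directive and the text no longer matters. Called without a CHECK argument, decode replaces malformed input with substitution characters; passing Encode::FB_CROAK as a third argument would make it die instead.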
