
Perl: String literal in module in latin1 - I want utf8

Developer https://www.devze.com 2023-03-20 12:58 Source: web
In the Date::Holidays::DK module, the names of certain Danish holidays are written in Latin1 encoding. For example, January 1st is 'Nytårsdag'. What should I do to $x below in order to get a proper utf8-encoded string?

use Date::Holidays::DK;
my $x = is_dk_holiday(2011,1,1);

I tried various combinations of use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect. I also tried to use Encode's decode, with no luck. More specifically,

use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1", 
           is_dk_holiday(2011,1,1)
          );
Dump($x);
print "January 1st is '$x'\n";

gives the output

SV = PV(0x15eabe8) at 0x1492a10
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1593710 "Nyt\303\245rsdag"\0 [UTF8 "Nyt\x{e5}rsdag"]
  CUR = 10
  LEN = 16
January 1st is 'Nyt sdag'

(with an invalid character between t and s).


"use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect."

Correct. The utf8 pragma only indicates that the source code of the program is written in UTF-8.

"I also tried to use Encode's decode, with no luck."

You misjudged this: you in fact did the right thing. You now have a string of Perl characters and can manipulate it.

"with an invalid character between t and s"

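You are misreading this as well: it is in fact the å character, written out as the single Latin-1 byte 0xE5, which a terminal expecting UTF-8 displays as invalid.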
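
The difference between characters and bytes can be checked with a small sketch. It does not call Date::Holidays::DK; the hard-coded byte string "Nyt\xE5rsdag" is an assumed stand-in for the Latin-1 value that is_dk_holiday(2011,1,1) returns, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed stand-in for the Latin-1 bytes returned by is_dk_holiday(2011,1,1).
my $latin1_bytes = "Nyt\xE5rsdag";

# Decoding turns the bytes into a string of Perl characters.
my $chars = decode('iso-8859-1', $latin1_bytes);

# Encoding turns the characters into UTF-8 octets for output.
my $utf8_bytes = encode('UTF-8', $chars);

printf "characters: %d\n", length $chars;
printf "code point at index 3: U+%04X\n", ord substr($chars, 3, 1);
printf "UTF-8 octets: %d\n", length $utf8_bytes;
```

The decoded string has 9 characters, the fourth being U+00E5 (å), while its UTF-8 form needs 10 octets, which matches CUR = 10 in the Dump output of the question.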


You want to output UTF-8, so you are lacking the encoding step.

my $octets = encode 'UTF-8', $x;
print $octets;

Please read http://p3rl.org/UNI for an introduction to the topic of encoding. You must always decode and encode, either explicitly or implicitly.


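use utf8 is only a hint to the perl interpreter/compiler that your source file is UTF-8 encoded. If your source contains string literals with high-bit characters, perl will automatically decode them into Unicode character strings. It has no effect on strings that come from elsewhere, such as another module.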
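
A minimal sketch of what the pragma changes, assuming this snippet itself is saved as a UTF-8 file (the variable names are only for illustration):

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;    # the literal below is decoded from UTF-8 into characters
    $with_pragma = length "Nytårsdag";
}

{
    no utf8;     # the literal below stays a string of raw source bytes
    $without_pragma = length "Nytårsdag";
}

print "length with use utf8: $with_pragma\n";
print "length without it:    $without_pragma\n";
```

With the pragma the literal counts 9 characters; without it the same literal counts 10 bytes, because å occupies two bytes in a UTF-8 source file.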

If you have a variable that is encoded in iso-8859-1, you must decode it. After that, your variable is in perl's internal Unicode format. That happens to be UTF-8 here, but you shouldn't care which encoding perl uses internally.

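Now, if you want to print such a string, you need to convert the Unicode string back to a byte string by calling encode on it. If you don't encode manually, perl itself writes the string as iso-8859-1 as long as every character fits in a single byte; that is the default. (A string containing characters above 0xFF is written as UTF-8 instead, with a "Wide character" warning if warnings are enabled.)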
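
The default behaviour can be observed without a terminal. The sketch below prints to in-memory filehandles that have no encoding layer, as a stand-in for a plain STDOUT; the byte string is an assumed substitute for the module's return value.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed stand-in for decode("iso-8859-1", is_dk_holiday(2011,1,1)).
my $chars = decode('iso-8859-1', "Nyt\xE5rsdag");

# No encode(): perl falls back to one Latin-1 byte per character.
open my $raw, '>', \my $implicit or die "cannot open in-memory handle: $!";
print {$raw} $chars;
close $raw;

# Explicit encode(): the handle receives UTF-8 octets.
open my $out, '>', \my $explicit or die "cannot open in-memory handle: $!";
print {$out} encode('UTF-8', $chars);
close $out;

printf "implicit: %s\n", join ' ', map { sprintf '%02X', ord } split //, $implicit;
printf "explicit: %s\n", join ' ', map { sprintf '%02X', ord } split //, $explicit;
```

The implicit buffer holds the single byte E5 for å, which is what showed up as an invalid character in the question; the explicit buffer holds the two-byte sequence C3 A5.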

Before you print your variable $x, you need to do a $x = encode('UTF-8', $x) on it.

For correct handling of UTF-8 you always need to decode() every external input over I/O. And you always need to encode() everything that leaves your program.
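
As a rough illustration of that decode-at-input, encode-at-output rule, here is a small sketch. The helper latin1_to_utf8_upper is hypothetical and not part of any module; it only shows where the two conversions sit around ordinary string processing.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes Latin-1 bytes, returns upper-cased UTF-8 octets.
sub latin1_to_utf8_upper {
    my ($latin1_bytes) = @_;
    my $chars = decode('iso-8859-1', $latin1_bytes);   # input boundary: bytes to characters
    my $upper = uc $chars;                             # work on characters, so å becomes Å
    return encode('UTF-8', $upper);                    # output boundary: characters to bytes
}

# Assumed stand-in for the bytes returned by is_dk_holiday(2011,1,1).
my $result = latin1_to_utf8_upper("Nyt\xE5rsdag");
print "$result\n";
```

Everything between the two boundaries operates on characters, which is why uc can map å to Å here.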

To change the default input/output encoding you can use something like this.

use utf8;
use open ':encoding(UTF-8)';
use open ':std';

The first line says that your source code is encoded in UTF-8. The second line says that every input/output handle opened in this scope should automatically decode and encode as UTF-8. It is important to note that open() then also opens files in UTF-8 mode, so if you work with binary files you need to call binmode() on the handle.

But the second line does not change the handling of STDIN, STDOUT or STDERR. The third line changes that.
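
The same effect can be had per handle by naming the layer explicitly. The sketch below uses an in-memory filehandle so the bytes can be inspected; for the real STDOUT the equivalent call would be binmode(STDOUT, ':encoding(UTF-8)'). The byte string is again an assumed stand-in for the module's return value.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed stand-in for decode("iso-8859-1", is_dk_holiday(2011,1,1)).
my $chars = decode('iso-8859-1', "Nyt\xE5rsdag");

# The :encoding(UTF-8) layer encodes on every print, so no manual encode() is needed.
open my $out, '>:encoding(UTF-8)', \my $buffer
    or die "cannot open in-memory handle: $!";
print {$out} "January 1st is '$chars'\n";
close $out;

printf "%d octets written\n", length $buffer;
```

With a layer in place the program only ever handles character strings, and the encoding step happens implicitly at the handle.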

You can probably use the module utf8::all, which makes this process easier. But it is always good to understand how all this works behind the scenes.

To correct your example, one possible way is this:

#!/usr/bin/env perl
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1", 
           is_dk_holiday(2011,1,1)
          );
Dump($x);
print encode("UTF-8", "January 1st is '$x'\n");