Perl's YAML::XS and unicode

Posted on https://www.devze.com, 2023-03-14 08:46. Source: the web

I am trying to use Perl's YAML::XS module on Unicode letters and it doesn't seem to be working the way it should.

I wrote this in a script (which is saved as UTF-8):

use utf8;
binmode STDOUT, ":utf8"; 
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159

use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;

Instead of something sane, -: Å is printed. According to this link, though, it should be working fine.

Yes, when I YAML::XS::Load it back, I get the correct strings again, but I don't like the fact that the dumped string seems to be in some wrong encoding.

Am I doing something wrong? I am always unsure about Unicode in Perl, to be frank...

Clarification: my console supports UTF-8. Also, when I print to a file opened with a utf8 handle (open $file, ">:utf8") instead of to STDOUT, it still doesn't print correct UTF-8 letters.


Yes, you're doing something wrong. You've misunderstood what the link you mentioned means. Dump & Load work with raw UTF-8 bytes; i.e. strings containing UTF-8 but with the UTF-8 flag off.

When you print those bytes to a filehandle with the :utf8 layer, they get interpreted as Latin-1 and converted to UTF-8, producing double-encoded output (which can be read back successfully as long as you double-decode it). You want to binmode STDOUT, ':raw' instead.

Another option is to call utf8::decode on the string returned by Dump. This will convert the raw UTF-8 bytes to a character string (with the UTF-8 flag on). You can then print the string to a :utf8 filehandle.
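To see the mechanism in isolation, here is a minimal sketch that uses only core modules. It assumes that encode('UTF-8', ...) applied to a byte string is a fair stand-in for what the :utf8 layer does on output, and it builds the byte string by hand instead of calling Dump; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string "č: ř\n" (U+010D and U+0159), written with escapes so
# the example does not depend on how this file is saved.
my $chars = "\x{010D}: \x{0159}\n";

# Stand-in for what Dump returns: the same text as raw UTF-8 bytes.
my $bytes = encode('UTF-8', $chars);

# Roughly what a :utf8 layer does to a byte string: every byte is taken
# as a Latin-1 character and encoded to UTF-8 a second time.
my $double = encode('UTF-8', $bytes);

# The second fix: decode the bytes in place, which yields a character
# string that is safe to print to a :utf8 handle.
my $decoded = $bytes;
utf8::decode($decoded);

printf "%d chars, %d bytes, %d bytes double-encoded\n",
    length($chars), length($bytes), length($double);
# prints "5 chars, 7 bytes, 11 bytes double-encoded"
```

The four high bytes of the two Czech letters each grow into two bytes, which is why the double-encoded form is longer and shows up as accented Latin-1 garbage.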

So, either

use utf8;
binmode STDOUT, ":raw"; 
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159

use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;

Or

use utf8;
binmode STDOUT, ":utf8"; 
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159

use YAML::XS;
my $s = YAML::XS::Dump($hash);
utf8::decode($s);
print $s;

Likewise, when reading from a file, you want to read in :raw mode or use utf8::encode on the string before passing it to Load.
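A sketch of both reading options, assuming YAML::XS is installed. The temporary file is only a stand-in for real input, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use YAML::XS ();

# Hypothetical input: a temporary file holding the dumped hash { č => ř }.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} YAML::XS::Dump({ "\x{010D}" => "\x{0159}" });
close $out or die "close failed: $!";

# Option 1: read raw bytes and hand them to Load unchanged.
open my $raw, '<:raw', $path or die "open failed: $!";
my $bytes = do { local $/; <$raw> };
close $raw;
my $from_bytes = YAML::XS::Load($bytes);

# Option 2: read through a :utf8 layer (giving a character string), then
# turn the characters back into UTF-8 bytes before calling Load.
open my $dec, '<:utf8', $path or die "open failed: $!";
my $text = do { local $/; <$dec> };
close $dec;
utf8::encode($text);
my $from_text = YAML::XS::Load($text);
```

Both paths should yield the same hash, with keys and values as ordinary Perl character strings.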

When possible, you should just use DumpFile & LoadFile, letting YAML::XS deal with opening the file correctly. But if you want to use STDIN/STDOUT, you'll have to deal with Dump & Load.
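A minimal round trip through DumpFile and LoadFile might look like this. The path is hypothetical, and the functions are called fully qualified so the sketch does not rely on what YAML::XS exports.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use YAML::XS ();

# Hypothetical output location inside a throwaway directory.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/out.yaml";

# No binmode and no manual encoding: YAML::XS opens the file itself.
YAML::XS::DumpFile($path, { "\x{010D}" => "\x{0159}" });   # { č => ř }
my $data = YAML::XS::LoadFile($path);
```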


It works if you don't use binmode STDOUT, ":utf8";. Just don't ask me why.


This is what I use for UTF-8 JSON and YAML. There is no error handling, but it shows the approach. The code below gives me the following:

  • NFC normalisation on input and no NFD on output, so everything is simply kept in NFC
  • the YAML/JSON files can be edited with UTF-8-enabled vim and bash tools
  • "inside" Perl, things like \w regexes, lc, uc and so on work (at least for my needs)
  • the source code is UTF-8, so I can write regexes like /á/

My boilerplate:

use 5.014;
use warnings;

use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);

use File::Slurp;
use YAML::XS;
use JSON::XS;

run();
exit;

sub run {
    my $yfilein = "./in.yaml"; #input yaml
    my $jfilein = "./in.json"; #input json
    my $yfileout = "./out.yaml"; #output yaml
    my $jfileout = "./out.json"; #output json

    my $ydata = load_utf8_yaml($yfilein);
    my $jdata = load_utf8_json($jfilein);

    #the "uc" is not "fully correct" but works for my needs
    $ydata->{$_} = uc($ydata->{$_}) for keys %$ydata;
    $jdata->{$_} = uc($jdata->{$_}) for keys %$jdata;

    save_utf8_yaml($yfileout, $ydata);
    save_utf8_json($jfileout, $jdata);
}


#using File::Slurp to read/write files
#NFC only on input, and no NFD on output (change this if you want)
#this ensures that I can edit and copy/paste filenames without problems

sub load_utf8_yaml { return YAML::XS::Load(encode_nfc_read(shift)) }
sub load_utf8_json { return decode_json(encode_nfc_read(shift)) }
sub encode_nfc_read { return encode 'utf8', NFC read_file shift, { binmode => ':utf8' } }
#more efficient
sub rawsave_utf8_yaml { return write_file shift, {binmode=>':raw'}, YAML::XS::Dump shift }
#same approach as the JSON version
sub save_utf8_yaml { return write_file shift, {binmode=>':utf8'}, decode 'utf8', YAML::XS::Dump shift }
sub save_utf8_json { return write_file shift, {binmode=>':utf8'}, JSON::XS->new->pretty(1)->encode(shift) }
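Why the NFC step on input matters can be shown with a small core-only sketch. It is an illustration with made-up variable names, not part of the boilerplate above: the same visible letter can arrive as one code point or as a base letter plus a combining mark, and the two only compare equal after normalisation.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{010D}";     # č as a single code point
my $decomposed = "c\x{030C}";    # c followed by COMBINING CARON

# Without normalisation the two spellings are different strings.
my $equal_raw = $composed eq $decomposed ? 1 : 0;

# After NFC both collapse to the single precomposed code point.
my $equal_nfc = NFC($composed) eq NFC($decomposed) ? 1 : 0;

print "raw: $equal_raw, after NFC: $equal_nfc\n";
# prints "raw: 0, after NFC: 1"
```

Normalising once on input keeps hash keys, regexes and file names consistent regardless of which form the editor or operating system produced.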

You can try it with the following in.yaml:

---
á: ä
č: ď
é: ě
í: ĺ
ľ: ň
ó: ô
ö: ő
ŕ: ř
š: ť
ú: ů
ü: ű
ý: ž