Perl Unicode glitch

开发者 (Developer) https://www.devze.com 2023-02-02 03:34 Source: web
In this output, why am I getting extra newlines after printing non-ASCII Unicode characters?

The platform is Windows Vista. The problem occurs after chcp 65001 but not after chcp 850:

C:\>chcp 850
Active code page: 850

C:\>perl unicode_bug_1.pl
Budweiser
Budweiser
Budweiser
Bud─øjovick├¢ Budvar
Bud─øjovick├¢ Budvar
Bud─øjovick├¢ Budvar

C:\>chcp 65001
Active code page: 65001

C:\>perl unicode_bug_1.pl
Budweiser
Budweiser
Budweiser
Budějovický Budvar

Budějovický Budvar

Budějovický Budvar

The output comes from this program:

#!perl
use strict;
use warnings;

binmode(STDOUT, ":encoding(UTF-8)"); # so no "Wide character in print" warning

print "Budweiser\n" for 1..3;
print "Bud\N{U+011B}jovick\N{U+00FD} Budvar\n" for 1..3;


This seems to be a bug in Perl. I had thought the cause was that Windows code page 65001 is not really supported for the console, but I finally made test programs in C and Perl, and the problem does not happen in the C version. It happens no matter where the Unicode character occurs in the line, but the line you're printing must be wider than the console supports.

Here is my C program:

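#include "stdafx.h"   // Visual Studio precompiled header

#include <stdio.h>    // printf
#include <tchar.h>    // _tmain, _TCHAR
#include <windows.h>  // SetConsoleOutputCP
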
int _tmain(int argc, _TCHAR* argv[])
{
    BOOL b = SetConsoleOutputCP(65001);
    printf("set console output codepage returned %d\n", b);

    printf("cαfe\n");
    printf("1234567890 café\n");
    printf("1234567890 1234567890 cαfe\n");
    printf("1234567890 1234567890 1234567890 café\n");
    printf("1234567890 1234567890 1234567890 1234567890 cαfe\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 café\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n");

    return 0;
}

And here is my Perl program:

#

use utf8;

binmode STDOUT, ':utf8';

printf STDOUT "cαfe\n";
printf STDOUT "1234567890 café\n";
printf STDOUT "1234567890 1234567890 cαfe\n";
printf STDOUT "1234567890 1234567890 1234567890 café\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 cαfe\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 café\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n";

UPDATE

No, I was wrong. With the help of some of the guys at #perl on irc.perl.org, it turns out to be a bug in the Microsoft API. WriteFile is documented to return the number of bytes written, but in this case it returns the number of characters written, which depends on the code page. A bug was filed in March 2010.

There is more discussion in the MSDN forums.

UPDATE 2

I posted to Michael Kaplan's blog, "Sorting it all out", about this problem, and he responded with an article entitled "Hidden in plain site: a purloined letter kind of a bug report". He's a Microsoft internationalization expert, so you will surely find some insights there.


I'm not getting any newlines. Is your command line wide enough to fit your output?
