
Why does Perl lose foreign characters on Windows; can this be fixed (if so, how)?

Note below how ã changes to a. NOTE2: Before you blame this on CMD.EXE and Windows pipe weirdness, see Experiment 2 below, which hits a similar problem using File::Find.


The particular problem I'm trying to fix involves working with image files stored on a local drive and manipulating their file names, which may contain foreign characters. The two experiments shown below are intermediate debugging steps.

The ã character is common in Latin-script languages, e.g. Portuguese: http://pt.wikipedia.org/wiki/Cão

Experiment 1

Look closely, note how cão becomes cao.

[screenshot: console session piping a directory listing through Perl; cão is printed as cao]
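For what it's worth, the accent loss can be reproduced away from the console with Encode. This is only a sketch of the suspected mechanism, assuming the console was using the OEM code page cp437, which has no slot for ã; Windows' own "best-fit" conversion silently substitutes a plain a, while Perl's Encode substitutes ? by default:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "ã" (U+00E3) does not exist in the OEM console code page cp437, so a
# byte-oriented conversion has to substitute something. Windows' best-fit
# mapping quietly picks "a"; Encode's default substitution picks "?".
my $oem = encode('cp437', "c\x{e3}o");
print "$oem\n";    # prints "c?o" - either way the accent is gone
```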

Experiment 2

Here I tried using File::Find instead of piped input, in case the issue was with the Windows implementation of the | shell operator. The issue actually gets worse: the ã becomes Π (the Greek letter Pi):

[screenshot: console session of the File::Find run; ã is printed as the Greek letter Pi]
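The ã-to-Pi switch is consistent with a raw byte crossing a code-page boundary unconverted: 0xE3 is ã in the Windows ANSI code page cp1252 but the Greek letter π in the OEM console code page cp437. A small sketch (the specific code-page numbers are an assumption about this machine's locale):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encode "ã" to its single cp1252 byte (0xE3), then read that same byte
# back as cp437, the way a cp437 console would display it.
my $byte   = encode('cp1252', "\x{e3}");
my $as_oem = decode('cp437', $byte);
printf "U+%04X\n", ord($as_oem);    # U+03C0, GREEK SMALL LETTER PI
```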


Debugging update:

I tried some of the tricks listed at http://perldoc.perl.org/perlunicode.html (e.g. use utf8, use feature 'unicode_strings'), to no avail.


Environment and Version Info

The OS is Windows 7, 64-bit.

The Perl version is:

This is perl 5, version 12, subversion 2 (v5.12.2) built for MSWin32-x64-multi-thread
(with 8 registered patches, see perl -V for more detail)

Copyright 1987-2010, Larry Wall

Binary build 1202 [293621] provided by ActiveState http://www.ActiveState.com
Built Sep  6 2010 22:53:42


Perl, like many other scripting languages, is built on the C runtime.

On Windows, the standard MS C runtime for narrow (byte) characters uses an encoding which defaults to the Windows system encoding (‘ANSI code page’) for IO activities such as opening files or writing to the console.

The ANSI code page is always a locale-specific encoding: usually single-byte, but multi-byte in some locales (e.g. China, Japan). It is never UTF-8 or anything else capable of representing the whole of Unicode, so which characters Perl IO can cope with depends on the Windows locale (the "language for non-Unicode programs" setting).
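To illustrate the locale dependence (a sketch; the two code pages named here are just examples), the same character can be representable under one ANSI code page and not another:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "ã" survives a trip through Western cp1252 (it becomes byte 0xE3) but
# has no mapping at all in Cyrillic cp1251, where Encode substitutes "?".
my $western  = encode('cp1252', "\x{e3}");
my $cyrillic = encode('cp1251', "\x{e3}");
printf "cp1252: %02X  cp1251: %s\n", ord($western), $cyrillic;
```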

While console apps can be switched to UTF-8 with the chcp 65001 command, doing so triggers a number of serious inconsistencies. This causes difficulty for many tools on Windows and is something Microsoft really needs to fix, but so far their attitude has been that Unicode equals UTF-16: everyone who wants Unicode to work must use the wide-character interfaces.

So you won't currently be able to deal with files that use non-ASCII filenames reliably in Perl on Windows. Sorry.

You could try Python (which added special Windows-only filename handling to get around this problem in version 2.3 onwards; see PEP 277), or one of the Unicode-aware Windows Scripting Host languages. Either way, getting Unicode out to the console on Windows still has more pitfalls.


The following three-liner works as expected on my newly minted ActivePerl 5.12.2:

use strict;
use warnings;
use utf8;
open(my $file, '>:encoding(UTF-8)', 'output.txt') or die $!;
print $file "さっちゃん";
close $file;

I think the culprit is cmd.exe.

