How do I copy a file with a UTF-8 filename to another UTF-8 filename in Perl on Windows?_问答_开发者

For example, given an empty file テスト.txt, how would I make a copy called テスト.txt.copy?

My first crack at it managed to access the file and create the new filename, but the copy generated ãƒ†ã‚¹ãƒˆ.txt.copy.

Here was my first crack at it:

#!/usr/bin/env perl

use strict;
use warnings;

use English '-no_match_vars';
use File::Basename;
use Getopt::Long;

use File::Copy;
use Win32;

my (
    $output_relfilepath,
   ) = process_command_line();

open my $fh, '>', $output_relfilepath or die $!;
binmode $fh, ':utf8';
foreach my $short_basename ( glob( '*.txt') ) {

  # skip the output basename if it's in the glob
  if ( $short_basename eq $output_relfilepath ) {
    next;
  }

  my $long_basename = Win32::GetLongPathName( $short_basename );
  my $new_basename  = $long_basename . '.copy';

  print {$fh} sprintf(
                      "short_basename = (%s)\n" .
                      " long_basename = (%s)\n" .
                      "  new_basename = (%s)\n",
                      $short_basename,
                      $long_basename,
                      $new_basename,
                     );
  copy( $short_basename, $new_basename );
}

printf(
       "\n%s done! (%d seconds elapsed)\n",
       basename( $0 ),
       time() - $BASETIME,
      );

# === subroutines ===

sub process_command_line {

  # default arguments
  my %args
    = (
       output_relfilepath => 'output.txt',
      );

  GetOptions(
             'help'                 => sub { print usage(); exit },
             'output_relfilepath=s' => \$args{output_relfilepath},
            );

  return (
          $args{output_relfilepath},
         );
}

sub usage {
  my $script_name = basename $0;

  my $usage = <<END_USAGE;
======================================================================

Test script to copy files with a UTF-8 filenames to files with
different UTF-8 filenames.  This example tries to make copies of all
.txt files with versions that end in .txt.copy.

  usage: ${script_name} (<options>)

options:

  -output_relfilepath <s>   set the output relative file path to <s>.
                            this file contains the short, long, and
                            new basenames.
                            (default: 'output.txt')

----------------------------------------------------------------------

examples:

  ${script_name}

======================================================================
END_USAGE

  return $usage;
}

Here are the contents of output.txt after execution:

short_basename = (BD9A~1.TXT)
 long_basename = (テスト.txt)
  new_basename = (テスト.txt.copy)

I've tried replacing File::Copy's copy command with a system call:

my $cmd 开发者_StackOverflow中文版= "copy \"${short_basename}\" \"${new_basename}\"";
print `$cmd`;

and with Win32::CopyFile:

Win32::CopyFile( $short_basename, $new_basename, 'true' );

Unfortunately, I get the same result in both cases (ãƒ†ã‚¹ãƒˆ.txt.copy). For the system call, the print shows 1 file(s) copied. as expected.

Notes:

I'm running Perl 5.10.0 via Strawberry Perl on Windows 7 Professional
I use the Win32 module to access long filenames
The glob returns short filenames, which I have to use to access the file
テスト = test (tesuto) in katakana
I've read perlunitut and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

This should be possible with the CopyFileW function from Win32API::File, which should be included with Strawberry. I've never messed with Unicode filenames myself, so I'm not sure of the details. You might need to use Encode to manually convert the filename to UTF-16LE (encode('UTF16-LE', $filename)).

You're getting the long filename using Win32, which gives you a UTF-8-encoded string.

However, you're then setting the long filename using plain copy, which uses the C stdlib IO functions. The stdlib functions use the default filesystem encoding.

On modern Linuxes that's usually UTF-8, but on Windows it (sadly) never is, because the system default code page cannot be set to UTF-8. So you'll get your UTF-8 string interpreted as a code page 1252 string on a Western European Windows install, as has happened here. (On a Japanese machine it'd get interpreted as code page 932 — like Shift-JIS — which would come out something like 繝�せ繝�.)

I've not done this in Perl, but I'd suspect the Win32::CopyFile function would be more likely to be able to handle the kind of Unicode paths returned elsewhere in the Win32 module.

Use Encode::Locale:

use Encode::Locale;
use Encode;
use File::Copy;

copy( encode(locale_fs => $short_basename),
      encode(locale_fs => $new_basename) ) || die $!;

I successfully duplicated your problem on my Windows machine (Win XP Simplified Chinese version) and my conclusion is that the problem is caused by the font. Choose a Truetype font rather than Raster fonts and see if everything is okay.

My experiment is this:

I first changed the code page of my Windows Console from the default 936 (GBK) to 65001 (UTF-8). by typing C:>chcp 65001
I wrote a scrip that contains the code: $a= "テスト"; print $a; and saved it as UTF-8.
I ran the script from the Console and found "テスト" became "ãƒ†ã‚¹ãƒˆ", which is exactly the same sympton you described in your question.
I changed the Console Font from Raster Fonts to Lucida Console, the console screen gave me this: "テストストトト", which is still not quite right but I assume it is getting closer to the core of the problem.

So althought I'm not 100% sure but the problem is probably caused by the font.

Hope this helps.

See https://metacpan.org/pod/Win32::Unicode

#!/usr/bin/perl --
use utf8;
use strict;
use warnings;

my @kebabs = (
  "\x{45B}\x{435}\x{432}\x{430}\x{43F}.txt",               ## ћевап.txt
  "ra\x{17E}nji\x{107}.txt",                               ## ražnjić.txt
  "\x{107}evap.txt",                                       ## ćevap.txt
  "\x{43A}\x{435}\x{431}\x{430}\x{43F}\x{447}\x{435}.txt", ## кебапче.txt
  "kebab.txt",
);

{
    use Win32::Unicode qw/ -native /;
    printW "I \x{2665} Perl"; # unicode console out
    mkpathW 'meat';
    chdirW 'meat';
    for my $kebab ( @kebabs ){
        printW "kebabing the $kebab\n";
        open my($fh), '>:raw', $kebab or dieW Fudge($kebab);
        print $fh $kebab              or dieW Fudge($kebab);
        close $fh                     or dieW Fudge($kebab);
    }
}

sub Fudge {
    use Errno();
    join qq/\n/,
      "Error @_",
      map { "  $_" } int( $! ) . q/ / . $!,
      int( $^E ) . q/ / . $^E,
      grep( { $!{$_} } keys %! ),
      q/ /;
}