The post is updated. Please kindly jump to the Solution part, if you've already read the posted question. Thanks!
Here's a minimal example that exhibits my problem:
The input data file for the test was saved as UTF-8 with Windows' built-in Notepad. It contains the following three tab-separated lines:
abacus	æbәkәs
abalone	æbәlәuni
abandon	әbændәn
The Perl script file was also saved as UTF-8 with Windows' built-in Notepad. It contains the following code:
#!perl -w
use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}";
print $out "$hash{abalone}";
print $out "$hash{abandon}";
In the output, the hash table seems to be okay:
$VAR1 = {
'abalone' => 'æbәlәuni
',
'abandon' => 'әbændәn',
'abacus' => 'æbәkәs
'
};
But it is actually not, because I only get two values instead of three:
æbәlәuni
әbændәn
Perl gives the following warning message:
Use of uninitialized value $hash{"abacus"} in string at C:\test2.pl line 11, <$in> line 3.
Where's the problem? Can someone kindly explain? Thanks.
The Solution
Many thanks to all of you :) The culprit has finally been found and the problem is fixable :) As @Sinan insightfully pointed out, I'm now 100% sure that the cause of the problem described above is the three-byte BOM (EF BB BF), which Notepad added to my data file when it was saved as UTF-8 and which Perl does not strip automatically. Many suggested that I use "<:utf8" and ">:utf8" to read and write the files, but these UTF-8 layers alone do not solve the problem, because the BOM is still read as part of the first key (see the note below). Instead they may cause some other problems.
To really solve the problem, all I actually need is to add one line of code to force Perl to ignore the BOM:
#!perl -w
use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";
seek $in,3,0; # force Perl to ignore the BOM!
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};
Now, the output is exactly what I expected:
$VAR1 = {
'abalone' => 'æbәlәuni
',
'abandon' => 'әbændәn',
'abacus' => 'æbәkәs
'
};
æbәkәs
æbәlәuni
әbændәn
Please note that the script is saved as UTF-8, and the code does not have to include any UTF-8 layers because the input file is already saved as UTF-8 and its bytes are written to the output file unchanged.
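A hard-coded seek to offset 3 would also drop the first three bytes of a file that has no BOM (for example, one saved by an editor other than Notepad). A slightly more defensive variant, as a minimal sketch: the helper name open_skipping_bom and the temporary demo file are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open a file for reading and leave the handle positioned
# just past a UTF-8 BOM (bytes EF BB BF) if, and only if, the file has one.
sub open_skipping_bom {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $head = '';
    defined read($fh, $head, 3) or die "Cannot read $path: $!";
    # No BOM found: rewind so the first bytes of real data are not lost.
    seek $fh, 0, 0 unless $head eq "\xEF\xBB\xBF";
    return $fh;
}

# Demo data: the three keys from the question, written with a leading BOM.
# The values are ASCII placeholders, not the original transcriptions.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBF", "abacus\tfirst\n", "abalone\tsecond\n", "abandon\tthird\n";
close $tmp;

my $in = open_skipping_bom($path);
my %hash = map { chomp; split /\t/, $_, 2 } <$in>;
close $in;

print exists $hash{abacus} ? "abacus found\n" : "abacus missing\n";
```

Like the original fix, this works on raw bytes and does not decode the file, so it only handles the UTF-8 form of the BOM.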
Finally, thanks again to all of you. And thank you, @Sinan, for the insightful guidance. Without your help, I would have stayed in the dark for God knows how long.
Note: To clarify a little more, if I use:
open my $in,'<:utf8',"./hash_test.txt";
open my $out,'>:utf8',"./hash_result.txt";
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};
The output is this:
$VAR1 = {
'abalone' => "\x{e6}b\x{4d9}l\x{4d9}uni ",
'abandon' => "\x{4d9}b\x{e6}nd\x{4d9}n",
"\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s "
};
æbәlәuni
әbændәn
And the warning message:
Use of uninitialized value in print at C:\hash_test.pl line 13, <$in> line 3.
I find the warning message a little suspicious. It tells you that the $in filehandle is at line 3 when it should be at line 4 after having read the last line.
When I first tried your code, I saved the input file using GVim, which is configured on my system to save as UTF-8, and I did not see the problem. Now that I have tried it with Notepad, looking at the output file, I see:
"\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s "
where \x{feff} is the BOM.
In your Dumper output, there is a spurious blank before abacus (where you had not specified :utf8 for the output handle).
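Because the BOM is invisible in most editors and terminals, it can help to print hash keys with every non-ASCII or non-printable character escaped. A minimal sketch follows; the helper name show_hidden is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $str in which every character outside
# printable ASCII is written as an \x{...} escape, so invisible ones show up.
sub show_hidden {
    my ($str) = @_;
    $str =~ s/([^\x20-\x7E])/sprintf('\\x{%x}', ord $1)/ge;
    return $str;
}

# The first key starts with U+FEFF, as it does when a BOM-prefixed file is
# read through a :utf8 layer. Read without any layer, the same BOM would
# instead appear as the three bytes \x{ef}\x{bb}\x{bf}.
my %hash = ("\x{FEFF}abacus" => 1, "abandon" => 2);
print show_hidden($_), "\n" for keys %hash;
```

A key that prints as anything other than plain abacus explains why $hash{abacus} comes back undefined.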
As I had mentioned originally (lost to the umpteen edits on this post; thanks for the reminder, hobbs), specify '<:utf8' when you are opening the input file.
If you want to read/write UTF8 files, you should make sure that you are actually reading them in as UTF8.
#! /usr/bin/env perl
use Data::Dumper;
open my $in, '<:utf8', "hash_test.txt";
open my $out, '>:utf8', "hash_result.txt";
my %hash = map { chomp; split ' ', $_, 2 } <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}\n";
print $out "$hash{abalone}\n";
print $out "$hash{abandon}\n";
If you want it to be more robust, it is recommended to use :encoding(utf8) instead of :utf8 for reading a file.
open my $in, '<:encoding(utf8)', "hash_test.txt";
Read PerlIO for more information.
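Note that decoding alone leaves the BOM in the data as the character U+FEFF, so the first key would still be "\x{feff}abacus". One way to combine decoding with BOM removal is sketched below. The helper name read_tsv_hash and the temporary demo file are made up for illustration, and the sketch uses the strict :encoding(UTF-8) spelling rather than :encoding(utf8).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a tab-separated UTF-8 file into a list of
# key/value pairs, dropping a leading BOM (U+FEFF once decoded) if present.
sub read_tsv_hash {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    my @lines = <$in>;
    close $in;
    chomp @lines;
    $lines[0] =~ s/\A\x{FEFF}// if @lines;
    return map { split /\t/, $_, 2 } @lines;
}

# Demo file laid out the way Notepad saves UTF-8: BOM first, then the data.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':encoding(UTF-8)';
print {$tmp} "\x{FEFF}abacus\t\x{E6}b\x{4D9}k\x{4D9}s\n",
             "abalone\t\x{E6}b\x{4D9}l\x{4D9}uni\n",
             "abandon\t\x{4D9}b\x{E6}nd\x{4D9}n\n";
close $tmp;

my %hash = read_tsv_hash($path);

# Encode on output as well, so the decoded characters print without warnings.
binmode STDOUT, ':encoding(UTF-8)';
print "$_ => $hash{$_}\n" for sort keys %hash;
```

With the BOM removed after decoding, $hash{abacus} is defined and no seek is needed.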
I think your answer may be sitting right in front of you. The output from Data::Dumper which you posted is:
$VAR1 = {
'abalone' => 'æbәlәuni
',
'abandon' => 'әbændәn',
'abacus' => 'æbәkәs
'
};
Notice the character between the ' and abacus? You tried to access the third value via $hash{abacus}. This is incorrect because of that character before abacus in the Dumper() hash. You could try plugging it into a loop which should take care of it:
foreach my $k (keys %hash) {
print $out $hash{$k};
}
Use split /\s/ instead of split /\t/.
Works For Me. Are you sure your example matches your actual code and data?