I have a file with the following content, in which some characters inside the string literal are written as UTF-8 hex escape sequences:
<root>
<element type=\"1\">\"Hello W\xC3\x96rld\"</element>
</root>
I want to read the file, decode the UTF-8 hex escape sequences into the actual Unicode characters they represent, and write the result to a new file. Given the above content, the new file should look like the following when opened in a text editor with UTF-8 encoding:
<root>
<element type=\"1\">\"Hello WÖrld\"</element>
</root>
Notice that the double quotes are still escaped and that the UTF-8 hex-encoded \xC3\x96 has now become Ö (U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS).
I have got code that is partially working, as follows:
#! /usr/bin/perl -w
use strict;
use Encode::Escape;
while (<>)
{
# STDOUT is redirected to a new file.
print decode 'unicode-escape', $_;
}
The problem, however, is that all the other escape sequences, such as \", are decoded as well by decode 'unicode-escape', $_. So in the end, I get the following:
<root>
<element type="1">"Hello WÖrld"</element>
</root>
I have tried reading the file in UTF-8 encoding and/or using Unicode::Escape::unescape, such as:
open(my $UNICODESFILE, "<:encoding(UTF-8)", shift(@ARGV));
Unicode::Escape::unescape($line);
but neither of them decodes the \xhh escape sequences.
Basically, all I want is the behavior of decode 'unicode-escape', $_, except that it should decode only the \xhh escape sequences and leave all other escape sequences alone.
Is this possible? Is decode 'unicode-escape', $_ appropriate for this case, or is there another way? Thanks!
Find runs of consecutive \xNN escape sequences and decode only those, I guess:
s{((?:\\x[0-9A-Fa-f]{2})+)}{decode 'unicode-escape', $1}ge
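If Encode::Escape is not available, or you would rather be explicit about which escapes get touched, the same idea can probably be done with only the core Encode module. This is a rough sketch, not a tested drop-in: the helper names decode_hex_run and unescape_hex are made up for this example, and it assumes every run of \xNN escapes spells out valid UTF-8 bytes (with the default settings, Encode substitutes U+FFFD for anything malformed).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn one run such as '\xC3\x96' into the
# character(s) that those UTF-8 bytes encode.
sub decode_hex_run {
    my ($run) = @_;
    (my $hex = $run) =~ s/\\x//g;              # '\xC3\x96' -> 'C396'
    return decode('UTF-8', pack('H*', $hex));  # raw bytes -> characters
}

# Hypothetical helper: decode only runs of \xNN escapes and leave every
# other escape sequence (\", \n, ...) exactly as it appears in the input.
sub unescape_hex {
    my ($text) = @_;
    $text =~ s{((?:\\x[0-9A-Fa-f]{2})+)}{decode_hex_run($1)}ge;
    return $text;
}

# Single quotes keep the backslashes literal, as they would be in the file.
my $sample = '<element type=\"1\">\"Hello W\xC3\x96rld\"</element>';
my $result = unescape_hex($sample);

binmode(STDOUT, ':encoding(UTF-8)');  # write decoded characters as UTF-8
print "$result\n";
```

To process a whole file, the sample lines can be swapped for the original while (<>) loop, calling print unescape_hex($_) on each line. Matching the escapes as a run matters because \xC3 and \x96 are two bytes of a single character, so they have to be decoded together.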