I'm having a lot of trouble with unicode (UTF-16) values and PHP/XML. I want to read a set of unicode values from XML and output the correct glyphs to the browser. I've tried with UTF-8 and I get the same problem.
This is a simple working example I used for my first test:
$text = "\x00\x41";
$text = mb_convert_encoding($text, "ASCII", "UTF-16");
echo $text;
Output of above code:
A
However, when I try to get the values from XML things stop working.
XML:
<glyphs>
<code>0041</code>
<code>0042</code>
<code>0043</code>
<code>0044</code>
<code>0045</code>
<code>0046</code>
</glyphs>
In PHP I read each value from the above XML, split it into pairs, and format them as escape sequences, e.g. \x00\x41, etc.
PHP:
// load xml
$xml = simplexml_load_file('encoding.xml');
if ($xml) {
    // get families
    foreach($xml->children() as $item) {
        $pairs = str_split($item, 2);
        $hex = "\x" . $pairs[0] . "\x" . $pairs[1];
        // check value...
        echo $hex . '<br/>';
        $text = mb_convert_encoding($hex, "ASCII", "UTF-16");
        echo $text;
    }
}
else {
    return 'The input is malformed.';
}
Output in browser:
\x00\x41
????
\x00\x42
????
\x00\x43
????
\x00\x44
????
\x00\x45
????
\x00\x46
????
Question marks should be A, B, C, D, E, F.
What am I doing wrong?
Thanks.
For each test character, your test program writes a few ASCII characters, followed by '<br/>' in ASCII, followed by two bytes of UTF-16. This won't work. A file should use only one character encoding at a time.
First, rewrite your script to convert all the output to UTF-16 (or whatever).
Second, it appears that your browser is interpreting your mixed-encoding file as something other than UTF-16, perhaps ISO 8859-1, or Windows Latin 1 which are common defaults. It's unlikely that a browser would interpret a file as UTF-16 unless explicitly directed to (in the HTTP header or content type meta tag). If you left content type unspecified (check if your web server is sending a default) then some browsers attempt to guess the encoding. I doubt any would guess your mixed file was UTF-16.
Don't expect anything to work as you want until you've verified that the browser is interpreting the file according to the content type you specify.
Finally, I recommend using iconv instead of mb_convert_encoding. iconv is better maintained and has a wider set of supported encodings.
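As a rough illustration of the iconv suggestion, the conversion might look like the sketch below. It assumes the input bytes are big-endian UTF-16 without a byte order mark and that the page will be served as UTF-8; the variable names are made up for the example.

```php
<?php
// Two UTF-16BE code units: U+0041 and U+0042 (assumed sample input).
$utf16 = "\x00\x41\x00\x42";

// iconv(from, to, string) returns false if the conversion fails.
$utf8 = iconv('UTF-16BE', 'UTF-8', $utf16);
if ($utf8 === false) {
    throw new RuntimeException('Conversion from UTF-16BE failed.');
}

echo $utf8; // AB
```

Naming the byte order explicitly (UTF-16BE rather than plain UTF-16) avoids relying on how a given library treats input that has no byte order mark.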
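"\x00" is hex escape notation inside a double-quoted string literal, which PHP resolves when it parses the script.
When you build the string at run time with "\x" . $pairs[0], PHP first parses the literal "\x" on its own. With no hex digits after it, that is not an escape sequence, so it stays a literal backslash followed by x, and the "00" is only concatenated afterward. The result is the four characters \x00, not a single zero byte, which is not what you expect.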
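A small sketch that makes the difference visible by comparing string lengths (the variable names are only for illustration):

```php
<?php
// Escapes written directly in a double-quoted literal are resolved
// by the parser, so this string holds exactly two bytes.
$literal = "\x00\x41";

// "\x" on its own is not a valid escape, so it stays as backslash + x.
// Concatenating the digits afterward yields eight ordinary characters.
$built = "\x" . "00" . "\x" . "41";

var_dump(strlen($literal)); // int(2)
var_dump(strlen($built));   // int(8)
```

Those eight characters are what the original script passes to mb_convert_encoding, which would explain the question marks in the output.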
Maybe this question can help, although it is in Java -> Java: Convert String "\uFFFF" into char
EDIT: just following up on the comment. Placing the literal "\x41" in your xml won't help either, because then you are reading a string of 4 characters.
So your problem can be restated as: how to convert a string representation of numerical values in hex to a single character, using UTF-16. It is the same problem as in the question that I linked above, except that you want to do it in php, not Java.
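One possible way to do that conversion in PHP is sketched below. It assumes each <code> element holds one big-endian UTF-16 code unit written as four hex digits, and that the page is output as UTF-8. The helper name hexToUtf8 is hypothetical, and the XML is inlined here instead of being read from encoding.xml.

```php
<?php
// Hypothetical helper: turn a hex string such as "0041" into UTF-8 text.
function hexToUtf8(string $hex): string
{
    // Reject input that hex2bin() cannot decode (odd length or non-hex).
    if (strlen($hex) % 2 !== 0 || !ctype_xdigit($hex)) {
        throw new InvalidArgumentException("Not a valid hex string: $hex");
    }
    // hex2bin("0041") gives the two raw bytes 0x00 0x41.
    $bytes = hex2bin($hex);
    // Assumption: those bytes are big-endian UTF-16.
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16BE');
}

$xml = simplexml_load_string(
    '<glyphs><code>0041</code><code>0042</code></glyphs>'
);

$out = '';
foreach ($xml->children() as $item) {
    // Cast the SimpleXMLElement to a string before processing it.
    $out .= hexToUtf8(trim((string) $item));
}

echo $out; // AB
```

Characters outside the Basic Multilingual Plane would need a surrogate pair, i.e. eight hex digits in one element, which this sketch would also accept since it only requires an even number of hex digits.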
Are you setting the output correctly in your header?
header('Content-Type: text/html; charset=utf-8');
...and also in the HTML head?
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
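Putting the two declarations together, a minimal sketch of a page that uses a single encoding throughout could look like this. The helper name buildPage is hypothetical, and the sample bytes are assumed to be big-endian UTF-16.

```php
<?php
// Hypothetical helper: wrap text that is already UTF-8 in a minimal page.
function buildPage(string $utf8Text): string
{
    return '<!DOCTYPE html><html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
        . '</head><body>'
        . htmlspecialchars($utf8Text, ENT_QUOTES, 'UTF-8')
        . '</body></html>';
}

// Decode the UTF-16 bytes first, so that everything echoed is UTF-8.
$glyphs = mb_convert_encoding("\x00\x41\x00\x42", 'UTF-8', 'UTF-16BE');

// The header must be sent before any output.
header('Content-Type: text/html; charset=utf-8');
echo buildPage($glyphs);
```

The point of the design is that the markup, the decoded glyphs, and the declared charset all agree, so the browser has nothing to guess.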