UTF8 Encoding problem - With good examples

I have the following character encoding issue: somehow I have managed to save data with different character encodings into my database (UTF8). The code and outputs below show two sample strings and how they are output. One of them needs to be converted to UTF8 and the other already is UTF8.

How should I go about checking whether I need to encode a string or not? I need each string to be output correctly, so how do I check whether it is already UTF8 or whether it needs to be converted?

I am using PHP 5.2 and MySQL with MyISAM tables:

CREATE TABLE IF NOT EXISTS `entities` (
  ....
  `title` varchar(255) NOT NULL
  ....
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'UTF8 Encode : ', utf8_encode($text)."<br />";
echo 'UTF8 Decode : ', utf8_decode($text)."<br />";
echo 'TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//TRANSLIT", $text)."<br />";
echo 'IGNORE TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//IGNORE//TRANSLIT", $text)."<br />";
echo 'IGNORE   : ', iconv("ISO-8859-1", "UTF-8//IGNORE", $text)."<br />";
echo 'Plain    : ', iconv("ISO-8859-1", "UTF-8", $text)."<br />";
?>

Output 1:

Original : France Télécom
UTF8 Encode : France TÃ©lÃ©com
UTF8 Decode : France T�l�com
TRANSLIT : France TÃ©lÃ©com
IGNORE TRANSLIT : France TÃ©lÃ©com
IGNORE : France TÃ©lÃ©com
Plain : France TÃ©lÃ©com

Output 2:

Original : Cond� Nast Publications
UTF8 Encode : Condé Nast Publications
UTF8 Decode : Cond?ast Publications
TRANSLIT : Condé Nast Publications
IGNORE TRANSLIT : Condé Nast Publications
IGNORE : Condé Nast Publications
Plain : Condé Nast Publications

Thanks for your time on this one. Character encoding and I don't get on very well!

UPDATE:

echo strlen($string)."|".strlen(utf8_encode($string))."|";
echo (strlen($string)!==strlen(utf8_encode($string))) ? $string : utf8_encode($string);
echo "<br />";
echo strlen($string)."|".strlen(utf8_decode($string))."|";
echo (strlen($string)!==strlen(utf8_decode($string))) ? $string : utf8_decode($string);
echo "<br />";

23|24|Cond� Nast Publications
23|21|Cond� Nast Publications

16|20|France Télécom
16|14|France Télécom

This may be a job for the mb_detect_encoding() function.

In my limited experience, it's not 100% reliable when used as a generic "encoding sniffer": it checks for the presence of certain characters and byte values to make an educated guess. In this narrow case, though (it only needs to distinguish between UTF-8 and ISO-8859-1), it should work.

<?php
$text = $entity['Entity']['title'];

echo 'Original : ', $text."<br />";
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1");

echo 'Detected encoding '.$enc."<br />";

echo 'Fixed result: '.iconv($enc, "UTF-8", $text)."<br />";

?>

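You may get incorrect results for strings that do not contain special characters, but that is not a problem: pure ASCII text is byte-for-byte identical in UTF-8 and ISO-8859-1, so the conversion leaves it unchanged.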
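A stricter variant, as a minimal sketch: instead of guessing with mb_detect_encoding(), it validates with mb_check_encoding(), which only answers whether the bytes are well-formed UTF-8. The helper name to_utf8_if_needed() is made up for illustration, and the fallback to ISO-8859-1 is an assumption that only holds if those two encodings are the only ones in the data.

```php
<?php
// Hypothetical helper: assumes $text is either UTF-8 or ISO-8859-1.
function to_utf8_if_needed($text)
{
    // Well-formed UTF-8 is returned untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// Bytes are written as hex escapes so the file's own encoding does not matter.
echo bin2hex(to_utf8_if_needed("Cond\xE9 Nast")), "\n";        // Latin1 input
echo bin2hex(to_utf8_if_needed("T\xC3\xA9l\xC3\xA9com")), "\n"; // UTF-8 input
```

A Latin1 string whose bytes happen to form valid UTF-8 sequences would be misread as UTF-8, so this is still a heuristic, just a deterministic one.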

I made a function that addresses all these issues. It's called Encoding::toUTF8().

<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'Encoding::toUTF8 : ', Encoding::toUTF8($text)."<br />";
?>

Output:

Original : France Télécom
Encoding::toUTF8 : France Télécom

Original : Cond� Nast Publications
Encoding::toUTF8 : Condé Nast Publications

You don't need to know what the encoding of your strings is, as long as you know it is either Latin1 (ISO 8859-1), Windows-1252 or UTF8. The string can have a mix of them too.

Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

Usage:

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip

I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
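To see where strings like the garbled inputs above come from, here is a minimal sketch. It is not how Encoding::fixUTF8() is implemented (this thread does not show that); it only illustrates the underlying effect: each time UTF-8 bytes are mistaken for Latin1 and converted to UTF-8 again, every non-ASCII character picks up another layer of garbling. The helper names are made up.

```php
<?php
// Hypothetical helpers, for illustration only.

// Wrongly treat $s as ISO-8859-1 and convert it to UTF-8 (adds one layer).
function misencode_once($s)
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// Undo exactly one such layer. Assumes the input really was double-encoded
// and that every character in it fits in ISO-8859-1.
function undo_once($s)
{
    return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
}

$good    = "F\xC3\xA9d\xC3\xA9ration"; // "Fédération" as UTF-8 bytes
$garbled = misencode_once($good);      // displays as "FÃ©dÃ©ration"

echo bin2hex($garbled), "\n";
echo undo_once($garbled) === $good ? "restored\n" : "not restored\n";
```

Real-world garbling often passes through Windows-1252 rather than pure ISO-8859-1, which this sketch does not handle.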

Another way, maybe faster and less unreliable:

echo (strlen($str)!==strlen(utf8_decode($str)))
  ? $str                //is multibyte, leave as is
  : utf8_encode($str);  //encode

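It compares the byte length of the original string with that of the string after utf8_decode(). A string that contains multibyte characters has a strlen() that differs from that of its single-byte equivalent. Note that the check assumes utf8_decode() leaves the length of a non-UTF-8 string unchanged; in the asker's UPDATE output above, the Latin1 string goes from 23 to 21 bytes, so on that setup the test misfires and the string is left unconverted.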

For example:

strlen('Télécom')

should return 7 in Latin1 and 9 in UTF8.
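The byte counts can be checked directly. The sketch below spells the bytes out with hex escapes, so the result does not depend on the encoding the PHP file itself is saved in (the variable names are only for illustration):

```php
<?php
$latin1 = "T\xE9l\xE9com";         // "Télécom" in ISO-8859-1: é is one byte
$utf8   = "T\xC3\xA9l\xC3\xA9com"; // "Télécom" in UTF-8: é is two bytes

echo strlen($latin1), "\n";           // 7
echo strlen($utf8), "\n";             // 9
// mb_strlen() counts characters rather than bytes when told the encoding.
echo mb_strlen($utf8, 'UTF-8'), "\n"; // 7
```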

I made these two little functions that work well for UTF-8 and ISO-8859-1 detection / conversion...

function detect_encoding($string)
{
    //http://w3.org/International/questions/qa-forms-utf-8.html
    if (preg_match('%^(?: [\x09\x0A\x0D\x20-\x7E] | [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )*$%xs', $string))
        return 'UTF-8';

    //If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list.
    //if you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
    return mb_detect_encoding($string, array('UTF-8', 'ASCII', 'ISO-8859-1', 'JIS', 'EUC-JP', 'SJIS'));
}

function convert_encoding($string, $to_encoding, $from_encoding = '')
{
    if ($from_encoding == '')
        $from_encoding = detect_encoding($string);

    if ($from_encoding == $to_encoding)
        return $string;

    return mb_convert_encoding($string, $to_encoding, $from_encoding);
}

If your database contains strings in two different charsets, then instead of plaguing all your application code with charset detection / conversion, I would write a "one shot" script that reads all the records in your tables and updates their strings to the correct format (I would pick UTF-8 if I were you). This way your code will be cleaner and simpler to maintain.

Just loop over the records in every table of your database and convert the strings like this:

//if the 3rd param is not specified the "from encoding" is detected automatically
$newString = convert_encoding($oldString, 'UTF-8');
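As a rough sketch of the conversion step in such a one-shot script: the rows are assumed to be already fetched into an array keyed by primary key, and anything that is not valid UTF-8 is assumed to be ISO-8859-1. The function name rows_needing_update() is hypothetical. It returns only the rows that changed, so the script would issue an UPDATE only where one is needed.

```php
<?php
// Hypothetical helper for a one-off migration.
// $rows maps primary key => stored title. Assumes every value is either
// valid UTF-8 already or ISO-8859-1.
function rows_needing_update(array $rows)
{
    $changed = array();
    foreach ($rows as $id => $title) {
        if (!mb_check_encoding($title, 'UTF-8')) {
            $changed[$id] = mb_convert_encoding($title, 'UTF-8', 'ISO-8859-1');
        }
    }
    return $changed;
}

$rows = array(
    1 => "France T\xC3\xA9l\xC3\xA9com", // already UTF-8, left alone
    2 => "Cond\xE9 Nast Publications",   // ISO-8859-1, gets converted
);

foreach (rows_needing_update($rows) as $id => $fixed) {
    // In the real script, this is where the UPDATE for row $id would run.
    echo $id, ': ', bin2hex($fixed), "\n";
}
```

Running it a second time is harmless, because rows that were already converted validate as UTF-8 and are skipped. Taking a backup before the real run is still a good idea.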

I didn't try your samples here, but from past experience there is a quick fix for this. Right after connecting to the database, execute the following query BEFORE running any other queries:

SET NAMES UTF8;

This is SQL Standard compliant, and works well with other databases, like Firebird and PostgreSQL.

But remember, you need to make sure UTF-8 is declared in other places too for your application to work properly. Here is a quick checklist.

All files should be saved as UTF-8 (preferably without a BOM [Byte Order Mark]).
Your HTTP server should send UTF-8 in the encoding header. Use Firebug or Live HTTP Headers to inspect it.
If your server compresses and/or tokenizes the response, you may see the content reported as chunked or gzipped. This is not a problem as long as you save your files as UTF-8 and declare the encoding (see the next item).
Declare the encoding in the HTML header, using the proper meta tag.
Throughout the application (sockets, file system, databases...), do not forget to specify UTF-8 whenever you can. Doing this when opening a database connection, for example, saves you from having to encode/decode/debug all the time. Tackle the problem at the root.
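The header and meta tag items of the checklist could look something like this on the PHP side. This is a minimal sketch: render_utf8_page() is a made-up helper, and a real application would normally set these in its templates or server configuration instead.

```php
<?php
// Hypothetical helper: builds a minimal page that declares UTF-8 twice,
// once in the HTTP header and once in the HTML head.
function render_utf8_page($title, $body)
{
    // Headers can only be sent before any output has started.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=UTF-8');
    }
    // Escape for HTML, telling htmlspecialchars() the input is UTF-8.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $safeBody  = htmlspecialchars($body, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n<html>\n<head>\n"
         . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n"
         . "<title>" . $safeTitle . "</title>\n</head>\n"
         . "<body>" . $safeBody . "</body>\n</html>\n";
}

echo render_utf8_page("France T\xC3\xA9l\xC3\xA9com", "Cond\xC3\xA9 Nast Publications");
```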

What database do you use?
You need to know the charset of the original string before you convert it to UTF-8. If it's ISO-8859-1 (latin1), then utf8_encode() is the easiest way; otherwise you need to use either the iconv or the mbstring library to convert, and both of these need to know the charset of the input in order to convert properly.
Do you tell your database about the charset when you insert/select data?