How to cope with different encodings of xls files in PHP?

I'm developing a PHP script that parses data from xls files, using the phpexcelreader library. It mostly works, but I've stumbled upon a strange problem: some files are parsed incorrectly. It looks like xls files may use different character encodings internally. At least, when I pipe the output of my script through iconv -f cp1251 -t utf8, the strings come out correct.

Phpexcelreader has an option for specifying the output encoding, but it seems to lack any way to detect the input encoding. Any ideas?
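To make the symptom concrete, here is a minimal sketch of what the iconv pipe does, written in PHP. It assumes the iconv extension is available; the sample bytes are hand-written for illustration and do not come from a real xls file.

```php
<?php
// Bytes of the Russian word "Привет" as stored in Windows-1251.
$raw = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Misreading the bytes as Windows-1252 produces the garbled text.
$garbled = iconv('CP1252', 'UTF-8', $raw);

// Reading them as Windows-1251 recovers the original word.
$fixed = iconv('CP1251', 'UTF-8', $raw);

echo $garbled, PHP_EOL;
echo $fixed, PHP_EOL;
```

The same bytes are valid in both code pages, which is why nothing fails loudly: the reader simply has to be told which one the file uses.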

The _defaultEncoding property of the workbook object can be set to contain the charset used by the Excel file, and this is then used to handle conversion to UTF-16LE by the reader, but it makes no effort to identify the internal charset itself.

Define

define('SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE', 0x0042);

among the other SPREADSHEET_EXCEL_READER_TYPE definitions, and then modify the switch statement starting at line 464 to include a case for SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE. The logic for this case needs to be something like:

$length = $this->_GetInt2d($this->_data, $pos + 2);
$recordData = substr($this->_data, $pos + 4, $length);

// move stream pointer to next record
$pos += 4 + $length;

// offset: 0; size: 2; code page identifier
$codepage = $this->_GetInt2d($recordData, 0);
$codepage = $this->_CodePageNumberToName($codepage);

Recreate the _GetInt2d method (that seems to have been stripped from the code at some point) as

function _GetInt2d($data, $pos)
{
    return ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
}
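Outside the reader class, the record handling above can be tried on its own. The following is only a sketch: readUInt16LE and findCodepage are hypothetical names, and the input is assumed to be a workbook stream that has already been extracted from the file, laid out as records of a 2-byte type, a 2-byte length and then the data, as the snippet above implies.

```php
<?php
const BIFF_TYPE_CODEPAGE = 0x0042; // same value as the define above

// Read an unsigned 16-bit little-endian integer at $pos.
function readUInt16LE(string $data, int $pos): int
{
    return ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
}

// Walk the records and return the CODEPAGE value, or null if none is present.
function findCodepage(string $stream): ?int
{
    $pos = 0;
    $end = strlen($stream);
    while ($pos + 4 <= $end) {
        $type   = readUInt16LE($stream, $pos);
        $length = readUInt16LE($stream, $pos + 2);
        if ($type === BIFF_TYPE_CODEPAGE && $length >= 2 && $pos + 4 + $length <= $end) {
            return readUInt16LE($stream, $pos + 4);
        }
        // Skip the 4-byte header plus the record data.
        $pos += 4 + $length;
    }
    return null;
}

// Hand-built stream: a dummy record (type 0x0001, 1 data byte), then CODEPAGE = 1251.
$stream = "\x01\x00\x01\x00\xFF" . "\x42\x00\x02\x00\xE3\x04";
var_dump(findCodepage($stream));
```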

and create a _CodePageNumberToName method to return the codepage name from its numeric value:

function _CodePageNumberToName($codePage = '1252')
{
    switch ($codePage) {
        case 367:   return 'ASCII';     break;  //  ASCII
        case 437:   return 'CP437';     break;  //  OEM US
        case 720:   throw new Exception('Code page 720 not supported.');
                                        break;  //  OEM Arabic
        case 737:   return 'CP737';     break;  //  OEM Greek
        case 775:   return 'CP775';     break;  //  OEM Baltic
        case 850:   return 'CP850';     break;  //  OEM Latin I
        case 852:   return 'CP852';     break;  //  OEM Latin II (Central European)
        case 855:   return 'CP855';     break;  //  OEM Cyrillic
        case 857:   return 'CP857';     break;  //  OEM Turkish
        case 858:   return 'CP858';     break;  //  OEM Multilingual Latin I with Euro
        case 860:   return 'CP860';     break;  //  OEM Portuguese
        case 861:   return 'CP861';     break;  //  OEM Icelandic
        case 862:   return 'CP862';     break;  //  OEM Hebrew
        case 863:   return 'CP863';     break;  //  OEM Canadian (French)
        case 864:   return 'CP864';     break;  //  OEM Arabic
        case 865:   return 'CP865';     break;  //  OEM Nordic
        case 866:   return 'CP866';     break;  //  OEM Cyrillic (Russian)
        case 869:   return 'CP869';     break;  //  OEM Greek (Modern)
        case 874:   return 'CP874';     break;  //  ANSI Thai
        case 932:   return 'CP932';     break;  //  ANSI Japanese Shift-JIS
        case 936:   return 'CP936';     break;  //  ANSI Chinese Simplified GBK
        case 949:   return 'CP949';     break;  //  ANSI Korean (Wansung)
        case 950:   return 'CP950';     break;  //  ANSI Chinese Traditional BIG5
        case 1200:  return 'UTF-16LE';  break;  //  UTF-16 (BIFF8)
        case 1250:  return 'CP1250';    break;  //  ANSI Latin II (Central European)
        case 1251:  return 'CP1251';    break;  //  ANSI Cyrillic
        case 0:                                 //  CodePage is not always correctly set when the xls file was saved by Apple's Numbers program
        case 1252:  return 'CP1252';    break;  //  ANSI Latin I (BIFF4-BIFF7)
        case 1253:  return 'CP1253';    break;  //  ANSI Greek
        case 1254:  return 'CP1254';    break;  //  ANSI Turkish
        case 1255:  return 'CP1255';    break;  //  ANSI Hebrew
        case 1256:  return 'CP1256';    break;  //  ANSI Arabic
        case 1257:  return 'CP1257';    break;  //  ANSI Baltic
        case 1258:  return 'CP1258';    break;  //  ANSI Vietnamese
        case 1361:  return 'CP1361';    break;  //  ANSI Korean (Johab)
        case 10000: return 'MAC';       break;  //  Apple Roman
        case 32768: return 'MAC';       break;  //  Apple Roman
        case 32769: throw new Exception('Code page 32769 not supported.');
                                        break;  //  ANSI Latin I (BIFF2-BIFF3)
        case 65001: return 'UTF-8';     break;  //  Unicode (UTF-8)
        default:    throw new Exception('Unknown code page: ' . $codePage);
    }
}

Then store the returned value in $_defaultEncoding.
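As a rough illustration of how a code page number read from the file could drive the conversion, here is a standalone sketch. codepageName and cellToUtf8 are hypothetical helpers, the table is deliberately a small subset of the one above, and iconv is assumed to be available.

```php
<?php
// Hypothetical subset of the number-to-name table, for illustration only.
function codepageName(int $codepage): string
{
    $names = [866 => 'CP866', 1251 => 'CP1251', 1252 => 'CP1252'];
    if (!isset($names[$codepage])) {
        throw new InvalidArgumentException("Unsupported code page: $codepage");
    }
    return $names[$codepage];
}

// Convert raw cell bytes to UTF-8 using the workbook's declared code page.
function cellToUtf8(string $raw, int $codepage): string
{
    $converted = iconv(codepageName($codepage), 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException('Conversion failed');
    }
    return $converted;
}

// The same Cyrillic letter is a different byte in CP1251 and CP866.
echo cellToUtf8("\xCF", 1251), PHP_EOL;
echo cellToUtf8("\x8F", 866), PHP_EOL;
```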

Alternatively, switch to an Excel reader that can handle the codepage correctly in the first place.

My 2 cents:

I just replaced the _encodeUTF16 method with this one:

function _encodeUTF16($string, $check = false) {
    if ($check) {
        // detect the source encoding instead of assuming UTF-16LE
        $from = api_detect_encoding($string);
        return api_convert_encoding($string, $this->_defaultEncoding, $from);
    }
    return api_convert_encoding($string, $this->_defaultEncoding, 'UTF-16LE');
}

and changed line 568 to:

$retstr = ($asciiEncoding) ? $this->_encodeUTF16($retstr, true) : $this->_encodeUTF16($retstr);

The functions api_detect_encoding and api_convert_encoding can be found in this library:

http://code.google.com/p/chamilo/source/browse/main/inc/lib/internationalization.lib.php?repo=classic
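If pulling in that library is not an option, a much cruder stand-in is to leave strings that are already valid UTF-8 alone and convert everything else from an assumed code page. This is only a sketch: toUtf8WithFallback is a hypothetical helper, it does not reproduce what api_detect_encoding does, and the heuristic can misfire because some single-byte text happens to form valid UTF-8.

```php
<?php
// Hypothetical helper, not part of phpexcelreader or Chamilo.
function toUtf8WithFallback(string $raw, string $fallbackEncoding): string
{
    // preg_match with the /u modifier returns false on invalid UTF-8 input.
    if (preg_match('//u', $raw) === 1) {
        return $raw;
    }
    $converted = iconv($fallbackEncoding, 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $fallbackEncoding");
    }
    return $converted;
}
```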

For the Persian language, I added one line after line 568 (line 336 in another version):

$retstr = ($asciiEncoding) ? $retstr : $this->_encodeUTF16($retstr);

$retstr=iconv("UTF-16LE","UTF-8", $retstr);

This code supports Persian, but you can't use English any more.
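One plausible reason English breaks: BIFF8 can store a string either as UTF-16LE or in a compressed form with one byte per character, and running the compressed form through a UTF-16LE conversion garbles it. The sketch below assumes $asciiEncoding in the reader marks the compressed case; biffStringToUtf8 is a hypothetical helper, not part of the library.

```php
<?php
// $compressed mirrors the role $asciiEncoding appears to play in the reader (assumption).
function biffStringToUtf8(string $raw, bool $compressed): string
{
    // Compressed BIFF8 strings keep only the low byte of each UTF-16 code unit,
    // which makes them byte-compatible with ISO-8859-1.
    $from = $compressed ? 'ISO-8859-1' : 'UTF-16LE';
    $converted = iconv($from, 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $from");
    }
    return $converted;
}

echo biffStringToUtf8("Hello", true), PHP_EOL;
echo biffStringToUtf8("\x33\x06\x44\x06\x27\x06\x45\x06", false), PHP_EOL;
```

Choosing the source encoding per string, rather than applying one conversion to everything, should let Persian and English text coexist.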
