
PHP construct a Unicode string?

Developer https://www.devze.com 2023-01-15 20:11 Source: Internet

Given the decimal or hex Unicode code point of a character that a CLI PHP script needs to output, how can PHP generate that character? The chr() function doesn't seem to produce the proper output. Here's my test script, using the section sign U+00A7 (A7 in hex, 167 in decimal, which should be represented as C2 A7 in UTF-8) as a test:

<?php
echo "Section sign: ".chr(167)."\n"; // Using CHR function
echo "Section sign: ".chr(0xA7)."\n";
echo "Section sign: ".pack("c", 0xA7)."\n"; // Using pack function?
echo "Section sign: §\n"; // Copy and paste of the symbol into source code

The output I get (via an SSH session to the server) is:

Section sign: ?
Section sign: ?
Section sign: ?
Section sign: §

So, that proves that the terminal font I'm using has the section sign in it, and that the SSH connection passes it along successfully, but chr() isn't producing it correctly when building it from the code number.

If all I've got is the code number and not a copy/paste option, what options do I have?
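
A side note on why the first three lines print "?": the following is only a small diagnostic sketch, not part of the original script. It assumes the terminal expects UTF-8 and uses bin2hex() to show the raw bytes. chr(0xA7) yields the single byte A7, which is not a valid UTF-8 sequence on its own, whereas the pasted symbol is stored in the source file as the two bytes C2 A7. The variable names are made up for illustration.

```php
<?php
$single = chr(0xA7);   // one raw byte, as produced by chr()
$utf8   = "\xC2\xA7";  // the two-byte UTF-8 encoding of U+00A7

echo bin2hex($single), "\n"; // a7
echo bin2hex($utf8), "\n";   // c2a7

// preg_match() with the /u modifier fails on input that is not valid UTF-8,
// so it can serve as a quick validity check without extra extensions.
$singleIsUtf8 = (preg_match('//u', $single) === 1);
$utf8IsUtf8   = (preg_match('//u', $utf8) === 1);

var_dump($singleIsUtf8); // bool(false)
var_dump($utf8IsUtf8);   // bool(true)
```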


Assuming you have iconv, here's a simple way that doesn't involve implementing UTF-8 yourself:

function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
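
To check the result, it can help to look at the bytes rather than at the terminal. The sketch below repeats the function so that it runs standalone and assumes the iconv extension is loaded; the extra code points (euro sign, an emoji) are only illustrative test values.

```php
<?php
function unichr($i) {
    // pack('V') emits a 32-bit little-endian integer, which matches UCS-4LE.
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

echo "Section sign: " . unichr(0xA7) . "\n";

// Inspect the raw UTF-8 bytes for a 2-, 3- and 4-byte character.
echo bin2hex(unichr(0xA7)), "\n";    // c2a7
echo bin2hex(unichr(0x20AC)), "\n";  // e282ac
echo bin2hex(unichr(0x1F600)), "\n"; // f09f9880
```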


Apart from the mb_ functions and iconv, PHP has no knowledge of Unicode; its strings are plain byte sequences. Without those extensions, you'll have to UTF-8 encode the character yourself.

For that, Wikipedia has an excellent overview on how UTF-8 is structured. Here's a quick, dirty and untested function based on that article:

function codepointToUtf8($codepoint)
{
    if ($codepoint <= 0x7F) // U+0000-U+007F - 1 byte
        return chr($codepoint);
    if ($codepoint <= 0x7FF) // U+0080-U+07FF - 2 bytes
        return chr(0xC0 | ($codepoint >> 6)).chr(0x80 | ($codepoint & 0x3F));
    if ($codepoint <= 0xFFFF) // U+0800-U+FFFF - 3 bytes
        return chr(0xE0 | ($codepoint >> 12)).chr(0x80 | (($codepoint >> 6) & 0x3F)).chr(0x80 | ($codepoint & 0x3F));
    // U+010000-U+10FFFF - 4 bytes
    return chr(0xF0 | ($codepoint >> 18)).chr(0x80 | (($codepoint >> 12) & 0x3F)).chr(0x80 | (($codepoint >> 6) & 0x3F)).chr(0x80 | ($codepoint & 0x3F));
}


Don't forget that UTF-8 is a variable-length encoding.

§ is not among the first 128 (ASCII) characters, which are the only ones UTF-8 encodes in a single byte. In UTF-8, § is a multi-byte character: its 0xC2 lead byte marks the start of a two-byte sequence. This should work:

echo "Section sign: ".chr(0xC2).chr(0xA7)."\n"; 
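
For the curious, here is where C2 and A7 come from. This is a worked sketch for this one code point only, assuming the standard two-byte UTF-8 layout 110xxxxx 10xxxxxx; the variable names are made up for illustration.

```php
<?php
$codepoint = 0xA7; // binary 1010 0111

// Lead byte: marker bits 110 followed by the bits above the lowest six.
$lead = 0xC0 | ($codepoint >> 6);   // 0xC0 | 0x02 = 0xC2

// Continuation byte: marker bits 10 followed by the lowest six bits.
$cont = 0x80 | ($codepoint & 0x3F); // 0x80 | 0x27 = 0xA7

printf("%02X %02X\n", $lead, $cont);            // C2 A7
echo "Section sign: " . chr($lead) . chr($cont) . "\n";
```

The second byte happening to equal the code point itself is a coincidence of this particular character, not a general rule.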


chr

(PHP 4, PHP 5)

chr — Return a specific character

 Description

string chr ( int $ascii )
Returns a one-character string containing the character specified by ascii.

This function complements ord().

The important word there is "ascii" :) Try this one:

function uchr ($codes) {
    if (is_scalar($codes)) $codes= func_get_args();
    $str= '';
    foreach ($codes as $code) $str.= html_entity_decode('&#'.$code.';',ENT_NOQUOTES,'UTF-8');
    return $str;
}
echo "Section sign: ".uchr(167)."\n"; // Using the uchr function
echo "Section sign: ".uchr(0xA7)."\n";


I know I am reopening an old, solved issue, but since I stumbled onto this topic while searching for help, I thought I would share the solution I ended up with. The person who asked the question might be interested in refactoring their code accordingly.

Manually reprogramming the ASCII-to-Unicode conversion is like reinventing the wheel, not to mention the potential for errors and performance problems.

The best solution I found was to use:

  1. pack to build the raw bytes from the input data, using the format code that consumes the right amount of data, usually pack("H*", <input data>) to read from hexadecimal values
  2. mb_convert_encoding to convert those bytes to UTF-8, using mb_convert_encoding(<input string>, "UTF-8"). If the input string is not recognized properly, the third parameter of this function lets you specify the input encoding
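
The two steps above could be combined roughly as follows. This is a minimal sketch under a few assumptions: the mbstring extension is available, the code point is written out as eight hex digits so that the packed bytes form UTF-32BE, and the helper name codepointViaMb is hypothetical rather than anything from the original answer.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function codepointViaMb($codepoint)
{
    // Step 1: eight hex digits -> four raw bytes (the code point in UTF-32BE).
    $utf32 = pack('H*', sprintf('%08x', $codepoint));

    // Step 2: convert those bytes to UTF-8, naming the input encoding
    // explicitly through the third parameter.
    return mb_convert_encoding($utf32, 'UTF-8', 'UTF-32BE');
}

echo "Section sign: " . codepointViaMb(0xA7) . "\n";
echo bin2hex(codepointViaMb(0xA7)), "\n"; // c2a7
```

Passing the input encoding explicitly avoids relying on whatever internal encoding mbstring is configured with.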