ICU Unicode Normal vs Fullwidth

Developer https://www.devze.com 2022-12-18 17:21 Source: the web

I am somewhat new to Unicode and Unicode strings. I'm trying to determine the difference between a "fullwidth" symbol and a normal one.

Take these two for example:

Normal: http://www.fileformat.info/info/unicode/char/20a9/index.htm

Fullwidth: http://www.fileformat.info/info/unicode/char/ffe6/index.htm

I notice that the fullwidth is defined as U+20A9 and coincidentally 20A9 is the normal one. So what is the value of U?

When using libraries like ICU, is there a way to specify that they always return the normal form rather than the fullwidth one?

Thanks,

U+number is a notational convention for a Unicode code point. There is no 'value' of U.

U+0020, for example, is a space. The value in memory is 32 decimal, 20 hex.
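To make that concrete, here is a minimal sketch in Python (chosen only for illustration; the question itself is about ICU). It shows that the code point is just an integer and that "U+" is purely notation.

```python
# "U+0020" is notation only; the code point itself is a plain integer.
space = "\u0020"             # U+0020 SPACE

code_point = ord(space)      # integer value of the code point
print(code_point)            # 32
print(hex(code_point))       # 0x20
print(chr(0x20) == space)    # True: the integer maps back to the same character
```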

Full width characters are a whole other story.

Back in the days of the 3270, Hanzi took up two positions in memory in the display. So they also took up two columns on the screen. To make things line up neatly, IBM defined a set of 'full-width' (better would have been 'double-width') letters and numbers.

If some ICU API is delivering full-width characters, you can use the Normalizer with a compatibility form (NFKC or NFKD) to get rid of them. You might also file a ticket in their ticket system, since this seems odd.
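As a rough illustration of what such a normalization does, here is a sketch using Python's standard unicodedata module as a stand-in for ICU. The helper name to_normal_width is hypothetical. In ICU itself, the corresponding step would presumably go through its NFKC normalizer (for example a Normalizer2 NFKC instance), but that call is not shown here.

```python
import unicodedata

def to_normal_width(text: str) -> str:
    """Fold compatibility variants, including fullwidth forms, using NFKC."""
    return unicodedata.normalize("NFKC", text)

# FULLWIDTH WON SIGN (U+FFE6) folds to WON SIGN (U+20A9).
print(to_normal_width("\uFFE6") == "\u20A9")   # True

# Fullwidth letters and digits fold to their ASCII counterparts.
fullwidth_abc123 = "\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13"
print(to_normal_width(fullwidth_abc123))        # ABC123
```

Note that NFKC is broader than a width conversion: it also folds other compatibility characters such as ligatures and superscripts, and it maps halfwidth katakana to their ordinary (wide) forms, so it may change more than you intend.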

The 'U' in "U+20A9" just denotes that "20A9" is a Unicode code point, the value of the Won sign in the Unicode codespace. It's a notation used in the Unicode Standard. The "U+" is followed by a hexadecimal number of at least 4 digits, such as "U+1234" or "U+10FFFD".
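The notation is easy to produce and to read back. The following is a small sketch in Python; the helper names format_code_point and parse_code_point are made up for this example and are not part of any library.

```python
def format_code_point(ch: str) -> str:
    """Render a single character in U+XXXX notation (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"

def parse_code_point(notation: str) -> str:
    """Turn a string such as 'U+20A9' back into the character it names."""
    if not notation.upper().startswith("U+"):
        raise ValueError(f"not in U+ notation: {notation!r}")
    return chr(int(notation[2:], 16))

print(format_code_point("\u20A9"))          # U+20A9
print(format_code_point("\U0010FFFD"))      # U+10FFFD
print(parse_code_point("U+20A9") == "\u20A9")  # True
```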

U+20A9 (₩) is the WON SIGN
U+FFE6 (￦) is the FULLWIDTH WON SIGN
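The relationship between the two characters can be inspected from the Unicode Character Database. This sketch uses Python's standard unicodedata module rather than ICU; the describe helper is hypothetical.

```python
import unicodedata

def describe(ch: str) -> tuple:
    """Return (notation, name, East Asian width, decomposition) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),   # 'F' = fullwidth, 'H' = halfwidth
        unicodedata.decomposition(ch),      # empty string if there is none
    )

for ch in ("\u20A9", "\uFFE6"):
    print(describe(ch))
# ('U+20A9', 'WON SIGN', 'H', '')
# ('U+FFE6', 'FULLWIDTH WON SIGN', 'F', '<wide> 20A9')
```

The "<wide> 20A9" decomposition is what the character page for U+FFE6 is showing: the fullwidth sign is a compatibility variant of U+20A9, not the same code point.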

This is a legacy of older character encodings. The "width" affected layout. The Unicode spec says:

Compatibility variants are a subset of compatibility characters, and have the further characteristic that they represent variants of existing, ordinary, Unicode characters. For example, compatibility variants might represent various presentation or styled forms of basic letters: superscript or subscript forms, variant glyph shapes, or vertical presentation forms. They also include halfwidth or fullwidth characters from East Asian character encoding standards, Arabic contextual form glyphs from pre-existing Arabic code pages, Arabic ligatures and ligatures from other scripts, and so on. Compatibility variants also include CJK compatibility ideographs, many of which are minor glyph variants of an encoded unified CJK ideograph.

Including these forms in Unicode allows the conversion of text from (and to) the older encodings without loss of meaning.
