Objective-C: How to get the Unicode code point of a character

devze.com (https://www.devze.com), 2023-02-05 03:11. Source: web
I want to get the Unicode code point for a given Unicode character in Objective-C. The NSString documentation says it uses UTF-16 encoding internally, and states:

The NSString class has two primitive methods—length and characterAtIndex:—that provide the basis for all other methods in its interface. The length method returns the total number of Unicode characters in the string. characterAtIndex: gives access to each character in the string by index, with index values starting at 0.

That seems to assume that characterAtIndex: is Unicode-aware. However, it returns a unichar, which is a 16-bit unsigned integer type.

- (unichar)characterAtIndex:(NSUInteger)index

The questions are:

  • Q1: How does it represent Unicode code points above U+FFFF?

  • Q2: If Q1 makes sense, is there a method to get the Unicode code point for a given Unicode character in Objective-C?

Thx.


The short answer to Q1 (how code points above U+FFFF are represented) is: you need to be UTF-16 aware and correctly handle surrogate code points. The information and links below should give you pointers and example code for doing this.

The NSString documentation is correct. However, although you said that NSString uses UTF-16 encoding internally, it is more accurate to say that the public, abstract interface of NSString is UTF-16 based. The difference is that the internal representation of a string remains a private implementation detail, while public methods such as characterAtIndex: and length always work in terms of UTF-16 code units.

The reason is that this tends to strike the best balance between older ASCII-centric strings and Unicode-aware strings, largely because Unicode is a strict superset of ASCII (ASCII uses 7 bits, for 128 characters, which map to the first 128 Unicode code points).

To represent Unicode code points above U+FFFF, which exceed what a single 16-bit UTF-16 code unit can hold, UTF-16 uses special surrogate code points: a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF) forms a surrogate pair, and the two together encode one code point above U+FFFF. You can find details about this at:

  • Unicode UTF FAQ - What are surrogates?
  • Unicode UTF FAQ - What’s the algorithm to convert from UTF-16 to character codes?
  • Although the official Unicode UTF FAQ - How do I write a UTF converter? now recommends the use of International Components for Unicode, it used to recommend some code officially sanctioned and maintained by Unicode. Although no longer directly available from Unicode.org, you can still find copies of the "no longer official" example code in various open-source projects: ConvertUTF.c and ConvertUTF.h. If you need to roll your own, I'd strongly recommend examining this code first, as it is well tested.
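As a rough illustration of the surrogate arithmetic described in that FAQ, here is a minimal sketch in plain C (which also compiles as Objective-C). The function names are made up for this example, and uint16_t is assumed to stand in for the unichar values you would read with characterAtIndex:. It is not the ConvertUTF.c code, just the same formula written out.

```c
#include <assert.h>
#include <stdint.h>

/* True if `u` is a high (leading) surrogate, U+D800..U+DBFF. */
static int is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* True if `u` is a low (trailing) surrogate, U+DC00..U+DFFF. */
static int is_low_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a valid surrogate pair into one code point in U+10000..U+10FFFF.
   The caller must pass a high surrogate followed by a low surrogate. */
static uint32_t code_point_from_pair(uint16_t hi, uint16_t lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    /* Each surrogate carries 10 payload bits; the pair is offset by 0x10000. */
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10) + ((uint32_t)lo - 0xDC00u);
}
```

For example, the pair 0xD83D 0xDE00 combines to U+1F600. On Apple platforms, Core Foundation's CFString.h provides comparable inline helpers (CFStringIsSurrogateHighCharacter, CFStringIsSurrogateLowCharacter, CFStringGetLongCharacterForSurrogatePair), which are probably preferable to hand-rolled versions in real code.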


From the documentation of length:

The number returned includes the individual characters of composed character sequences, so you cannot use this method to determine if a string will be visible when printed or how long it will appear.

From this, I would infer that any character above U+FFFF is counted as two characters (that is, two UTF-16 code units) and is encoded as a surrogate pair (see the relevant entry at http://unicode.org/glossary/).
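To make the difference between code units and code points concrete, here is a small sketch in plain C. It assumes you have already copied the string's UTF-16 code units into a uint16_t buffer; the function name is hypothetical, and counting an unpaired surrogate as one item is simply a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count code points in a buffer of UTF-16 code units.
   A high surrogate immediately followed by a low surrogate counts once;
   any other code unit, including an unpaired surrogate, counts once by itself. */
static size_t utf16_code_point_count(const uint16_t *units, size_t length)
{
    size_t count = 0;
    size_t i = 0;

    assert(units != NULL || length == 0);
    while (i < length) {
        int is_pair = units[i] >= 0xD800 && units[i] <= 0xDBFF &&
                      i + 1 < length &&
                      units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
        i += is_pair ? 2 : 1;   /* skip both halves of a pair together */
        count++;
    }
    return count;
}
```

For the code units 0x0041 0xD83D 0xDE00 0x0042 (an "A", U+1F600, and a "B"), the buffer holds 4 code units, which is what length reports, while this function returns 3.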

If you have a UTF-32 encoded string containing the character you wish to convert, you could create a new NSString with initWithBytesNoCopy:length:encoding:freeWhenDone: and inspect the result to see how the character is encoded in UTF-16. If you are going to be doing much heavy Unicode processing, though, your best bet is probably to get familiar with ICU (http://site.icu-project.org/).
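If you only want to see how a single UTF-32 value would come out in UTF-16 without going through NSString, the conversion is short enough to sketch by hand. This is plain C with a hypothetical function name, and returning 0 for invalid input is a convention chosen for this example, not something NSString or ICU does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
   Writes 1 or 2 code units to `out` and returns how many were written.
   Returns 0 for values that are not Unicode scalar values
   (surrogate code points, or anything above U+10FFFF). */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;              /* fits in a single code unit */
        return 1;
    }
    cp -= 0x10000;                          /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+1F600 encodes as the two code units 0xD83D 0xDE00, while U+0041 stays a single code unit.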
