String class based on graphemes?

Developer https://www.devze.com 2023-01-20 18:14 Source: web

I'm wondering why we don't have string classes that represent a string as a sequence of Unicode grapheme clusters instead of code points or code units. It seems to me that in most applications it would be easier for programmers to access the components of a grapheme when necessary than to have to assemble graphemes from code points, which appears to be necessary even if only to avoid casually breaking a string "mid-grapheme" (at least in theory). Internally, such a string class might use a variable-length encoding such as UTF-8 or UTF-16 (in this context even UTF-32 is effectively variable-length, since one grapheme cluster can span several code points), or it could implement subclasses for all of them, optionally choosing among them at run time so that different languages could use their optimal encodings. But if programmers could "see" grapheme units when inspecting a string, wouldn't string-handling code in general come closer to being correct, without much extra complexity?
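To make the "mid-grapheme" concern concrete, here is a minimal Python sketch. The helper name `approx_graphemes` is hypothetical, and it is only an approximation: it attaches combining marks (general categories Mn, Mc, Me) to the preceding base code point and does not implement the full grapheme cluster rules of UAX #29 (Hangul syllables, emoji ZWJ sequences, regional indicators and so on are not handled).

```python
import unicodedata

# General categories treated as "combining" in this simplified sketch.
COMBINING_CATEGORIES = ("Mn", "Mc", "Me")


def approx_graphemes(text):
    """Group each base code point with the combining marks that follow it.

    Assumption: this is NOT full UAX #29 segmentation, only a rough
    approximation that is good enough to illustrate the problem.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in COMBINING_CATEGORIES:
            clusters[-1] += ch      # attach the mark to the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


word = "cafe\u0301"                 # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(len(word))                    # 5 code points
print(word[:4])                     # "cafe": slicing by code point drops the accent
print(len(approx_graphemes(word)))  # 4 approximate grapheme clusters
```

A grapheme-based string class would, in effect, expose the list returned by `approx_graphemes` as its primary view, with the code points of each cluster available on request.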

References:

Characters and Combining Marks

Unicode implementer's guide part 4: grapheme breaking

UnicodeString Class Reference

Enumerating a string by grapheme instead of character

Strings and character encoding in C++

I don't think so, because grapheme breaks are not the only measure of correctness. Also, what counts as a user-perceived character differs depending on the language and script being used. If you are concerned about normalization, you will also want to look at Normalizer::concatenate. So I would recommend working in code units most of the time and calculating breaks when needed.
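As a rough illustration of both points, here is a small Python sketch. It uses the standard `unicodedata` module rather than ICU, the helper name `safe_truncate` is hypothetical, and the boundary test is a simplification (it only checks for combining marks instead of applying the full UAX #29 rules). The first part keeps the string in its ordinary representation and looks for a break only at the moment a cut is needed; the second part shows that two strings which are each in NFC can produce a concatenation that is not, which is the kind of boundary effect a normalization-aware concatenate has to deal with.

```python
import unicodedata

# General categories treated as "combining" in this simplified sketch.
COMBINING_CATEGORIES = ("Mn", "Mc", "Me")


def safe_truncate(text, max_code_points):
    """Cut text to at most max_code_points without separating a base
    code point from the combining marks that follow it.

    Assumption: only combining marks are considered, not full UAX #29.
    """
    if len(text) <= max_code_points:
        return text
    cut = max_code_points
    # text[cut] is the first code point that would be dropped. If it is a
    # combining mark, the cut falls inside a cluster, so back up.
    while cut > 0 and unicodedata.category(text[cut]) in COMBINING_CATEGORIES:
        cut -= 1
    return text[:cut]


def is_nfc(text):
    """True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# 1. Breaks computed only when needed.
print(safe_truncate("cafe\u0301", 4))   # "caf": the whole "e + accent" cluster is dropped

# 2. Concatenation can break normalization even without any truncation.
left, right = "e", "\u0301"
joined = left + right
print(is_nfc(left), is_nfc(right), is_nfc(joined))   # True True False
```

The second result is one reason a grapheme-oriented string class would not by itself guarantee correct string handling: the two inputs are individually fine, yet their concatenation needs renormalizing around the join.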