Unicode Precomposition and Decomposition with Delphi_问答_开发者

The Wikipedia entry for Subversion contains a paragraph about problems with different ways of Unicode encoding:

While Subversion stores filenames as Unicode, it does not specify if precomposition or decomposition is used for certain accented characters (such as é). Thus, files added in SVN clients running on some operating systems (such as OS X) use decomposition encoding, while clients running on other operating systems (such as Linux) use precomposition encoding, with the consequence that those accented characters do not display correctly if the local SVN client is not using the same encoding as the client used to add the files

While this describes a specific problem with Subversion client implementations, I am not sure if the underlying Unicod开发者_如何学运维e composition problem could also appear with regular Delphi applications. I guess that the problem can only arise if Delphi applications are able to use both Unicode encoding ways (maybe in Delphi XE2). If yes, what could Delphi developers do to avoid it?

There is a minor display issue in that many fonts used on Windows won't render the decomposed form in the ideal way, by using the combined glyph for both the letter and the diacritical. Instead it falls back to rendering the letter and than overlaying the standalone diacritical mark on top, which typically results in a less visually pleasing, potentially-lopsided grapheme.

However that is not the issue the Subversion bug referenced from wiki is talking about. It's actually completely fine to check in filenames to SVN that contain composed or decomposed character sequences; SVN neither knows nor cares about composition, it just uses the Unicode code points as-is. As long as the backend filesystem leaves filenames in the same state as they were put in, all is fine.

Windows and Linux both have filesystems that are equally blind to composition. Mac OS X, unfortunately, does not. Both HFS+ and UFS filesystems perform ‘normalisation’ to decomposed form before storing an incoming filename, so the filename you get back won't necessarily be the same sequence of Unicode code points you put in.

It is this [IMO: insane] behaviour that confuses SVN—and many other programs—when being run on OS X. It's particularly likely to bite because Apple happened to choose decomposed (NFD) as their normalisation form, whereas most of the rest of the world uses composed (NFC) characters.

(And it's not even real NFD, but an incompatible Apple-only variant. Joy.)

The best way to cope with this is, if you can, is never to rely on the exact filename something's stored under. If you only ever read a file from a given name, that's fine, as it'll be normalised to match the filesystem at the time. But if you're reading a directory listing and trying to match filenames you find in there to what you expected the filename to be—which is what Subversion is doing—you're going to get mismatches.

To do a filename match reliably you would have to detect that you're running on OS X, and manually normalise both the filename and the string to some normal form (NFC or NFD) before doing the comparison. You shouldn't do this on other OSes which treat the two forms as different.

AFAICT, both encodings should produce the same results when displaying, and both are valid Unicode, so I don't quite see the problem there. A display routine should be able to handle both if decomposition is encountered for. A code point é should display as-is, while e´ should only display as é in decomposition mode.

The problem is not display, IMO, it is comparison, either for equality (which fails if both use a different encoding) or lexically, i.e. for sorting. That is why one should normalize to one encoding, as David says. That way there are no abmiguities anymore.

The same problem could arise in any application that deals with text. How to avoid it depends on what operations the application is performing and the question lacks specific details. Mostly I think you'd solve such problems by normalizing the text. This involves using a single preferred representation whenever you encounter ambiguity of encoding.