Question: In terms of program stability and ensuring that the system will actually operate, how safe is it to use characters like ¦, §, or ‡ for complex delimiter sequences in strings? Can I reliably trust that a program won't run into issues reading these incorrectly?
I am working on a system, written in C#, in which I have to store a fairly complex set of information within a single string. The readability of this string only matters on the computer side; end users should only ever see the information after it has been parsed by the appropriate methods. Because some of the data in these strings will be collections of variable size, I use different delimiters to identify which parts of the string correspond to a given tier of organization. There are enough cases that the standard set of ;, |, and similar characters has been exhausted. I considered two-character delimiters, like ;# or ;|, but that felt inefficient. There probably isn't a large performance difference between storing one character versus two, but when I have the option of picking the smaller one, it just feels wrong to pick the larger.
So finally, I considered using characters like the double dagger (‡) and section sign (§). They only take up one char each, and they are definitely not going to show up in the actual text I'll be storing, so they won't be confused with anything.
But character encoding is finicky. While visibility to the end user is irrelevant (since they won't see it), I recently became concerned about how the programs in the system will read it. The string is stored in one database, while a separate program is responsible for both encoding and decoding the string into different object types for the rest of the application to work with. If something that is expected to be written one way is possibly written another, the whole system could fail, and I can't let that happen. So is it safe to use these kinds of characters as background delimiters?
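For concreteness, the kind of tiered encoding and decoding I have in mind looks roughly like the sketch below. The delimiter assignments and field names are made up for illustration; the real schema is more involved.

using System;

// Illustrative only: '§' separates records, '¦' separates fields within a record,
// and '‡' separates items inside a variable-size collection field.
const char RecordDelimiter = '§';
const char FieldDelimiter = '¦';
const char ItemDelimiter = '‡';

string stored = "Alice¦Engineering¦C#‡SQL‡XML§Bob¦Support¦Email‡Phone";

foreach (string record in stored.Split(RecordDelimiter))
{
    string[] fields = record.Split(FieldDelimiter);
    string name = fields[0];
    string department = fields[1];
    string[] skills = fields[2].Split(ItemDelimiter);
    Console.WriteLine($"{name} ({department}): {string.Join(", ", skills)}");
}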
Since you must encode the data in a string, I am assuming you are interfacing with other systems. Why not use something like XML or JSON for this rather than inventing your own data format?
With XML you can specify the encoding in use, e.g.:
<?xml version="1.0" encoding="UTF-8"?>
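If you can take a dependency on a JSON serializer, a minimal sketch of the JSON route might look like the following. It assumes System.Text.Json is available (Json.NET works the same way), and the Record type here is just a stand-in for whatever structure you actually need to store.

using System;
using System.Collections.Generic;
using System.Text.Json;

// Stand-in type for the data being stored; nesting and variable-size
// collections are handled by the serializer, so no custom delimiters are needed.
public record Record(string Name, List<string> Items);

public static class Demo
{
    public static void Main()
    {
        var data = new List<Record>
        {
            new("Alice", new List<string> { "C#", "SQL" }),
            new("Bob", new List<string> { "Email", "Phone" })
        };

        string stored = JsonSerializer.Serialize(data);
        var roundTripped = JsonSerializer.Deserialize<List<Record>>(stored);

        Console.WriteLine(stored);
        Console.WriteLine(roundTripped[0].Items[1]); // SQL
    }
}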
There is very little danger that any system that stores and retrieves Unicode text will alter those specific characters.
The main characters that can be altered in a text transfer process are the end-of-line markers. For example, FTPing a file from a Unix system to a Windows system in text mode might replace LINE FEED characters with CARRIAGE RETURN + LINE FEED pairs.
After that, some systems may perform a canonical normalization of the text. Combining characters and characters with diacritics on them should not be used unless canonical normalization (either composing or decomposing) is taken into account. The Unicode character database contains information about which transformations are required under these normalization schemes.
That sums up the biggest things to watch out for, and none of them are a problem for the characters that you have listed.
Other transformations that might be made, but are less likely, are case changes and compatibility normalizations. To avoid these, just stay away from alphabetic letters or anything that looks like an alphabetic letter. Some symbols are also converted in a compatibility normalization, so you should check the properties in the Unicode Character Database just to be sure. But it is unlikely that any system will do a compatibility normalization without expressly indicating that it will do so.
In the Unicode Code Charts, canonical normalizations are indicated by "≡" and compatibility normalizations are indicated by "≈".
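As a quick sanity check in C#, you can confirm that a candidate delimiter survives canonical normalization with String.Normalize and a comparison; the characters checked here are the ones from the question.

using System;
using System.Text;

// None of these candidates have canonical decompositions, so normalizing
// to NFC or NFD should return the same single character unchanged.
char[] candidates = { '¦', '§', '‡' };

foreach (char c in candidates)
{
    string s = c.ToString();
    bool stableNfc = s.Normalize(NormalizationForm.FormC) == s;
    bool stableNfd = s.Normalize(NormalizationForm.FormD) == s;
    Console.WriteLine($"U+{(int)c:X4} {c}: NFC stable = {stableNfc}, NFD stable = {stableNfd}");
}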
You could take the same approach as URL or HTML encoding, and replace key characters with sequences of characters, e.g. & becomes &amp;.
Although this results in more chars, it could be pretty efficiently compressed due to the repetition of those sequences.
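One way to realize that idea without hand-rolling an escape table is to percent-encode each field with Uri.EscapeDataString before joining. This sketch assumes plain .NET with no extra libraries, and the '&' joining character is an arbitrary choice.

using System;
using System.Linq;

// Each field is percent-encoded before joining, so the delimiter ('&' here)
// can never appear inside an encoded field.
string[] fields = { "plain text", "has & and ; in it", "even | pipes" };

string stored = string.Join("&", fields.Select(Uri.EscapeDataString));
// "plain%20text&has%20%26%20and%20%3B%20in%20it&even%20%7C%20pipes"

string[] roundTripped = stored.Split('&')
                              .Select(Uri.UnescapeDataString)
                              .ToArray();

Console.WriteLine(stored);
Console.WriteLine(roundTripped[1]); // has & and ; in it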
Well, Unicode is a standard, so as long as everything involved (code, database, etc.) is using Unicode, you shouldn't have any problems.
There are rarer characters in the Unicode set. As far as I know, only the characters below 0x20 (space) have special meanings; anything above that should be preserved in an NVARCHAR data column.
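If the database is SQL Server, the practical point is simply to keep the column NVARCHAR and pass the string as a Unicode parameter. A minimal sketch, assuming a made-up Payloads table with an NVARCHAR(MAX) column named Data and the classic System.Data.SqlClient provider (Microsoft.Data.SqlClient works the same way):

using System.Data;
using System.Data.SqlClient;

// Assumes a table like: CREATE TABLE Payloads (Id INT IDENTITY, Data NVARCHAR(MAX));
// Passing the value as an NVARCHAR parameter keeps characters such as ¦, § and ‡ intact.
string payload = "Alice¦Engineering¦C#‡SQL§Bob¦Support¦Email‡Phone";

using (var connection = new SqlConnection("your connection string here"))
using (var command = new SqlCommand("INSERT INTO Payloads (Data) VALUES (@data)", connection))
{
    command.Parameters.Add("@data", SqlDbType.NVarChar, -1).Value = payload;
    connection.Open();
    command.ExecuteNonQuery();
}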
It is never going to be totally safe unless you have a good specification of which characters can and cannot be part of your data.
Remember some of the laws of Murphy:
"Anything that can go wrong will."
"Anything that can't go wrong, will anyway."
The characters that definitely will not be used may eventually be used, and when they are, the application will definitely fail.
You can use any character you like as a delimiter, as long as you escape the values so that the character is guaranteed not to appear in them. I wrote an example a while back showing that you could even use a common character like "a" as the delimiter.
Escaping the values of course means that some characters will be represented as two characters, but usually that is still less overhead than using a multi-character delimiter. More importantly, it's completely safe.
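A minimal sketch of that escape-then-split idea, using a backslash as the escape character (the helper names and the choice of escape character are mine, not from the example mentioned above):

using System;
using System.Collections.Generic;
using System.Linq;

// '\' escapes both itself and the delimiter, so an unescaped delimiter can
// only ever mean a field boundary; even a common letter like 'a' works.
static string Encode(IEnumerable<string> fields, char delimiter) =>
    string.Join(delimiter.ToString(),
        fields.Select(f => f.Replace("\\", "\\\\")
                            .Replace(delimiter.ToString(), "\\" + delimiter)));

static List<string> Decode(string encoded, char delimiter)
{
    var fields = new List<string>();
    var current = new System.Text.StringBuilder();
    for (int i = 0; i < encoded.Length; i++)
    {
        char c = encoded[i];
        if (c == '\\' && i + 1 < encoded.Length)   // escaped character: take the next one literally
            current.Append(encoded[++i]);
        else if (c == delimiter)                   // unescaped delimiter: field boundary
        {
            fields.Add(current.ToString());
            current.Clear();
        }
        else
            current.Append(c);
    }
    fields.Add(current.ToString());
    return fields;
}

var original = new List<string> { "banana", "contains ; and \\", "delimiter is 'a'" };
string encoded = Encode(original, 'a');
Console.WriteLine(encoded);
Console.WriteLine(string.Join(" | ", Decode(encoded, 'a')));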