Are null terminators part of text encoding?

Developer https://www.devze.com 2023-01-28 07:44 Source: Internet

I'm trying to read a null terminated string from a byte array; the parameter to the function is the encoding.

string ReadString(Encoding encoding)

For example, "foo" encoded in the following encodings is:

UTF-32: 66 00 00 00 6f 00 00 00 6f 00 00 00
UTF-8:  66 6f 6f
UTF-7:  66 6f 6f 2b 41 41 41 2d

If I copied all the bytes up to the first zero byte into an array and passed that array into encoding.GetString(), it wouldn't work: if the string was UTF-32 encoded, my algorithm would hit a "null terminator" at the second byte.

So I sort of have a double question: Are null terminators part of the encoding? If not, how could I decode the string character by character and check the following byte for the null terminator?

Thanks in advance

(suggestions are also appreciated)

Edit:

If "foo" was null terminated and utf-32 encoded, which would it be?:

1. 66 00 00 00 6f 00 00 00 6f 00 00 00   00
2. 66 00 00 00 6f 00 00 00 6f 00 00 00   00 00 00 00

The null terminator is not "logically" part of the string; it's not considered payload. It's widely used in C/C++ to indicate where the string ends.

Having said that, you can have strings with embedded \0's, but then you have to be careful to ensure the string doesn't appear truncated. For example, std::string doesn't have a problem with embedded \0's. But if you call c_str() and don't account for the reported length(), your string will appear cut off.

Null terminators are not part of the encoding; they belong to the string representation used by some programming languages, such as C. In .NET, System.String carries its length as a 32-bit integer and does not depend on a null terminator to mark where it ends. Internally System.String is always UTF-16, but you can use an Encoding to produce other representations.
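To make that concrete: if you treat the terminator as just the character U+0000 run through whichever encoding you picked, its size follows from the encoding. Here is a minimal sketch; the class and method names (TerminatorBytes, For, Hex) are made up for illustration and are not part of .NET.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class TerminatorBytes
{
    // Bytes that the single character U+0000 occupies in the given encoding.
    // Encoding.GetBytes does not emit a byte order mark.
    public static byte[] For(Encoding encoding) => encoding.GetBytes("\0");

    // Formats bytes as space-separated hex, e.g. "66 6F 6F".
    public static string Hex(byte[] bytes) =>
        BitConverter.ToString(bytes).Replace("-", " ");
}
```

With this, TerminatorBytes.For(Encoding.UTF8).Length is 1, TerminatorBytes.For(Encoding.Unicode).Length is 2, and TerminatorBytes.For(Encoding.UTF32).Length is 4. So under that reading, a null-terminated UTF-32 "foo" would be the second candidate in the question's edit, the one ending in four zero bytes.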

For the second part: use the classes in System.Text, such as UTF8Encoding and UTF32Encoding, to read the string. You just have to select the right one based on your parameter.
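One way this could look, as a rough sketch rather than a definitive implementation: scan the buffer in steps of the encoding's code unit size until an all-zero code unit turns up, then decode only the bytes before it. The signature below (with an explicit buffer and offset) and the names NullTerminatedReader and IsZero are assumptions for illustration. It assumes UTF-8, UTF-16 or UTF-32 data aligned to the starting offset; it does not cover UTF-7, where a null can appear in an escaped form.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class NullTerminatedReader
{
    // Decodes the string that starts at 'offset' and ends at the first
    // all-zero code unit (or at the end of the buffer if there is none).
    public static string ReadString(byte[] buffer, int offset, Encoding encoding)
    {
        // Width of U+0000 in this encoding: 1 (UTF-8), 2 (UTF-16), 4 (UTF-32).
        int unit = encoding.GetByteCount("\0");

        int end = offset;
        // Step by whole code units so a zero byte inside a wider
        // character (e.g. 66 00 00 00) is not mistaken for the terminator.
        while (end + unit <= buffer.Length && !IsZero(buffer, end, unit))
        {
            end += unit;
        }

        return encoding.GetString(buffer, offset, end - offset);
    }

    // True if 'count' bytes starting at 'start' are all zero.
    private static bool IsZero(byte[] buffer, int start, int count)
    {
        for (int i = 0; i < count; i++)
        {
            if (buffer[start + i] != 0) return false;
        }
        return true;
    }
}
```

For example, NullTerminatedReader.ReadString(Encoding.UTF32.GetBytes("foo\0bar"), 0, Encoding.UTF32) gives "foo", because the scan only stops at offset 12, where all four bytes are zero.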

This seems to work well for me (sample from actual code that reads a Unicode (UTF-16), null-terminated string from a byte array):

 // Decode the whole buffer as UTF-16, then strip the trailing null terminator(s).
 // TrimEnd is used so that only nulls at the end of the string are removed.
 byte[] languageId = ...
 string language = Encoding.Unicode.GetString(languageId,
                                              0,
                                              languageId.Length).TrimEnd('\0');
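A possible variation, in case the buffer is fixed-size and may hold leftover data after the terminator: decode first, then cut at the first '\0' instead of trimming. This is only a sketch, and the names DecodedStringHelper and CutAtFirstNull are invented for illustration. It assumes the whole buffer is valid in the chosen encoding.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class DecodedStringHelper
{
    // Returns the part of 'decoded' before the first U+0000,
    // or the whole string if it contains no null character.
    public static string CutAtFirstNull(string decoded)
    {
        int index = decoded.IndexOf('\0');
        return index >= 0 ? decoded.Substring(0, index) : decoded;
    }
}
```

Used as DecodedStringHelper.CutAtFirstNull(Encoding.Unicode.GetString(bytes)), this drops the terminator and anything that follows it, whereas trimming only removes nulls at the edges.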