How to best read and UTF-8-decode byte buffers?


Developer https://www.devze.com 2023-03-30 20:33 Source: the web


Related topics: encoding, utf-8

I have a Stream that produces UTF-8 encoded strings. The strings represent XML documents that I need to parse. The stream is obtained from a TcpClient.

Suppose I read the stream into buffers of size 64 (a little small, I know). Passing these 64-byte buffers directly to the string decoding step could fail because some UTF-8 encoded characters may be split across a 64-byte boundary. One buffer may end with the first two bytes of a character while the next buffer starts with the last byte of that character.
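To make the failure concrete, here is a minimal sketch in Python (used only for illustration; the question itself is about .NET). The sample string and the 4-byte chunk size are assumptions, picked so that a multi-byte character straddles the chunk boundary.

```python
# "€" is encoded in UTF-8 as three bytes: E2 82 AC.
data = "abc€".encode("utf-8")  # 6 bytes: 61 62 63 E2 82 AC

CHUNK = 4  # hypothetical buffer size, small enough to split the "€"
chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]


def decode_each_chunk(chunks):
    """Naively decode every chunk on its own; raises on a split character."""
    return "".join(chunk.decode("utf-8") for chunk in chunks)


try:
    decode_each_chunk(chunks)
    split_failed = False
except UnicodeDecodeError:
    # The first chunk ends with E2, the lead byte of an incomplete sequence.
    split_failed = True
```

Decoding the concatenated bytes in one call works; only the per-chunk decoding trips over the boundary.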

What I do now is concatenate buffers until I perform a read that doesn't fill the full 64 bytes, indicating that I have read to the end of something (in my case, an XML document). However, once in a while, an XML document I read ends exactly at the 64-byte boundary. In such a case, I do not know that I can pass the byte array to the decoding step (and I end up waiting for the next document).

I realize I can lower the chances by increasing the buffer size. However, a small chance always remains that it happens. I could also increase the buffer size such that any XML document I encounter will fit, but I just wonder whether there is another solution, somehow detecting from the byte stream where the character boundaries are.

You are right about the problems and pitfalls.

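The solution already exists: wrap a StreamReader around your stream and use Read() and ReadLine(). The reader keeps the decoding state between reads, so a character split across two buffers is still decoded correctly.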
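The same idea can be sketched in Python, where io.TextIOWrapper plays a role comparable to StreamReader. This is an illustration under assumptions, not the .NET code: TrickleStream is a hypothetical stand-in for a network stream that hands out only two bytes per read, and the payload is made up so that the "€" is split between two reads.

```python
import io


class TrickleStream(io.RawIOBase):
    """Hypothetical stand-in for a network stream: at most `step` bytes per read."""

    def __init__(self, payload, step):
        super().__init__()
        self._payload = payload
        self._pos = 0
        self._step = step

    def readable(self):
        return True

    def readinto(self, buffer):
        n = min(self._step, len(buffer), len(self._payload) - self._pos)
        buffer[:n] = self._payload[self._pos:self._pos + n]
        self._pos += n
        return n  # 0 signals end of stream


payload = "<doc>prix: 5 €</doc>\n".encode("utf-8")
raw = TrickleStream(payload, step=2)

# The text wrapper decodes incrementally, carrying partial characters
# over from one underlying read to the next.
reader = io.TextIOWrapper(io.BufferedReader(raw), encoding="utf-8")
line = reader.readline()
```

The caller never sees the chunk boundaries; it only asks for characters or lines.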

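If you do want a DIY solution, you'll have to look at the state kept by a Decoder (for bytes to characters it is the Decoder returned by Encoding.GetDecoder(), not the Encoder, that remembers incomplete trailing bytes between calls). Beyond my capabilities.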
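For a rough idea of what the DIY route looks like, here is a sketch using Python's incremental decoder, which behaves much like a stateful .NET Decoder: bytes of an unfinished character are held back until the rest arrives. The function name decode_chunks and the sample data are assumptions for illustration.

```python
import codecs


def decode_chunks(chunks):
    """Decode byte chunks one by one, keeping partial characters between calls."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = []
    for chunk in chunks:
        # Returns only the characters that are complete so far.
        pieces.append(decoder.decode(chunk))
    # final=True raises if the stream ended in the middle of a character.
    pieces.append(decoder.decode(b"", final=True))
    return pieces


data = "a€b".encode("utf-8")  # 61 E2 82 AC 62
# Split the three-byte "€" across all three chunks.
pieces = decode_chunks([data[:2], data[2:3], data[3:]])
# pieces == ["a", "", "€b", ""]
```

Each buffer can be handed to the decoder as soon as it arrives, regardless of where the character boundaries fall.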

I believe that your approach is theoretically flawed, even if it should always work correctly in practice: there is no guarantee that a read returning fewer bytes than the buffer size means that an XML document has been received in its entirety. The TCP stack is fully within its rights to give you back the document one byte at a time. Increasing the buffer size to several KB should cause this problem to manifest itself.

Addressing the above flaw properly will also solve your current issue: prepend some kind of fixed-length header (e.g. 8 bytes) that contains the following document's length before each XML document in your TCP stream. You will always know when you have read a full header (because it's fixed size), and given the header you will know when you have received the whole document.
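A minimal sketch of such framing, in Python for illustration. The 8-byte big-endian unsigned length is an assumed header layout (the answer only says "e.g. 8 bytes"), and the names frame, read_exact and read_document are hypothetical. An in-memory stream stands in for the TCP connection.

```python
import io
import struct

# Assumed header layout: 8-byte unsigned big-endian body length, in bytes.
HEADER = struct.Struct(">Q")


def frame(document):
    """Encode a document and prepend its byte length."""
    body = document.encode("utf-8")
    return HEADER.pack(len(body)) + body


def read_exact(stream, count):
    """Keep reading until exactly `count` bytes have arrived."""
    buf = bytearray()
    while len(buf) < count:
        chunk = stream.read(count - len(buf))
        if not chunk:
            raise EOFError("stream closed in the middle of a message")
        buf += chunk
    return bytes(buf)


def read_document(stream):
    """Read one header, then exactly that many bytes, then decode once."""
    (length,) = HEADER.unpack(read_exact(stream, HEADER.size))
    return read_exact(stream, length).decode("utf-8")


wire = io.BytesIO(frame("<a>é</a>") + frame("<b/>"))
first = read_document(wire)
second = read_document(wire)
```

Because decoding happens only once the whole body is in hand, no character can be split, and short reads from the transport no longer matter. Note that the length counts bytes, not characters.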