I'm writing a class that works against a byte[] buffer. It contains methods like char Peek() and string ReadRestOfLine().
The problem is that I would like to add support for Unicode, and I don't really know how I should change those methods (they only support ASCII now).
How do I detect that the next bytes in the buffer are a Unicode sequence (UTF-8 or UTF-16)? And how do I convert them to a char?
Update
Yes, the class is a bit similar to StreamReader, but with the difference that it avoids creating objects (like string, char[], etc.) until the entire wanted string has been found. It's used in a high-performance socket framework.
For instance: let's say that I want to write a proxy that only checks the URI in an HTTP request. If I were to use StreamReader, I would have to build a temporary char array each time a new receive has completed, just to see whether a newline character has been received.
By using a class that works directly against the byte[] buffer that socket.ReceiveAsync uses, I just have to traverse the buffer in my parser to know whether the next step can be completed. No temporary objects are created.
For most protocols ASCII is used in the header area and UTF-8 will not be a problem (the request body can be parsed using StreamReader). I'm just interested in how it can be solved without creating unnecessary objects.
I don't think you want to go there. A lot can go wrong. First of all: what encoding are you using? Then, does the buffer contain the entire encoded string? Or does it start at some random position, possibly inside a multi-byte sequence?
Your class sounds a bit like a StreamReader for a MemoryStream. Maybe you can use those?
From the documentation:
Implements a TextReader that reads characters from a byte stream in a particular encoding.
If the point of your exercise is to figure out how to do this yourself, take a peek at how the library did it. I think you'll find the method StreamReader.Read() interesting:
Reads the next character from the input stream and advances the character position by one character.
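On the concern above about a buffer that ends (or starts) inside an encoded sequence: a stateful System.Text.Decoder, obtained from Encoding.GetDecoder(), keeps the trailing bytes of an incomplete sequence between calls. The following is only a minimal sketch, assuming UTF-8 input; the class and method names (SplitSequenceDemo, DecodeInTwoChunks) are hypothetical, and the char[] it allocates is for illustration only, so it does not meet the question's no-allocation goal.

```csharp
using System;
using System.Text;

static class SplitSequenceDemo
{
    // Hypothetical helper: decodes 'bytes' as if it had arrived in two receives,
    // split at index 'split', which may fall in the middle of a multi-byte sequence.
    public static string DecodeInTwoChunks(byte[] bytes, int split)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();

        // UTF-8 never produces more UTF-16 chars than it has input bytes.
        char[] chars = new char[bytes.Length];

        // flush: false tells the decoder to keep an incomplete trailing sequence.
        int n = decoder.GetChars(bytes, 0, split, chars, 0, false);

        // The second call completes the pending sequence; flush: true ends the input.
        n += decoder.GetChars(bytes, split, bytes.Length - split, chars, n, true);

        return new string(chars, 0, n);
    }
}
```

For example, the UTF-8 bytes of "aéb" are 61 C3 A9 62; splitting at index 2 cuts the two-byte 'é' in half, and the decoder still reassembles it.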
There is a one-to-one correspondence between bytes and ASCII characters, which makes it easy to treat bytes as characters. Modifying your code to handle the various Unicode encodings may not be easy. However, to answer part of your question:
How do I detect that the next bytes in the buffer are a Unicode sequence (UTF-8 or UTF-16)? And how do I convert them to a char?
You can use the System.Text.Encoding class. The predefined encoding objects Encoding.Unicode (UTF-16, little-endian) and Encoding.UTF8 provide methods like GetCharCount, GetChars and GetString.
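As a rough illustration of those methods, the sketch below calls the offset/count overloads directly on a slice of a byte[]. It assumes the slice holds complete UTF-8 sequences; the class and method names (EncodingSliceDemo, CountChars, DecodeSlice) are made up for this example.

```csharp
using System;
using System.Text;

static class EncodingSliceDemo
{
    // Number of chars the slice decodes to, without producing a string.
    public static int CountChars(byte[] buffer, int offset, int count)
    {
        return Encoding.UTF8.GetCharCount(buffer, offset, count);
    }

    // Decodes the slice into a caller-supplied char[] (reusable between calls).
    // Returns the number of chars written.
    public static int DecodeInto(byte[] buffer, int offset, int count, char[] target)
    {
        return Encoding.UTF8.GetChars(buffer, offset, count, target, 0);
    }

    // Decodes the slice straight into a string (one allocation).
    public static string DecodeSlice(byte[] buffer, int offset, int count)
    {
        return Encoding.UTF8.GetString(buffer, offset, count);
    }
}
```

With the UTF-8 bytes of "héllo\r\n" (8 bytes, because 'é' takes two), decoding the first 6 bytes gives 5 chars. For UTF-16 input, Encoding.Unicode offers the same overloads.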
I've created a BufferSlice class which wraps the byte[] buffer and makes sure that only the assigned slice is used. I've also created a custom reader to parse the buffer.
UTF-8 turned out not to be a problem, since I only parse the buffer to find characters that are not multi-byte (space, minus, semicolon, etc.). I then use Encoding.GetString on the bytes from the last delimiter to the current one to get a proper string back.
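The approach described above might look roughly like the sketch below. It is not the actual BufferSlice or reader code, which isn't shown here; LineScanner and its methods are hypothetical names. It relies on a property of UTF-8: every byte of a multi-byte sequence has its high bit set, so an ASCII delimiter byte such as '\n' can never occur inside one. The same is not true of UTF-16.

```csharp
using System;
using System.Text;

static class LineScanner
{
    // Returns the index of the first LF in buffer[offset .. offset+count), or -1.
    // Scans raw bytes only; no objects are allocated.
    public static int FindLineFeed(byte[] buffer, int offset, int count)
    {
        for (int i = offset; i < offset + count; i++)
        {
            if (buffer[i] == (byte)'\n') return i;
        }
        return -1;
    }

    // Returns the first line without its CR LF / LF terminator,
    // or null if no complete line has been received yet.
    public static string ReadLine(byte[] buffer, int offset, int count)
    {
        int lf = FindLineFeed(buffer, offset, count);
        if (lf < 0) return null; // incomplete line: wait for more data

        int end = lf;
        if (end > offset && buffer[end - 1] == (byte)'\r') end--; // drop the CR

        // The only allocation: the final string, decoded once.
        return Encoding.UTF8.GetString(buffer, offset, end - offset);
    }
}
```

The string is only created once a full line is known to be present, which is the point of scanning the bytes first.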