I'm sure this is something very simple that I'm screwing up, but here goes:
I'm trying to parse a log file that is generally formatted in UNICODE (and I'll freely admit that I don't generally know much about UNICODE, but the first two bytes of the file are 0xFFFE, and there's a zero between every other character). The peculiar part is that this file appears to end lines with the byte sequence 0x0D000D0A, that is, \r\0\r\n, and that's apparently confusing my TextReader.
That is, every other line I print is filled with:
?????????????????? ???????????? ? ????????? ? ????????????? ? ?????????????? ???? ??? ????? ???????????????????? ??? ???????????? ????????????????? ?????????????????????? ???????????????????? ?????? ????????????????????? ????????????? ?????
What is the recommended way for me to go about parsing this using C#? Or rather, what am I doing wrong?
Thanks!
Update: Sorry, I should have probably included the code I was using in my initial posting. Here it is:
FileStream fsa = File.Open(@"C:\InboxLOG.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
TextReader sr = new StreamReader(fsa, Encoding.Unicode, true);
string line = "";
while ((line = sr.ReadLine()) != null)
{
Console.WriteLine(line);
}
Using StreamReader(fsa) produces the same results.
Hmmm... 0x0D000D0A?
Your line endings indeed look borked. You might have to parse it more manually via a Stream... I would have expected 0x0D000A00? (since this is little-endian). I wonder if a non-Unicode process has done a "replace LF with CRLF" sweep and mucked it up. You could of course do the same, and (processing bytes in blocks of two) replace 0D0A with 0A00 (starting on even bytes only). But starting with non-corrupt data is always a better option...
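A rough sketch of that kind of byte-level repair, under one specific assumption that the question does not confirm: that some tool inserted a single extra 0x0D byte in front of each 0x0A, turning the UTF-16LE line ending 0D 00 0A 00 into 0D 00 0D 0A 00. Instead of replacing bytes in place, this variation drops the stray 0x0D so the following text stays aligned on two-byte boundaries. The names LogRepair, RepairLineEndings and ReadLines are made up for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class LogRepair
{
    // Hypothetical helper. ASSUMPTION: every broken line ending is the
    // five-byte run 0D 00 0D 0A 00; the third byte is treated as stray.
    public static byte[] RepairLineEndings(byte[] raw)
    {
        var repaired = new List<byte>(raw.Length);
        for (int i = 0; i < raw.Length; i++)
        {
            bool strayCr = i >= 2 && i + 2 < raw.Length
                && raw[i - 2] == 0x0D && raw[i - 1] == 0x00
                && raw[i] == 0x0D
                && raw[i + 1] == 0x0A && raw[i + 2] == 0x00;
            if (!strayCr)
            {
                repaired.Add(raw[i]); // keep every byte except the stray CR
            }
        }
        return repaired.ToArray();
    }

    // Decodes the repaired bytes as UTF-16LE and splits them into lines.
    public static List<string> ReadLines(byte[] raw)
    {
        var lines = new List<string>();
        using (var reader = new StreamReader(
            new MemoryStream(RepairLineEndings(raw)), Encoding.Unicode, true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```

For a file small enough to load at once, something like LogRepair.ReadLines(File.ReadAllBytes(path)) would feed it. If the real corruption differs from the assumed pattern, the match condition has to change accordingly.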
was:
0xFFFE is a BOM, so anything involving StreamReader etc. (such as File.OpenText) should handle this automatically and choose the right encoding. If not, give it a clue:
using(var reader = new StreamReader(path, Encoding.Unicode)) {
...
}
Please try this:
StreamReader reader = new StreamReader(filePath, System.Text.Encoding.Unicode, true);
It looks like UTF-16 encoding; 0xFFFE is the byte order mark:
http://en.wikipedia.org/wiki/Byte_order_mark
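To double-check which encoding the leading bytes point to, a small sketch like the following can help. BomSniffer and DescribeBom are hypothetical names, and only a few common BOMs are covered. The bytes FF FE, as described in the question, indicate little-endian UTF-16.

```csharp
static class BomSniffer
{
    // Hypothetical helper: names the encoding suggested by the leading
    // bytes, or returns null when no known BOM is present.
    public static string DescribeBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return "UTF-8";
        // FF FE 00 00 is the UTF-32LE BOM; it is checked before UTF-16LE
        // because it starts with the same two bytes. (A UTF-16LE file whose
        // first character is U+0000 would look identical; that case is ignored.)
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE
            && head[2] == 0x00 && head[3] == 0x00)
            return "UTF-32LE";
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return "UTF-16LE";
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return "UTF-16BE";
        return null;
    }
}
```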
I'm guessing you're actually using a StreamReader as TextReader is an abstract class.
From your description, your text is in UTF-16, but StreamReader defaults to UTF-8. When you construct your StreamReader, you need to tell it to use UTF-16 instead:
new StreamReader(..., System.Text.Encoding.Unicode);