Parsing Peculiar Newlines

Developer https://www.devze.com 2022-12-13 01:41 Source: web

I'm sure this is something very simple that I'm screwing up, but here goes:

I'm trying to parse a log file that is generally formatted in Unicode. I'll freely admit that I don't know much about Unicode, but the first two bytes of the file are 0xFFFE, and every other byte is zero. The peculiar part is that this file appears to end lines with the byte sequence 0x0D000D0A, that is, \r\0\r\n, and that apparently prevents my TextReader from reading it correctly.

That is, every other line I print is filled with:

?????????????????? ???????????? ?      ?????????  ? ?????????????  ? ?????????????? ???? ??? ????? ???????????????????? ??? ???????????? ????????????????? ?????????????????????? ???????????????????? ?????? ????????????????????? ????????????? ?????

What is the recommended way for me to go about parsing this using C#? Or rather, what am I doing wrong?

Thanks!

Update: Sorry, I should probably have included the code I was using in my initial post. Here it is:

FileStream fsa = File.Open(@"C:\InboxLOG.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
TextReader sr = new StreamReader(fsa, Encoding.Unicode, true);
string line = "";
while ((line = sr.ReadLine()) != null)
{
    Console.WriteLine(line);
}

Using StreamReader(fsa) produces the same results.

Hmmm... 0x0D000D0A?

Your line endings do indeed look broken. You might have to parse the file more manually via a Stream. I would have expected 0x0D000A00 (since this is little-endian). I wonder whether a non-Unicode process has done a "replace LF with CRLF" sweep and corrupted the data. You could of course do a similar sweep yourself: processing the bytes in blocks of two, starting on even offsets only, replace 0D0A with 0A00. But starting with non-corrupt data is always the better option.
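For illustration, here is a minimal sketch of a byte-level repair. It is a variant of the idea above rather than the exact block-of-two replacement. It assumes, without confirmation from the question, that a byte-level LF to CRLF sweep inserted one extra 0x0D byte before each 0x0A. That would shift every second line by one byte, which might also explain why only every other line is garbled. The class and method names (NewlineRepair, StripInsertedCr) are hypothetical. Note that the sketch would also damage any legitimate text in which a 0x0D byte directly precedes a 0x0A byte, so treat it as a starting point only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class NewlineRepair
{
    // Drops every 0x0D byte that is immediately followed by 0x0A.
    // Assumption: a byte-level "LF -> CRLF" sweep inserted those 0x0D bytes,
    // so intact UTF-16LE text in this file never has 0x0D directly before 0x0A.
    public static byte[] StripInsertedCr(byte[] raw)
    {
        var repaired = new List<byte>(raw.Length);
        for (int i = 0; i < raw.Length; i++)
        {
            bool insertedCr = raw[i] == 0x0D
                              && i + 1 < raw.Length
                              && raw[i + 1] == 0x0A;
            if (!insertedCr)
            {
                repaired.Add(raw[i]);
            }
        }
        return repaired.ToArray();
    }

    static void Main()
    {
        // "a\r\nb" as UTF-16LE with a BOM, after the assumed sweep
        // turned the bytes 0D 00 0A 00 into 0D 00 0D 0A 00.
        byte[] damaged =
        {
            0xFF, 0xFE, 0x61, 0x00, 0x0D, 0x00, 0x0D, 0x0A, 0x00, 0x62, 0x00
        };
        byte[] repaired = StripInsertedCr(damaged);

        // Decode the repaired bytes as UTF-16LE and print line by line.
        using (var reader = new StreamReader(new MemoryStream(repaired), Encoding.Unicode, true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Console.WriteLine(line);
            }
        }
    }
}
```

For a real log you would read the bytes with File.ReadAllBytes, repair them, and then wrap the result in a MemoryStream as in Main. Check the output against a hex dump of the original before trusting it.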

Earlier version of this answer:

0xFFFE is a BOM, so anything involving StreamReader (such as File.OpenText) should handle this automatically and choose the right encoding. If not, give it a hint:

using(var reader = new StreamReader(path, Encoding.Unicode)) {
    ...
}

Please try this

StreamReader reader = new StreamReader(filePath, System.Text.Encoding.Unicode, true);

This looks like UTF-16 encoding; 0xFFFE is the byte order mark:

http://en.wikipedia.org/wiki/Byte_order_mark

I'm guessing you're actually using a StreamReader, as TextReader is an abstract class.

From your description, your text is in UTF-16, but StreamReader defaults to UTF-8. When you construct your StreamReader, you need to tell it to use UTF-16 instead:

new StreamReader(..., System.Text.Encoding.Unicode);
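As a rough check of the encoding advice in these answers, the sketch below decodes the same BOM-prefixed UTF-16LE bytes twice: once with the default StreamReader constructor, which starts from UTF-8 but detects a byte order mark, and once with Encoding.Unicode passed explicitly. The class name and helper methods are hypothetical and the sample text is made up. If both calls return the same string, that fits the asker's remark that StreamReader(fsa) produced the same results, and suggests the encoding argument alone is not the cause of the garbled lines.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetectionDemo
{
    // Builds UTF-16LE bytes for the text, prefixed with the FF FE byte order mark.
    public static byte[] Utf16LeWithBom(string text)
    {
        using (var buffer = new MemoryStream())
        {
            byte[] bom = Encoding.Unicode.GetPreamble();
            byte[] body = Encoding.Unicode.GetBytes(text);
            buffer.Write(bom, 0, bom.Length);
            buffer.Write(body, 0, body.Length);
            return buffer.ToArray();
        }
    }

    // Decodes with the default constructor: UTF-8 unless a BOM says otherwise.
    public static string ReadWithDetection(byte[] data)
    {
        using (var reader = new StreamReader(new MemoryStream(data)))
        {
            return reader.ReadToEnd();
        }
    }

    // Decodes with UTF-16LE requested explicitly.
    public static string ReadAsUtf16(byte[] data)
    {
        using (var reader = new StreamReader(new MemoryStream(data), Encoding.Unicode))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        byte[] data = Utf16LeWithBom("first\r\nsecond");
        // Compare the two decodings of the same bytes.
        Console.WriteLine(ReadWithDetection(data) == ReadAsUtf16(data));
    }
}
```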
