
Problem with indexed XML file

Developer https://www.devze.com 2023-01-09 09:31 Source: Internet
I scanned a 2.8 GB XML file to record the positions (an index) of particular tags. Then I use the Seek method to set a start point in that file. The file is UTF-8 encoded. The indexing looks like this:


using(StreamReader sr = new StreamReader(pathToFile)){
  long index = 0;
  while(!sr.EndOfStream){
    string line = sr.ReadLine();
    index += (line.Length + 2); // account for the \r\n chars

    if(LineHasTag(line)){
      SaveIndex(index-line.Length); //need beginning of the line
    }
  }
}

Afterwards I have the indexed positions stored in another file. But when I use Seek, the result is wrong: the position ends up somewhere before where it should be. I loaded some of the file's content into a char array and manually checked the correct index of a tag I need. It matches the index produced by the code above. Still, the Seek method on StreamReader.BaseStream places the pointer earlier in the file. Quite strange.

Any suggestions?

Best regards, ventus


Seek deals in bytes - you're assuming there's one byte per character. In UTF-8, one character in the BMP can take up to three bytes.

My guess is that you've got non-ASCII characters in your file - those will take more than one byte.
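To make the mismatch concrete, here is a minimal sketch that compares a line's character count (what line.Length reports) with its UTF-8 byte count (what Seek needs). The class name ByteCountDemo and the sample strings are made up for illustration.

```csharp
using System;
using System.Text;

static class ByteCountDemo
{
    // Returns the number of chars (UTF-16 code units) in the line and the
    // number of bytes the same line occupies when encoded as UTF-8.
    public static (int Chars, int Bytes) Measure(string line)
    {
        return (line.Length, Encoding.UTF8.GetByteCount(line));
    }
}
```

For a pure ASCII line such as "<tag>abc</tag>", Measure returns 14 chars and 14 bytes. For "<tag>\u017C\u00F3\u0142w</tag>" (three accented Latin letters plus "w") it returns 15 chars but 18 bytes, so every such line pushes a character-based index further behind the true byte position.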

I think there may also be a potential problem with the byte order mark, if there is one. I can't remember offhand whether StreamReader will swallow that automatically - which would put you 3 bytes out to start with.
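One way to sidestep both issues is to build the index from raw bytes instead of decoded characters. The sketch below is only an illustration under assumptions: the names ByteOffsetIndexer, IndexLines and ReadLineAt are hypothetical, the hasTag callback stands in for LineHasTag from the question, and lines are assumed to end in \n or \r\n. It relies on the fact that the byte 0x0A never occurs inside a multi-byte UTF-8 sequence, so splitting on it is safe.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class ByteOffsetIndexer
{
    // Returns the byte offset of the start of every line for which hasTag is true.
    // Pass a FileStream (already buffered internally) for a real file.
    public static List<long> IndexLines(Stream stream, Func<string, bool> hasTag)
    {
        var offsets = new List<long>();
        var lineBytes = new MemoryStream();
        long position = 0;   // number of bytes consumed so far
        long lineStart = 0;  // byte offset where the current line begins

        while (true)
        {
            int b = stream.ReadByte();
            if (b >= 0)
            {
                position++;
                if (b != '\n') { lineBytes.WriteByte((byte)b); continue; }
            }
            if (b < 0 && lineBytes.Length == 0) break; // end of file, nothing pending

            byte[] raw = lineBytes.ToArray();
            // A UTF-8 byte order mark (EF BB BF) can only appear at offset 0.
            int skip = (lineStart == 0 && raw.Length >= 3 &&
                        raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF) ? 3 : 0;
            string line = Encoding.UTF8.GetString(raw, skip, raw.Length - skip).TrimEnd('\r');
            if (hasTag(line)) offsets.Add(lineStart + skip);

            lineStart = position;
            lineBytes.SetLength(0);
            if (b < 0) break; // last line had no trailing newline
        }
        return offsets;
    }

    // Seeks to a previously recorded byte offset and decodes one line from there.
    public static string ReadLineAt(Stream stream, long offset)
    {
        stream.Seek(offset, SeekOrigin.Begin);
        using (var reader = new StreamReader(stream, Encoding.UTF8, false, 1024, leaveOpen: true))
        {
            return reader.ReadLine();
        }
    }
}
```

Because the offsets are counted on the underlying stream, any byte order mark is already included in them, and they can be passed straight to Seek. Note that a new StreamReader is created after each Seek here; reusing one reader across seeks would also require calling DiscardBufferedData on it.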
