I am trying to create a lexicographically sorted index of words along with their positions in a text file.
With the help of experts on this forum I can already build the lexicographically sorted index of words. I now need help storing the position of each entry in that index.
This is what I have so far. A text file (sometextfile.txt) contains the following data: "This is a sample text file"
    private const string filepath = @"d:\sometextfile.txt";

    using (StreamReader sr = File.OpenText(filepath))
    {
        string input;
        // Dictionary to store the position of the characters in the file as long
        // and the lexicographically sorted value as string
        var parts = new Dictionary<long,string>();
        while ((input = sr.ReadLine()) != null)
        {
            string[] words = input.Split(' ');
            foreach (var word in words)
            {
                var sortedSubstrings =
                    Enumerable.Range(0, word.Length)
                        .Select(i => word.Substring(i))
                        .OrderBy(s => s);
                parts.AddRange(<store the position of the character>, sortedSubstrings);
            }
        }
    }
Using ReadLine loses critical information about your position in the file, if you intend the position to be a byte offset that you can seek to. The end of a line can be marked by a carriage return (\r), a line feed (\n), or both (\r\n), and ReadLine strips the terminator, so you cannot tell how many bytes it occupied. Depending on the encoding of the text file, characters may also be represented with varying numbers of bytes, which you may need to handle as well. I suggest reading the file at a lower level so you can track your position.
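    var parts = new Dictionary<long,string>();
    using (System.IO.StreamReader sr = new System.IO.StreamReader(myfile))
    {
        var sb = new System.Text.StringBuilder();
        long currentPosition = 0;   // byte offset of the next character (a byte order mark skipped by StreamReader is not counted)
        long wordPosition = 0;      // byte offset at which the current word started
        bool wordStarted = false;
        int nextCharNum = sr.Read();    // Read() returns -1 at end of file
        while (nextCharNum >= 0)
        {
            char nextChar = (char)nextCharNum;
            switch(nextChar)
            {
                // A space or line terminator ends the current word, if any
                case ' ':
                case '\r':
                case '\n':
                    if (wordStarted)
                    {
                        parts[wordPosition] = sb.ToString();
                        sb.Clear();
                        wordStarted = false;
                    }
                    break;
                default:
                    sb.Append(nextChar);
                    if (!wordStarted)
                    {
                        // First character of a new word: remember where it began
                        wordPosition = currentPosition;
                        wordStarted = true;
                    }
                    break;
            }
            // Advance by the encoded size of this character.
            // Characters outside the BMP arrive as two surrogate chars and may be miscounted here.
            currentPosition += sr.CurrentEncoding.GetByteCount(nextChar.ToString());
            nextCharNum = sr.Read();
        }
        // Flush the last word if the file does not end with a separator
        if (wordStarted)
            parts[wordPosition] = sb.ToString();
    }
    foreach (var de in parts)
    {
        Console.WriteLine("{0} {1}", de.Key, de.Value);
    }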
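One way to sanity-check a stored key is to seek to it and read the word back. The following is only a sketch under some assumptions: the class name PositionCheck and the method ReadWordAt are hypothetical, the stream is seekable, the offset already accounts for any byte order mark at the start of the file, and the encoding passed in is the one the file was written with.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the answer above.
public static class PositionCheck
{
    // Seeks to byteOffset and decodes as many bytes as expectedWord occupies
    // in the given encoding. If the offset is right, the result equals expectedWord.
    public static string ReadWordAt(Stream stream, long byteOffset, string expectedWord, Encoding encoding)
    {
        int byteCount = encoding.GetByteCount(expectedWord);
        var buffer = new byte[byteCount];

        stream.Seek(byteOffset, SeekOrigin.Begin);

        // Stream.Read may return fewer bytes than requested, so loop until done.
        int read = 0;
        while (read < byteCount)
        {
            int n = stream.Read(buffer, read, byteCount - read);
            if (n == 0)
                break; // end of stream reached before the whole word was read
            read += n;
        }
        return encoding.GetString(buffer, 0, read);
    }
}
```

With the sample text "This is a sample text file" stored as UTF-8 without a byte order mark, calling ReadWordAt with offset 10 and the word "sample" should give back "sample". If it returns something else, the offset bookkeeping (line terminators, multi-byte characters, byte order mark) is the first thing to check.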
If you can use a {line number, word number in the line} pair as the position, then it is very easy to compute in your code: just count the lines and, for each line, count the words.
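A minimal sketch of that counting approach could look like the following. The names LineWordIndex and Build are made up for illustration, both numbers are 1-based by assumption, and words are assumed to be separated by spaces or tabs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper illustrating the {line number, word number} idea.
public static class LineWordIndex
{
    // Returns (line number, word number in the line, word) triples, both numbers 1-based.
    public static List<Tuple<int, int, string>> Build(TextReader reader)
    {
        var index = new List<Tuple<int, int, string>>();
        int lineNumber = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            // Split on spaces and tabs; drop empty entries caused by repeated separators.
            string[] words = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            for (int i = 0; i < words.Length; i++)
            {
                index.Add(Tuple.Create(lineNumber, i + 1, words[i]));
            }
        }
        return index;
    }
}
```

Passing the StreamReader from the question (or a StringReader for a quick experiment) to Build gives one entry per word. ReadLine is fine here because the position no longer depends on byte counts or on how the line was terminated.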