I have a large text file that I need to searc开发者_开发问答h for a specific string. Is there a fast way to do this without reading line by line?
This method is extremely slow because of the size of the files (more than 100 MB).
Given the size of the files would you really want to read them entirely into memory beforehand? Line by line is likely to be the best approach here.
Here is my solution that uses a stream to read in one character at a time. I created a custom class to search for the value one character at a time until the entire value is found.
I ran some tests with a 100MB file saved on a network drive and the speed was totally dependent on how fast it could read in the file. If the file was buffered in Windows a search of the entire file took less than 3 seconds. Otherwise it could take anywhere from 7 seconds to 60 seconds, depending on network speed.
The search itself took less than a second if run against a String in memory and there were no matching characters. If a lot of the leading characters found matches the search could take a lot longer.
public static int FindInFile(string fileName, string value)
{ // returns complement of number of characters in file if not found
// else returns index where value found
int index = 0;
using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName))
{
if (String.IsNullOrEmpty(value))
return 0;
StringSearch valueSearch = new StringSearch(value);
int readChar;
while ((readChar = reader.Read()) >= 0)
{
++index;
if (valueSearch.Found(readChar))
return index - value.Length;
}
}
return ~index;
}
public class StringSearch
{ // Call Found one character at a time until string found
private readonly string value;
private readonly List<int> indexList = new List<int>();
public StringSearch(string value)
{
this.value = value;
}
public bool Found(int nextChar)
{
for (int index = 0; index < indexList.Count; )
{
int valueIndex = indexList[index];
if (value[valueIndex] == nextChar)
{
++valueIndex;
if (valueIndex == value.Length)
{
indexList[index] = indexList[indexList.Count - 1];
indexList.RemoveAt(indexList.Count - 1);
return true;
}
else
{
indexList[index] = valueIndex;
++index;
}
}
else
{ // next char does not match
indexList[index] = indexList[indexList.Count - 1];
indexList.RemoveAt(indexList.Count - 1);
}
}
if (value[0] == nextChar)
{
if (value.Length == 1)
return true;
indexList.Add(1);
}
return false;
}
public void Reset()
{
indexList.Clear();
}
}
In all cases, you will have to go over all of the file.
Lookup Rabin-Karp string search or similar.
The fastest method for searching is the Boyer-Moore algorithm. This method does not require to read all bytes from the files, but require random access to bytes. Also, this method is simple in realization.
Is your project needing to search different files for the same or different string every time, or searching the same file for different strings every time?
If it's the latter, you could build an index of the file. But there's no point doing this if the file changes frequently, because building the index will be expensive.
To index a file for full text searching, you could use the Lucene.NET library.
http://incubator.apache.org/lucene.net/
You should be able to read the file character by character matching each character in the search string until you reach the end of the search string in which case you have a match. If at any point the character you've read doesn't match the character you're looking for, reset the matched count to 0 and start again. For example (****pseudocode/not tested****):
byte[] lookingFor = System.Text.Encoding.UTF8.GetBytes("hello world");
int index = 0;
int position = 0;
bool matchFound = false;
using (FileStream fileStream = new FileStream(fileName, FileMode.Open))
{
while (fileStream.ReadByte() == lookingFor[index])
{
index++;
if (index == lookingFor.length)
{
matchFound = true;
position = File.position - lookingFor.length;
break;
}
}
}
That is one of many algorithms you could use (although it may be off by one with the length check). It will only find the first match so you probably want to wrap the while loop in another loop to find multiple matches.
Also, one thing to note about reading the file line by line is that if the desired string to match spans lines you're not going to find it. If that's fine then you can search line by line but if you need search strings to span lines you'll want to use an algorithm like I detailed above.
Finally, if you're looking for best speed, which it sounds like you are, you'll want to migrate the code above to use a StreamReader or some other buffered reader.
As Wayne Cornish already said: Reading line by line might be the best approach.
If you read for instance the entire file into a string and then search with a regular expression, it might be more elegant, but you'll create a large string object.
These kind of objects might cause problems, because they will be stored on the Large Object Heap (LOH, for objects above 85.000 bytes). If you parse many of these big files and your memory is limited (x86), you are likely to run into LOH fragmentation problems.
=> Better read line by line if you parse many big files!
Here's a simple one-function solution reading character by character. Worked fine for me.
/// <summary>
/// Find <paramref name="toFind"/> in <paramref name="reader"/>.
/// </summary>
/// <param name="reader">The <see cref="TextReader"/> to find <paramref name="toFind"/> in.</param>
/// <param name="toFind">The string to find.</param>
/// <returns>Position within <paramref name="reader"/> where <paramref name="toFind"/> starts or -1 if not found.</returns>
/// <exception cref="ArgumentNullException">When <paramref name="reader"/> is null.</exception>
/// <exception cref="ArgumentException">When <paramref name="toFind"/> is null or empty.</exception>
public int FindString(TextReader reader, string toFind)
{
if(reader == null)
throw new ArgumentNullException("reader");
if(string.IsNullOrEmpty(toFind))
throw new ArgumentException("String to find may not be null or empty.");
int charsRead = -1;
int pos = 0;
int chr;
do
{
charsRead++;
chr = reader.Read();
pos = chr == toFind[pos] ? pos + 1 : 0;
}
while(chr >= 0 && pos < toFind.Length);
int result = chr < 0 ? -1 : charsRead - toFind.Length;
return result < 0 ? -1 : result;
}
Hope that helps.
You could buffer a large amount of data from the file into memory at one time, up to whatever constraint you wish, and then search it for the string.
This would have the effect of reducing the number of reads on the file and would likely be a faster method, but it would be more of a memory hog if you set the buffer size too high.
If you want to speed up the line-by-line reading you can create a queue-based application:
One thread reads the lines and enqeues them into a threadsafe Queue. A second one can then process the strings
I have a large text file that I need to search for a specific string. Is there a fast way to do this without reading line by line?
The only way to avoid searching across the entire file is to sort or organize the input beforehand. For example, if this is an XML file and you need to do many of these searches, it would make sense to parse the XML file into a DOM tree. Or if this is a list of words and you're looking for all the words which start with the letters "aero", it might make sense to sort the entire input first if you do a lot of that kind of searching on the same file.
If you're only looking for a specific string, I'd say line-by-line is the best and most efficient mechanism. On the other hand, if you're going to be looking for multiple strings, particularly at several different points in the application, you might want to look into Lucene.Net to create an index and then query the index. If this is a one-off run (i.e., you won't need to query the same file again later), you can create the index in a temporary file that will be cleaned up automatically by the system (usually boot time; or you can delete it yourself when your program exits). If you need to search the same file again later, you can save the index in a known location and get much better performance the second time around.
Stick it into SQL Server 2005/2008 and use its full-text search capability.
The speed issue here could well be the speed taken to load the file into memory before performing the search. Try profiling your application to see where the bottleneck is. If it is loading the file you could try "chunking" the file load so that the file is streamed in small chunks and each chunk has the search performed on it.
Obviously if the part of the string to be found is at the end of the file there will be no performance gain.
精彩评论