Possible Duplicate:
C++ Search Performance
What I have is two text files. One contains a list of roughly 70,000 names (~1.5MB). The other contains text which will be obtained from miscellaneous sources. That is, this file's contents will change each time the program is executed (~0.5MB). Essentially, I want to be able to paste some text into a text file and see which names my list are found. Kind of like the find function (CTR + F) but with 70,000 keywords.
In any case, what I have thus far is:
int main()
{
ifstream namesfile("names.txt"); //names list
ifstream miscfile("misc.txt"); //misc text
vector<string> vecnames; //vector to hold names
vector<string> vecmisc; //vector to hold misc text
size_t found;
string s;
string t;
while (getline(namesfile,s))
vecnames.push_back(s);
while (getline(miscfile,t))
vecmisc.push_back(t);
//outer loop iterates through names list
for (vector<string>::size_type i = 0; i != vecnames.size(); ++i) {
//inner loop iterates 开发者_C百科through the lines of the mist text file
for (vector<string>::size_type j = 0;j != vecmisc.size(); ++j) {
found=vecmisc[j].find(vecnames[i]);
if (found!=string::npos) {
cout << vecnames[i] << endl;
break;
}
}
}
cout << "SEARCH COMPLETE";
//to keep console application from exiting
getchar();
return 0;
}
Now this works great as far as extracting the data I need, however, it is terribly slow and obviously inefficient since each name requires that I potentially search the entire file again which gives (75000 x # of lines in misc text file) iterations. If anyone could help, I would certainly appreciate it. Some sample code is most welcomed. Additionally, I'm using Dev C++ if that makes any difference.
It has been suggested that I implement a hash set on my data, however, I have no idea how to go about this. If anyone understands how I could apply this method I'd appreciate a start in the right direction. Sincere thanks.
You could construct a trie from all the names and mark the nodes that are endpoints so you know when you have a match (or you could wait for a mismatch and emit the substring up to that point from the end of the last match). Then you try to match the input to the trie, one char at a time, and you should have O(n) performance.
trieRoot = preprocessedListOfNames
trieCursor = trieRoot
for each character in text
if character in trieCursor.neighbors
trieCursor = trieCursor.neighbors[character]
else
if matchSize > 1 and trieCursor.isEndpoint
emit match
trieCursor = trieRoot
If the name list is relatively static you could even pre-process it and store it so you don't have to construct it each time you want to do a search.
Change vecnames
from a vector to a set. Change its call to push_back to insert. Then, instead of looping over it, just loop over vecmisc
and call vecnames.find(...)
to check if each input is one of the names. This will turn your O(n m) system into O(n log m). You could also use hash_set and achieve O(n) (which may or may not be much faster in practice).
You might be better off reading the bigger file into memory; qsort()ing it; and then reading the second file line by line and bsearch()ing on each entry in that second file.
You could use a map/Associative-Array Data Structure from STL. A map doesn't necessarily store data in a linear fashion and hence a find operation generally takes less than linear time - ie - < O(n) .
For your case you can use a map of type - map<string,bool>
.
example usage - http://www.cprogramming.com/tutorial/stl/stlmap.html
Replace vector<string> vecmisc;
with map<string,bool> vecmisc
.
for (vector<string>::size_type i = 0; i != vecnames.size(); ++i) {
// No inner loop needed
found=vecmisc.find(vecnames[i]);
if (found!=string::npos) {
cout << vecnames[i] << endl;
break;
}
}
A number of slight improvements.
Benchmark before: 7 minutes 56 second and counting (will update)
Update: finally finished in 15m25s, yielding a performance increase of roughly 3000 x
Benchmark after: 0.3? seconds (see below for updated figures)
Code:
#include <set>
#include <string>
#include <iostream>
#include <iterator>
#include <fstream>
template <class It> It readInto(std::istream& is, It OI);
std::set<std::string> readnames(const std::string& filename)
{
std::string s;
std::set<std::string> result;
std::ifstream namesfile(filename.c_str()); //names list
readInto(namesfile, std::inserter(result, result.end()));
return result;
}
int main()
{
std::set<std::string> vecnames = readnames("names.txt");
//inner loop iterates through the lines of the mist text file
std::ifstream miscfile("misc.txt"); //misc text
std::string line;
while (std::getline(miscfile, line))
if (vecnames.end() != vecnames.find(line))
std::cout << line << std::endl;
return 0;
}
// helper to read linewise into output iterator
template <class It> It readInto(std::istream& is, It OI)
{
std::string line;
while (std::getline(is, line))
{
if (line.size()>0) // TODO you may want to trim/normalize these
OI++ = line;
}
return OI;
}
Data:
$ cp /etc/dictionaries-common/words names.txt $ $ wc names.txt misc.txt 98569 98568 931708 names.txt 166634 529910 4283592 misc.txt
This results in 151486 lines of output (which contain 3968 uniqe values, when inspected:)
$ ./t2 | wc -l
859
$ ./t2 | sort -u | wc -l
2
Because that is a pretty skewed benchmark, I benchmarked the other extreme as well:
$ cp names.txt misc.txt
$ time ./t | wc -l
98568
real 0m0.365s
user 0m0.372s
sys 0m0.228s
Tests performed with optimized builds on i386 (32bit) g++ 4.4.5, redirecting stdout to /dev/null and removing the getchar()
call at the end
精彩评论