开发者

C++ Search Performance [duplicate]

开发者 https://www.devze.com 2023-03-08 01:30 出处:网络
This question already has answers here: Closed 11 years ago. Possible Duplicate: C++ Search Performance What I have is two text files. One contains a list of roughly 70,000 names (~1.5MB
This question already has answers here: Closed 11 years ago.

Possible Duplicate:

C++ Search Performance

What I have is two text files. One contains a list of roughly 70,000 names (~1.5MB). The other contains text which will be obtained from miscellaneous sources. That is, this file's contents will change each time the program is executed (~0.5MB). Essentially, I want to be able to paste some text into a text file and see which names my list are found. Kind of like the find function (CTR + F) but with 70,000 keywords.

In any case, what I have thus far is:

 int main()
 {

 ifstream namesfile("names.txt");   //names list
 ifstream miscfile("misc.txt");     //misc text
 vector<string> vecnames;           //vector to hold names
 vector<string> vecmisc;            //vector to hold misc text
 size_t found;

 string s;
 string t;

 while (getline(namesfile,s))       
     vecnames.push_back(s);  

 while (getline(miscfile,t))        
     vecmisc.push_back(t);

 //outer loop iterates through names list
 for (vector<string>::size_type i = 0; i != vecnames.size(); ++i) {
     //inner loop iterates 开发者_C百科through the lines of the mist text file
     for (vector<string>::size_type j = 0;j != vecmisc.size(); ++j) {
         found=vecmisc[j].find(vecnames[i]);
         if (found!=string::npos) {
             cout << vecnames[i] << endl;
             break;
         }
     }
 }

 cout << "SEARCH COMPLETE";

 //to keep console application from exiting
 getchar();

 return 0;
 }

Now this works great as far as extracting the data I need, however, it is terribly slow and obviously inefficient since each name requires that I potentially search the entire file again which gives (75000 x # of lines in misc text file) iterations. If anyone could help, I would certainly appreciate it. Some sample code is most welcomed. Additionally, I'm using Dev C++ if that makes any difference.

It has been suggested that I implement a hash set on my data, however, I have no idea how to go about this. If anyone understands how I could apply this method I'd appreciate a start in the right direction. Sincere thanks.


You could construct a trie from all the names and mark the nodes that are endpoints so you know when you have a match (or you could wait for a mismatch and emit the substring up to that point from the end of the last match). Then you try to match the input to the trie, one char at a time, and you should have O(n) performance.

trieRoot = preprocessedListOfNames

trieCursor = trieRoot
for each character in text
    if character in trieCursor.neighbors
        trieCursor = trieCursor.neighbors[character]
    else
        if matchSize > 1 and trieCursor.isEndpoint
            emit match
        trieCursor = trieRoot

If the name list is relatively static you could even pre-process it and store it so you don't have to construct it each time you want to do a search.


Change vecnames from a vector to a set. Change its call to push_back to insert. Then, instead of looping over it, just loop over vecmisc and call vecnames.find(...) to check if each input is one of the names. This will turn your O(n m) system into O(n log m). You could also use hash_set and achieve O(n) (which may or may not be much faster in practice).


You might be better off reading the bigger file into memory; qsort()ing it; and then reading the second file line by line and bsearch()ing on each entry in that second file.


You could use a map/Associative-Array Data Structure from STL. A map doesn't necessarily store data in a linear fashion and hence a find operation generally takes less than linear time - ie - < O(n) .

For your case you can use a map of type - map<string,bool>. example usage - http://www.cprogramming.com/tutorial/stl/stlmap.html

Replace vector<string> vecmisc; with map<string,bool> vecmisc.

for (vector<string>::size_type i = 0; i != vecnames.size(); ++i) {
 // No inner loop needed

     found=vecmisc.find(vecnames[i]);
     if (found!=string::npos) {
         cout << vecnames[i] << endl;
         break;
     }

 }


A number of slight improvements.

Benchmark before: 7 minutes 56 second and counting (will update) Update: finally finished in 15m25s, yielding a performance increase of roughly 3000 x

Benchmark after: 0.3? seconds (see below for updated figures)

Code:

#include <set>
#include <string>
#include <iostream>
#include <iterator>
#include <fstream>

template <class It> It readInto(std::istream& is, It OI);

std::set<std::string> readnames(const std::string& filename)
{
    std::string s;
    std::set<std::string> result;

    std::ifstream namesfile(filename.c_str());   //names list
    readInto(namesfile, std::inserter(result, result.end()));
    return result;
}

int main()
{
    std::set<std::string> vecnames = readnames("names.txt");

    //inner loop iterates through the lines of the mist text file
    std::ifstream miscfile("misc.txt");     //misc text
    std::string line;
    while (std::getline(miscfile, line))
        if (vecnames.end() != vecnames.find(line))
            std::cout << line << std::endl;

    return 0;
}

// helper to read linewise into output iterator
template <class It> It readInto(std::istream& is, It OI)
{
    std::string line;
    while (std::getline(is, line))
    {
        if (line.size()>0) // TODO you may want to trim/normalize these
            OI++ = line;
    }
    return OI;
}

Data:

$ cp /etc/dictionaries-common/words names.txt $ $ wc names.txt misc.txt 98569 98568 931708 names.txt 166634 529910 4283592 misc.txt

This results in 151486 lines of output (which contain 3968 uniqe values, when inspected:)

$ ./t2 | wc -l
859

$ ./t2 | sort -u | wc -l
2

Because that is a pretty skewed benchmark, I benchmarked the other extreme as well:

$ cp names.txt misc.txt
$ time ./t | wc -l
98568

real    0m0.365s
user    0m0.372s
sys 0m0.228s

Tests performed with optimized builds on i386 (32bit) g++ 4.4.5, redirecting stdout to /dev/null and removing the getchar() call at the end

0

精彩评论

暂无评论...
验证码 换一张
取 消