开发者

algorithm to match prefix and name to a list of names

开发者 https://www.devze.com 2023-01-11 19:40 出处:网络
I have a std::vector<std::string> of all the files in a directory: // fileList folder/file1 folder/file2

I have a std::vector<std::string> of all the files in a directory:

// fileList
folder/file1
folder/file2
file3
file4.ext

and a std::set<std::string> of filenames and the same for all used folder prefixes:

// set1
file2
file4.ext

// set2
folder

I need to generate the full (relative) paths to the ALL files in set1, but see no way of doing that without iterating over set2 set1.size() times, multiplied by fileList.size()

UPDATE: some clarification:

Expected output 开发者_运维问答for above example:

folder/file2
file4.ext

Proposed (inefficient?) solution, maybe too verbose and with stupid implementation:

// pseudo-code!
vector<string> allpossibleFullPaths( set1.size()*set2.size() );
vector<string> output;
foreach( prefix_in_set2 )
    foreach( filename_in_set1 )
        allpossibleFullpaths.push_back( set2[i] + "/" set1[i] )

foreach( filename_in_fileList )
    files.push_back( find( fileList[i] in allpossibleFullPaths ) );

(fast pseudocode-ish) This seems very innefficient, is there a better way to make these matches?

Thanks!

PS: better still would be a way to keep track of doubles, so that I can warn the user about that.


One area which you are not clear about is this:

  • Given set1 & set2 as described above, what if fileList had "file4.ext" and "folder\file4.ext". Would you want both? Or is the list of file in set1 guaranteed to be unique?

Assuming that you'd want both, pseudo-code:

 foreach(pathname in fileList)
    separate pathname into path & filename.
    if path is not empty, but not in set2, skip to next pathname.
    if filename is in set1, output pathname.

Since a set lookup should be O(1), total complexity is O(2 * fileList.Length)

If the filenames in set1 are unique, you can count the number of pathnames output, and exit early when set1.Length is reached.

It may seem counter-intuitive to step through the longest collection, but it also has the slowest lookup, so operations on fileList have to be minimized.

UPDATE: Here's the full working C++ code (includes & usings elided)

void ListFiles()
{
    vector<string> fileList;
    fileList.push_back("folder/file1");
    fileList.push_back("folder/file2");
    fileList.push_back("file3");
    fileList.push_back("file4.ext");

    set<string> set1;
    set1.insert("file2");
    set1.insert("file4.ext");

    set<string> set2;
    set2.insert("folder");

    for(vector<string>::iterator iter = fileList.begin();
        iter != fileList.end();
        ++iter)
    {
        string pathname = *iter;
        string filename;
        string path;
        size_t pos = pathname.find('/');
        if (pos == string::npos || pos == 0)
            filename = pathname;
        else
        {
            path = pathname.substr(0, pos);
            if (set2.find(path) == set2.end())
                continue;
            filename = pathname.substr(pos+1);
        }
        if (set1.find(filename) != set1.end())
            cout << pathname << endl;
    }

}


Simple: iterate over fileList once, generate the prefix (set 2 entry) and file name (set 1 entry), and check if they are in their respective sets. If both are, you have a match, so return it; otherwise, return nothing for that item.

Also, this handles the 'doubles' problem you mention.


Just use a helper hash-table to get a runtime of set1.size() + fileList.size()

The Pseudo-Code:

unordered_set<string, list<string> > hash;
foreach (i in fileList):
  (fprex, fname) = split(i)
  hash[fname].push_back(fprex)
foreach (j in set1):
  a = hash.contains(j)
  if (a != hash.end())
    foreach(k in a)
       print k +'/' + j;

Or somethng like that. unordered_set is available in Boost (or tr1), and a insert/lookup operation is in O(1).


Your expected results look like you are searching for suffixes in FileList that match rows in set1 and set2 is immaterial.

The size of set2 decides which way to go for actual matching. If it's reasonably small you can turn it into a regex and either add regex anchors to match the end of string or pre-process FileList (by extracting just file name but also keeping the original string for the result). You can also reverse strings in both lists so that it becomes prefix matching indeed.

If set2 is big then you need to build hash-table from it and in this case you do need to pre-process FileList to extract file names as "keys" that you'll try to "find" in hash table. Make sure you handle case sensitivity if that's a potential issue (like converting all keys to upper case). With that in place just print out every row in FileList for which it's key is present in the hash table build from set 1.

If set 2 does have some significance (in which case your expected result is wrong) then that's the second pass to filter the results from the first pass - with another hash table for the 2nd filter.

0

精彩评论

暂无评论...
验证码 换一张
取 消