What is the better way to search millions of file names with wildcard (glob) support?

Developer https://www.devze.com 2022-12-19 04:36 Source: Internet
I am working on a small search engine that displays matching file names with their full paths. The important requirement is wildcard (glob) search, such as *.doc, *list*.xlx, *timesheet*, or ???.doc.

I found a related solution:

Search for strings matching the pattern "abc:*:xyz" in less than O(n)

But I am looking for an efficient algorithm that can find matches among a million file names in less than a second, so something better than O(n) is required.

I am thinking of a two-phase algorithm: a substring search using arrays (suffix array + prefix array) in the first phase, followed by an ordinary regex search through the first phase's results in the second phase.

Any help would be greatly appreciated.


Check out self-indexing: this Stack Overflow question, and this DrDobbs article on it.


As far as I know there is no way to do better than O(n) for generalized glob searching.

However, for the special cases of prefix and suffix search you can build sorted indexes and binary-search them, which locates the matching range in O(log n) (plus time proportional to the number of matches you read out). The prefix index would be sorted on the first character, then the second, and so on. The suffix index would be sorted on the last character, then the second last, and so on (that is, sorted on the reversed strings).
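As a rough illustration of the two indexes, here is a minimal Python sketch. The function names (`build_indexes`, `prefix_range`, `find_by_suffix`) are made up for this example, and it assumes the whole list of names fits in memory.

```python
import bisect

def build_indexes(names):
    """Return (prefix_index, suffix_index) as sorted lists."""
    prefix_index = sorted(names)
    # Reversed names turn a suffix query into a prefix query.
    suffix_index = sorted(name[::-1] for name in names)
    return prefix_index, suffix_index

def prefix_range(sorted_keys, prefix):
    """Binary-search the first candidate, then collect the contiguous run."""
    i = bisect.bisect_left(sorted_keys, prefix)
    found = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        found.append(sorted_keys[i])
        i += 1
    return found

def find_by_suffix(suffix_index, suffix):
    """Look up the reversed suffix, then un-reverse the hits."""
    return [key[::-1] for key in prefix_range(suffix_index, suffix[::-1])]

names = ["report.doc", "list.xls", "readme.txt", "notes.doc"]
prefix_index, suffix_index = build_indexes(names)
print(prefix_range(prefix_index, "re"))
print(find_by_suffix(suffix_index, ".doc"))
```

The binary search itself is O(log n); walking the run of matches adds time proportional to how many names match.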

I would do as you suggest in your post and search in two phases: first narrow the list using the prefix or suffix index, then run a brute-force search over the reduced list using a regex built from the glob.
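The two phases could look roughly like the Python sketch below. `GlobIndex`, `literal_ends` and `scan` are hypothetical names for this example. It uses the standard library's `fnmatch.translate` to turn the glob into a regex, and it picks whichever literal end of the pattern (prefix or suffix) is longer for the first phase. A pattern such as *timesheet* has no literal end, so it still falls back to a full O(n) scan.

```python
import bisect
import fnmatch
import re

def literal_ends(pattern):
    """Return the literal text before the first and after the last wildcard."""
    first = next((i for i, ch in enumerate(pattern) if ch in "*?["), len(pattern))
    last = max((i for i, ch in enumerate(pattern) if ch in "*?[]"), default=-1)
    return pattern[:first], pattern[last + 1:]

def scan(sorted_keys, prefix):
    """Yield every key that starts with prefix, located by binary search."""
    i = bisect.bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        yield sorted_keys[i]
        i += 1

class GlobIndex:
    def __init__(self, names):
        self.forward = sorted(names)                           # prefix index
        self.backward = sorted(name[::-1] for name in names)   # suffix index

    def search(self, pattern):
        prefix, suffix = literal_ends(pattern)
        regex = re.compile(fnmatch.translate(pattern))
        # Phase 1: narrow the candidates with the more selective index.
        if len(suffix) > len(prefix):
            candidates = (key[::-1] for key in scan(self.backward, suffix[::-1]))
        else:
            candidates = scan(self.forward, prefix)
        # Phase 2: brute-force regex check over the reduced list.
        return [name for name in candidates if regex.match(name)]

index = GlobIndex(["timesheet_jan.xls", "list_a.xls", "abc.doc", "xyz.doc"])
print(index.search("*.doc"))
print(index.search("list*"))
```

Matching here is case-sensitive; normalising case at index time would be a separate design decision.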

Since string length comparisons are faster than regexes, I would also pre-filter on the minimum matching string length, or on the exact length when the glob fixes it, as in the ???.doc example.
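A length pre-filter might be sketched as follows in Python. The helper names are invented for this example, and it assumes the glob uses only `*` and `?` (no `[...]` character classes), as in the patterns from the question.

```python
def length_bounds(pattern):
    """Return (min_len, max_len); max_len is None when '*' allows any length.

    Assumes the pattern contains only '*' and '?' wildcards.
    """
    # Every character except '*' consumes exactly one character of the name.
    min_len = len(pattern) - pattern.count("*")
    max_len = None if "*" in pattern else min_len
    return min_len, max_len

def passes_length_filter(name, pattern):
    """Cheap check to run before the regex."""
    min_len, max_len = length_bounds(pattern)
    if len(name) < min_len:
        return False
    return max_len is None or len(name) <= max_len

print(length_bounds("???.doc"))      # fixed length: (7, 7)
print(length_bounds("*list*.xlx"))   # minimum only: (8, None)
```

If names are also grouped by length in the index, a fixed-length glob such as ???.doc only needs to look at one group.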

From the sound of your original post the indexes would need to refer to the full path of each entry as well, so that you can display it after the final results are found.
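One way to carry the full path along is to store (file name, full path) pairs in the sorted index, as in this small Python sketch. `build_name_index` and `paths_with_prefix` are hypothetical names, and the paths are made-up examples.

```python
import bisect
import os

def build_name_index(paths):
    """Sorted (file name, full path) pairs, keyed on the bare file name."""
    return sorted((os.path.basename(path), path) for path in paths)

def paths_with_prefix(index, prefix):
    """Return the full paths of all entries whose file name starts with prefix."""
    # The 1-tuple (prefix,) sorts before every (name, path) whose name >= prefix.
    i = bisect.bisect_left(index, (prefix,))
    results = []
    while i < len(index) and index[i][0].startswith(prefix):
        results.append(index[i][1])
        i += 1
    return results

paths = ["/home/a/report.doc", "/srv/b/report.doc", "/home/a/list.xls"]
index = build_name_index(paths)
print(paths_with_prefix(index, "report"))
```

Keying on the bare file name also lets the same name appear under several directories without any special handling.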
