Search engines and databases allow you to use consecutive string search (such as "this is a test"
), which matches this is a test that will match
, but won't match test this is a
.
I know that some databases have built-in features that allow you to use the same functionality without writing a single line of code (e.g. MySQL's full text search). That's not the kind of answer I am looking for.
What I want to know is what kind of algorithm and database structures are used to enable this fast searching of strings.
What would the indexed table look like given the above example? Would it be something similar to this?
// IndexedItemID | Position | Word
1 | 0 | this开发者_运维技巧
1 | 1 | is
1 | 2 | a
1 | 3 | test
1 | 4 | that
1 | 5 | will
1 | 6 | match
2 | 0 | test
2 | 1 | this
2 | 2 | is
2 | 3 | a
Now that there are indexed items, how do you efficiently create a SQL statement that matches those items?
Here's one example I can think of:
select IndexedItemID form
(select IndexedItemID, Position from indexedWords where Word = "this") as word1Position
where
exists(select * from indexedWords where IndexedItemID = word1Position.IndexedItemID AND Word = "is" AND Position = word1Position.Position + 1)
AND exists(select * from indexedWords where IndexedItemID = word1Position.IndexedItemID AND Word = "a" AND Position = word1Position.Position + 2)
AND exists(select * from indexedWords where IndexedItemID = word1Position.IndexedItemID AND Word = "test" AND Position = word1Position.Position + 3)
I'm sure there is probably a more standardized way that's more efficient.
You probably want to look at Trie. They are very efficient in scenarios like this, but consumes a lot of memory to store the whole structure.
What you want is sorted inverted index of words from your document. Basically if your text is
"Here is an example sentence. This is how you index things" you turn this into:
Here: 1
is: 2, 7
an: 3
example: 4
......
......
Then when you are searching for a sequence of words, you lookup the list of positions for each word. Now you want to walk the list of sorted positions simultaneously, as if you were trying to merge the lists. while merging the lists it would be easy to spot anywhere where the list of words occur in the exact sequence you want them to.
I am not sure how the sql database would narrow down it's search but eventually it will come down to a string matching.
When you have a target string and a pattern string, the simplest way to do the comparison is to start at the beginning of the target string and try matching it with the pattern string character-by-character. If the match fails, you advance to the next character in the target string and repeat the above step. This is obviously inefficient because the complexity is of the order O(m*n) where m is the number of characters in the pattern string and n is the number of characters in the target string.
There is an algorithm called Rabin-Karp algorithm that can perform this search in O(m+n) using hashing.
Of course, mysql could have computed hashes that would would help reduce the number of target strings.
精彩评论